Identifying Structural Changes in Genome with Short-Read Sequencing Data (12/12/2015)


Sat, Dec 12, 2015 at 8:42 AM

Customer: We plan to obtain 12 yeast samples that we expect will develop a reversible structural change somewhere (unknown size, location, etc.). We will obtain MiSeq paired-end reads from each at high coverage.

Would you be able to assemble the genomes of these strains de novo? Again, we are looking for a structural change, so de novo assembly is required. We expect the genomes may alternate with respect to a gene conversion of some type, or a DNA "flip", but a SNP is not out of the question, just less likely.

Sat, Dec 12, 2015 at 3:57 PM

AccuraScience LB: This is an exciting project. If the de novo assembly approach is to be taken, running the computational tools (e.g., AllPaths and SOAPdenovo) to obtain the assemblies is the easier part of the work. Examining the assemblies manually to identify points of misassembly would be the more difficult (and labor-intensive) part. All genome assembly tools will produce errors, and your project has very low tolerance for these errors, which is what makes it special.

I am curious whether you have considered the alternative approach: mapping the reads to the reference genome, then identifying structural variants from changes in read depth, informative junction reads (linking distant regions together), and/or abnormal distances between the two reads of a pair (implying insertions/deletions)?

Sat, Dec 12, 2015 at 10:44 PM

Customer: We considered trying to map to the known reference sequence, but were informed that the Illumina technology will simply trash any reads that don't fit, except for SNPs. This is why we felt that de novo assembly would be essential. We felt that the inherent error rate in whole genome sequencing could be eliminated by 100-fold coverage, or more.

Do you think the comparison to a reference genome is sufficient to identify structural changes - or perhaps a combination of both approaches (de novo and comparison to the reference) would be important?

Sun, Dec 13, 2015 at 10:03 AM

AccuraScience LB: The simplistic, "default" mapping-based analysis pipeline looks at SNPs only, but more "advanced" mapping-based analysis pipelines can identify structural variants (SVs). There are 3 strategies in total, which I briefly went over in my last email. Let me explain them a little more: (1) Looking at changes in read depth: if a region has been duplicated (inserted as an extra copy), the corresponding genomic region will show higher read depth. Similarly, if a deletion has happened, we would expect lower read depth in the corresponding genomic region. (2) Looking at junction reads, that is, reads of which one part maps to one genomic location and another part maps to a different genomic location. Such reads indicate that an SV (e.g., a translocation, duplication or inversion) has taken place. (3) Looking at the distance between the locations to which the two reads of a paired-end read pair map. If the distance is substantially different from what is expected, that indicates that an SV (e.g., an insertion or deletion) has taken place.
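The paired-end signal in strategy (3) can be illustrated with a minimal Python sketch. It assumes standard Illumina paired-end libraries, where a normal pair maps with the left read on the forward strand and the right read on the reverse strand. The `ReadPair` record, the `classify_pair` function and the 3-standard-deviation cutoff are hypothetical names and values chosen for illustration only; a real analysis would take these fields from the alignment file and would normally rely on a dedicated SV caller.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record for one mapped read pair."""
    chrom1: str
    pos1: int        # leftmost mapped position of read 1
    reverse1: bool   # True if read 1 maps to the reverse strand
    chrom2: str
    pos2: int        # leftmost mapped position of read 2
    reverse2: bool   # True if read 2 maps to the reverse strand
    read_len: int


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Return 'concordant' or the kind of SV this pair hints at."""
    if pair.chrom1 != pair.chrom2:
        return "translocation_candidate"
    if pair.reverse1 == pair.reverse2:
        # Both reads on the same strand: one end may lie in a flipped segment.
        return "inversion_candidate"
    # Order the two reads by mapped position.
    if pair.pos1 <= pair.pos2:
        left_pos, left_rev, right_pos = pair.pos1, pair.reverse1, pair.pos2
    else:
        left_pos, left_rev, right_pos = pair.pos2, pair.reverse2, pair.pos1
    if left_rev:
        # Reads point away from each other (everted pair).
        return "tandem_duplication_candidate"
    # Distance spanned on the reference by the whole fragment.
    span = right_pos + pair.read_len - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion_candidate"    # sample lacks sequence the reference has
    if span < mean_insert - n_sd * sd_insert:
        return "insertion_candidate"   # sample has extra sequence
    return "concordant"


# Assumed library: mean fragment length 400 bp, standard deviation 40 bp.
normal = ReadPair("chrI", 1000, False, "chrI", 1250, True, 150)
far_apart = ReadPair("chrI", 1000, False, "chrI", 3250, True, 150)
same_strand = ReadPair("chrI", 1000, False, "chrI", 1250, False, 150)
for p in (normal, far_apart, same_strand):
    print(classify_pair(p, mean_insert=400, sd_insert=40))
```

A single discordant pair means little on its own; only a cluster of pairs pointing to the same breakpoints would count as evidence. The same-strand case is the one most relevant to a DNA "flip".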

These 3 strategies, together with the de novo assembly strategy that you raised, are the 4 general approaches to detecting SVs from short-read deep sequencing data.

These approaches are explained and illustrated in Figure 2 of this review article: http://www.ncbi.nlm.nih.gov/pubmed/21358748.

Finding SVs based on deep sequencing data is always a challenging task. It would be a good idea to try more than one of these approaches and cross-compare the results.
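Cross-comparing results from two approaches could look roughly like the following Python sketch. It assumes each approach yields a list of calls with a type, a chromosome and start/end coordinates; the dictionary layout, the function names and the 200 bp tolerance are all hypothetical and would need to be adapted to the actual output of the tools used.

```python
def breakpoints_match(call_a, call_b, tolerance=200):
    """True if two calls have the same type and chromosome and both
    breakpoints lie within `tolerance` bp of each other."""
    return (
        call_a["type"] == call_b["type"]
        and call_a["chrom"] == call_b["chrom"]
        and abs(call_a["start"] - call_b["start"]) <= tolerance
        and abs(call_a["end"] - call_b["end"]) <= tolerance
    )


def cross_compare(calls_a, calls_b, tolerance=200):
    """Split calls into those supported by both approaches and those
    seen by only one of them."""
    supported, only_a = [], []
    unmatched_b = list(calls_b)
    for a in calls_a:
        match = next(
            (b for b in unmatched_b if breakpoints_match(a, b, tolerance)),
            None,
        )
        if match is None:
            only_a.append(a)
        else:
            supported.append((a, match))
            unmatched_b.remove(match)
    return supported, only_a, unmatched_b


# Made-up calls, for illustration only.
paired_end_calls = [
    {"type": "inversion", "chrom": "chrIII", "start": 12000, "end": 15000},
    {"type": "deletion", "chrom": "chrV", "start": 80000, "end": 80500},
]
junction_read_calls = [
    {"type": "inversion", "chrom": "chrIII", "start": 12040, "end": 14980},
]

supported, only_pe, only_jr = cross_compare(paired_end_calls, junction_read_calls)
print(len(supported), len(only_pe), len(only_jr))
```

Calls supported by more than one approach would be the first candidates to examine; calls seen by only one approach are not necessarily wrong, but they deserve closer scrutiny.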

If I may make a suggestion, I would advise against choosing de novo assembly as the primary approach to try first, because, as discussed in my last email, this approach requires a lot of painstaking manual examination: once the number of differences between the assembly and the reference genome reaches the level of, say, 20-30, there is no other way but to look at the evidence manually to determine whether each difference is "real".


Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer's privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.