Genome Assembly for an RNA Virus (8/30/2014)


Sat, 08/30/2014

Customer says she has generated 28 million reads, which equates to about 20 Gb of data. The DNA was extracted from 80 stool samples, and the viral genome was amplified by performing overlapping rounds of PCR on an RNA virus. The library was made from the purified amplicons using the Nextera XT sample preparation kit (Illumina). She needs de novo assembly, as there is huge diversity in this RNA virus.

Sat, 08/30/2014 at 11:05 AM

AccuraScience LB: You've had 80 samples multiplexed together, amplified and sequenced on an Illumina sequencer (I assume it is a MiSeq). 20 Gb of data translates to an average depth of coverage of about 30,000X for each of the 80 samples, taking the genome size to be about 8,000 bases according to the literature. This coverage is good, even considering the high level of variation in amplification efficiency across samples and across regions.
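As a rough check of the coverage arithmetic, here is a minimal Python sketch. The function name mean_depth is only illustrative, and the calculation assumes that reads are spread evenly across samples and across the genome, which will not hold exactly for amplicon data.

```python
def mean_depth(total_bases, n_samples, genome_size):
    """Average per-sample depth, assuming bases are spread evenly
    across samples and across the genome (an idealization)."""
    return total_bases / (n_samples * genome_size)

# Figures taken from the inquiry: ~20 Gb of data, 80 samples, ~8,000-base genome.
total_bases = 20e9
n_samples = 80
genome_size = 8000

depth = mean_depth(total_bases, n_samples, genome_size)
print(f"Average depth per sample: about {depth:,.0f}X")
```

The result is 31,250X, which rounds to the 30,000X figure quoted above.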

De novo assembly of RNA virus genomes is challenging precisely because of this high level of variation in amplification efficiency, which leads to zigzagging depth of coverage. Established assembly methods that assume roughly constant depth of coverage (SOAPdenovo, ALLPATHS, Velvet and ABySS) could run into trouble. VICUNA (http://www.ncbi.nlm.nih.gov/pubmed/22974120), an assembler recently developed at the Broad Institute, was designed to address this difficulty. We would probably start with VICUNA, testing for the optimal settings for this tool. Meanwhile, we will try a few other tools, perhaps 1-2 that use an overlap-layout-consensus strategy and 1-2 that use de Bruijn graphs (such as those mentioned above), and benchmark VICUNA's performance against theirs.
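To make "zigzagging depth of coverage" concrete, here is a small sketch of how one might summarize unevenness from per-position depth values. The function depth_summary, the min_depth threshold and the toy depth lists are all made up for illustration; they are not taken from the customer's data or from any particular tool.

```python
from statistics import mean, pstdev

def depth_summary(depths, min_depth=10):
    """Summarize per-position depth: mean, coefficient of variation (CV),
    and the fraction of positions below min_depth (a hypothetical cutoff)."""
    mu = mean(depths)
    cv = pstdev(depths) / mu if mu else float("nan")
    frac_low = sum(1 for d in depths if d < min_depth) / len(depths)
    return {"mean": mu, "cv": cv, "frac_below_min": frac_low}

# Toy profiles with the same mean depth (illustrative numbers only).
even_profile = [100] * 8
zigzag_profile = [5, 400, 20, 300, 5, 50, 15, 5]

even = depth_summary(even_profile)
zigzag = depth_summary(zigzag_profile)
print("even:  ", even)
print("zigzag:", zigzag)
```

Both toy profiles have the same average depth, but the zigzag one has a high CV and several positions below the cutoff. This is the situation in which a high average depth can still leave poorly covered regions.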

We also need to figure out a way to present the comparison across the ~80 assembled genomes. Perhaps we will define one of them as a reference genome, and align all the others to it.
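One simple way to summarize such a comparison is percent identity of each assembly against the chosen reference. The sketch below assumes the sequences have already been aligned to the reference by some external aligner, so that each pair has equal length with "-" marking gaps. The function percent_identity, the sample names and the toy sequences are hypothetical.

```python
def percent_identity(ref_aln, qry_aln):
    """Percent identity between two pre-aligned sequences of equal length.
    Columns where both sequences have a gap are ignored; a gap in only
    one sequence counts as a mismatch."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r == "-" and q == "-":
            continue
        compared += 1
        if r == q:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Toy aligned sequences (illustrative only, not real viral genomes).
reference = "ACGTACGT"
assemblies = {
    "sample_02": "ACGTACGA",  # one substitution
    "sample_03": "ACG-ACGT",  # one deletion relative to the reference
}

identities = {name: percent_identity(reference, seq)
              for name, seq in assemblies.items()}
for name, pid in identities.items():
    print(f"{name}: {pid:.1f}% identity to reference")
```

A table of such values, one row per sample, would be one compact way to present the comparison. Treating single-sided gaps as mismatches is a design choice here, and other conventions are equally defensible.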

Other uncertainties in this project include those associated with sample preparation and sequencing data generation. In a published study similar to this one, 5 out of 28 samples did not yield full genome assemblies.


Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.