Mon, 09/15/2014
Mon, 09/15/2014 at 12:30 AM
Customer: We're interested in metagenomic analysis of different bacteria and viruses from wild animals. A reference genome isn't always available, but we need to strip the host material away and try to assemble contigs for viruses, fungi, and bacteria of interest.
Mon, 09/15/2014 at 5:43 PM
AccuraScience LB: (1) Although we can attempt to remove as many host reads as possible, it is important to understand that doing this computationally is challenging. If possible, effort should therefore be made to remove host contamination from the sample itself as thoroughly as you can. Selective filtration, centrifugation, and even flow cytometry-based methods could be attempted. We hope you have thought these through in your project design.
(2) We will always attempt to assemble the microbial genomes from the sequencing data, but it should be noted that (a) although some specialized assemblers have been developed for metagenomics purposes, e.g., MetaVelvet and Meta-IDBA, this task is intrinsically much more challenging than assembling a genome from a pure sample of a single species. Only in rare cases can a genome be assembled with quality comparable to that of a pure-sample assembly. Most of the time, the best we can expect is a number of truncated contigs. (b) For some purposes, e.g., quantification of gene expression related to particular functions, it would make more sense to use the original sequencing reads directly, rather than the assembled contigs.
Generally, there are two broad purposes for metagenomics studies: (a) to identify the taxonomic composition of the samples, and (b) to characterize the samples functionally, i.e., with a focus on particular types of enzyme function. We would like to know a little more about the objectives of your project. If it is more about taxonomic characterization, then it might be useful to do 16S rDNA analysis (for bacterial genomes; for viruses, I am not sure and will have to look up the literature).
Could you tell us what level of annotation is required for the assembled genomes/contigs? And what kind of binning or clustering (sequence-based or composition-based) would be needed in the pipeline developed?
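To make the computational host-read removal in point (1) more concrete, here is a minimal sketch in Python. It is an illustration only, not the pipeline we would deliver: it assumes a host (or related-species) reference sequence is available, and it flags a read as host-derived when more than a chosen fraction of its k-mers occur in that reference. The function names, the k-mer length, and the `max_host_fraction` threshold are all hypothetical choices made for this example. In practice, aligning reads to the host reference with a dedicated read aligner would be the more typical route.

```python
from typing import Iterable, Iterator, List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every k-mer of seq, skipping k-mers that contain an N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def build_host_index(host_seqs: Iterable[str], k: int = 21) -> Set[str]:
    """Collect all k-mers from both strands of the host sequences."""
    index: Set[str] = set()
    for seq in host_seqs:
        index.update(kmers(seq, k))
        index.update(kmers(reverse_complement(seq), k))
    return index


def host_fraction(read: str, index: Set[str], k: int = 21) -> float:
    """Fraction of the read's k-mers that are found in the host index."""
    read_kmers = list(kmers(read, k))
    if not read_kmers:
        return 0.0  # read shorter than k, or all N: no evidence either way
    hits = sum(1 for kmer in read_kmers if kmer in index)
    return hits / len(read_kmers)


def filter_reads(
    reads: Iterable[Tuple[str, str]],
    index: Set[str],
    k: int = 21,
    max_host_fraction: float = 0.5,  # hypothetical threshold, tune on real data
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split (name, sequence) reads into (kept, removed_as_host)."""
    kept, removed = [], []
    for name, seq in reads:
        if host_fraction(seq, index, k) > max_host_fraction:
            removed.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, removed
```

A sketch like this also shows why the task is hard: when only a related species is available as the reference, fewer host k-mers match exactly, so the threshold has to be loosened and the risk of discarding genuine microbial reads goes up.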
Tue, 09/16/2014 at 8:45 PM
Customer: Regarding removal of host reads, we've done filtration and differential centrifugation, but still end up with a fair amount of host material.
For the bacterial side, we can bin by 16S, but there isn't a great approach for viruses (that I've come across). There have been a couple of issues. One is that our library prep appeared to produce a fair number of chimeras, which are a nightmare to identify and remove (BLASTX seems to be the best approach, though a lengthy one). The other is the swamping out of interesting reads. There appear to be a number of virus families, but some of the reads are short and we don't know how 'true' they are.
To answer your additional points,
1) Yes, we are interested in a metagenomic study, with the intention of going back to our sample bank and screening them with what we've found in the NGS results.
2) If you could bin the reads by virus families or by bacterial 16S, that would be fine. Then we can use MEGAN or construct phylogenies for what we have.
Wed, 09/17/2014 at 4:56 PM
AccuraScience LB: The information you have provided is adequate for us to define the general scope of what we could do to help in this project. My understanding is that it includes the following components:
(1) Developing and implementing methods for sequencing data processing, including (a) eliminating or reducing (as much as possible) contamination from the host species (I hope the reference genome of the host species is available, but if it is not, we might have to use a related species and carefully examine how effective that is), and (b) evaluating and "dealing with" chimeric reads (which, as you said, could be a big nightmare; if needed, we could try to develop some methods to split the reads).
(2) Performing de novo assembly of the sequencing data, using two or more assemblers developed to handle metagenomics reads, in the hope of obtaining a set of contigs of reasonable quality.
(3) For bacterial species, attempting 16S rDNA-based methods for taxonomic analysis. 16S rDNA genes typically account for roughly one in every few thousand genes in a metagenomic sample, which could translate to ~10,000 reads per HiSeq lane's worth of data. A downside of Illumina data is the short read length. But I think this is worth trying if bacterial species are part of your interest.
(4) For both bacterial and viral species, attempting phylogenetic analysis based on the contig data, using tools such as PAML and SEMPHY.
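The read-count estimate in item (3) is a back-of-the-envelope figure, and the short Python sketch below shows how it depends on its inputs. The function name and the example values are assumptions for illustration, not measurements from your data. It also assumes, simplistically, that reads are spread evenly across genes regardless of gene length.

```python
def expected_16s_reads(
    total_reads: int,
    bacterial_fraction: float,
    genes_per_16s: int,
) -> float:
    """Rough expected number of reads that originate from 16S rDNA genes.

    total_reads        -- reads in the lane (assumed input)
    bacterial_fraction -- share of reads that are bacterial rather than
                          host or other material (assumed input)
    genes_per_16s      -- "one 16S gene per this many genes" (assumed input)
    """
    if not 0.0 <= bacterial_fraction <= 1.0:
        raise ValueError("bacterial_fraction must be between 0 and 1")
    if genes_per_16s <= 0:
        raise ValueError("genes_per_16s must be positive")
    return total_reads * bacterial_fraction / genes_per_16s
```

For example, with purely illustrative inputs of 150 million reads in a lane, 20% of them bacterial, and one 16S gene per 3,000 genes, `expected_16s_reads(150_000_000, 0.2, 3000)` comes out at about 10,000 reads. A sample with heavier host contamination would yield proportionally fewer.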
A clarification: the term "binning" as used in metagenomic analysis refers to an unsupervised, clustering-type analysis performed directly on the sequence data, without reference genomes. We can discuss this further as we move on.
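To illustrate what composition-based binning means in this sense, here is a minimal Python sketch. It represents each contig by its k-mer frequency profile (tetranucleotides by default) and groups contigs whose profiles are close to each other. The grouping step is a simple greedy "leader" clustering that stands in for the more robust statistical models real binning tools use. The function names and the `max_distance` threshold are hypothetical and chosen only for this example.

```python
from itertools import product
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def kmer_profile(seq: str, k: int = 4) -> List[float]:
    """Normalized frequency of every possible k-mer over ACGT in seq."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(all_kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips k-mers with N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[kmer] / total for kmer in all_kmers]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_bins(
    contigs: Dict[str, str],
    k: int = 4,
    max_distance: float = 0.1,  # hypothetical threshold
) -> List[List[str]]:
    """Assign each contig to the first bin whose leader profile is close
    enough; otherwise start a new bin with this contig as its leader."""
    bins: List[Tuple[List[float], List[str]]] = []
    for name, seq in contigs.items():
        profile = kmer_profile(seq, k)
        for leader_profile, members in bins:
            if euclidean(profile, leader_profile) <= max_distance:
                members.append(name)
                break
        else:
            bins.append((profile, [name]))
    return [members for _, members in bins]
```

No reference genome enters the calculation: contigs are grouped purely by how similar their composition is. That is what distinguishes binning in this sense from assigning reads to known virus families or 16S sequences. Note that composition profiles become noisy for short sequences, so this kind of approach is usually applied to contigs rather than to individual short reads.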
Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.
Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.