09/18/2014
Customer describes a metagenomics project characterizing the gut genomes of fish for purposes of identifying novel bacterial genes involved in a specific metabolic pathway with potential industrial utility. He is also interested in the taxonomic compositions of the fish gut metagenomes. After a few rounds of discussion about the specifics of the project, AccuraScience submits a proposal.
Thu, 09/18/2014 at 5:23 PM
AccuraScience LB: We would propose that AccuraScience do the following for this project: (1) Attempt de novo assembly using 2 or more assemblers designed for metagenomic data, e.g., MetaVelvet and Meta-IDBA. Only in very rare cases does this effort produce high-quality full-genome assemblies. It is more reasonable to expect a series of contigs out of this analysis, which will be needed for some (but not all) downstream analysis steps.
(2) Attempt binning-based methods, e.g., PhyloPythia and MEGAN, to bin or cluster the reads into classes, then map them to existing bacterial reference genomes - using pipelines such as MG-RAST - in order to determine the taxonomic compositions of the samples sequenced. Comparison can be made between the two samples.
(3) (Optionally) There is a chance that the sequencing data contain enough 16S rDNA reads for us to attempt some taxonomic analysis and then compare the result with that of (2). In a typical metagenomic sample, one in a few thousand genes corresponds to an rDNA gene. Thus, in a HiSeq lane's worth of sequencing data, there should be between 10,000 and 100,000 reads corresponding to 16S rDNA, which may or may not be adequate for 16S rDNA-based analysis. The downside of Illumina data is its short read length.
(4) Attempt gene prediction, using tools such as FragGeneScan, followed by functional annotation using RAST and BLASTX. There is uncertainty about the computational power required if we do it the BLASTX way, so we might need to develop a workaround. If, with some effort, it is still computationally too costly, we might have to switch to a more focused approach and - with your help - collect information on as many chitin-degradation genes as possible from existing resources, then focus on annotating the genes in the metagenomic samples that share similarity with those genes.
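As a rough check on the read-count estimate in step (3), the arithmetic can be sketched as follows. This is only a back-of-the-envelope illustration: the lane yield of 150 million reads is an assumption that does not come from the proposal, the function name is hypothetical, and the calculation treats reads as spread evenly across genes, ignoring gene length and copy number.

```python
def expected_rdna_reads(total_reads, genes_per_rdna_gene):
    """Expected number of rDNA reads if about one gene in
    `genes_per_rdna_gene` is an rDNA gene and reads are spread
    evenly across genes (a simplifying assumption)."""
    return total_reads / genes_per_rdna_gene


# ASSUMPTION: lane yield chosen for illustration only; not from the proposal.
ASSUMED_LANE_READS = 150_000_000

# "One in a few thousand genes" is read here as a range of ratios.
for ratio in (2_000, 5_000, 10_000):
    estimate = expected_rdna_reads(ASSUMED_LANE_READS, ratio)
    print(f"1 in {ratio:>6,} genes: about {estimate:,.0f} rDNA reads")
```

Under these assumptions, all three ratios give estimates inside the 10,000 to 100,000 range quoted in step (3); a different lane yield would shift the numbers proportionally.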
Tue, 09/23/2014 at 4:25 AM
Customer: For some of the samples our interests are in other genes related to amino acid metabolism. And, what would you suggest that we do with the optional step (3)?
Tue, 09/23/2014 at 9:16 AM
AccuraScience LB: The functional annotation of the genes in the samples is challenging, but not because it involves any complicated procedure - in fact, the procedure is quite simple: invoke BLAST to map the sequences against all known sequences that are potentially relevant. The problem is that this procedure could take a very long time to complete, so it is the computational time that could be a killer for this step of the work. The brute-force way of doing this, which involves mapping all sequences against all known microbial sequences, could take a month or longer to complete for one sample. That might be tolerable for one or even two samples, but it is certainly not a scalable option for any larger number of samples. Thus, we have to develop some more "intelligent" way of handling this. If you guys can work together with us to narrow down the list of known sequences to map the data against - for different subsets of the samples, these could be genes involved in different metabolic pathways - this will drastically reduce the computational time required and make this doable within a reasonable time frame.
About the 16S rDNA work, we plan to give it a try anyway. We labeled this as an "optional" task because it involves a higher level of risk than we felt comfortable quoting: although there are reports suggesting that it could work, most people in the field do not do it this way.
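One possible way to narrow down the list of known sequences, as suggested in the reply above, is to filter a reference FASTA file by keywords in the record headers before running BLAST against it. The sketch below is a minimal illustration, not the method AccuraScience actually used: the function names, the keyword list, and the toy records are all hypothetical, and real header annotations vary by database, so keyword matching would need manual review.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def filter_by_keywords(records, keywords):
    """Keep records whose header contains any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    for header, seq in records:
        if any(k in header.lower() for k in lowered):
            yield header, seq


# Toy records for illustration only; headers and sequences are made up.
toy_reference = [
    ">p1 chitinase A", "MKT", "LLA",
    ">p2 DNA polymerase", "MSS",
    ">p3 Chitin-binding protein", "MAA",
]

for header, seq in filter_by_keywords(read_fasta(toy_reference), ["chitin"]):
    print(f">{header}\n{seq}")
```

The reduced set of records could then be written out and used as the search database, so that each sample is compared against far fewer reference sequences than in the brute-force approach.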
Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.
Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.