Mon, 12/02/2013 at 11:17 AM
The customer is a microbiologist who wants to identify a few male-specific marker genes (that is, genes expressed specifically in males) in a fish species that is not a model organism.
Mon, 12/02/2013 at 12:13 PM
AccuraScience LB: The genome sequence of this fish species is not yet available, so building the transcriptome from your RNA-seq data would require de novo assembly. Once we have the assembled transcriptome, we can assign functional annotations (gene names and, if desired, GO annotations) to the genes/transcripts. We can then look for genes that are differentially expressed between the male and female samples. Depending on which organs/tissues you used for the RNA-seq experiments, this may or may not be meaningful to do.
Some EST data might be available for the species. It might be useful to map the EST data to the assembled transcriptome (or vice versa) to see if there is anything out of the ordinary.
Mon, 12/02/2013 at 12:28 PM
Customer: Would de novo assembly still be required for the above? My understanding is that we can just subtract one data set from the other.
Mon, 12/02/2013 at 1:02 PM
AccuraScience LB: The subtraction would need to be done on the gene/transcript level, and what you have got from the sequencer are short-read data. It would require assembly to produce gene/transcript data from the short-read data. If the reference genome were available, a mapping-based strategy would have worked.
If the available EST data are from related tissues, we could try a mapping-based method using the ESTs as a reference, but I am not confident this would work well - many reads would remain unmappable, and those would still require de novo assembly...
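To illustrate what "subtraction on the gene/transcript level" could look like once an assembly exists, here is a minimal Python sketch. It assumes reads have already been assembled into transcripts and counted per sample; the transcript IDs, the counts, and the two thresholds (min_male, max_female) are hypothetical, and the sketch ignores normalization for library size and replicate-based statistics, which a real analysis would need.

```python
# Hypothetical per-transcript read counts obtained after de novo assembly
# and read quantification; the IDs and numbers are made up for illustration.
male_counts = {"tx1": 520, "tx2": 0, "tx3": 310, "tx4": 45}
female_counts = {"tx1": 498, "tx2": 77, "tx3": 2, "tx4": 0}


def male_specific(male, female, min_male=30, max_female=2):
    """Return transcripts well supported in males and (nearly) absent in females.

    min_male and max_female are arbitrary illustrative cutoffs, not
    recommended values.
    """
    hits = []
    for tx in sorted(male):
        if male[tx] >= min_male and female.get(tx, 0) <= max_female:
            hits.append(tx)
    return hits


candidates = male_specific(male_counts, female_counts)
print(candidates)
```

The point of the sketch is that the comparison operates on transcripts, not on raw short reads, which is why an assembly (or a reference to map against) has to come first.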
Tue, 12/03/2013 at 9:44 AM
Customer: Are you recommending 454 for the RNA sequencing? We could do a half plate with each pool uniquely tagged. That would be about a quarter million reads for each male and female pool.
Tue, 12/03/2013 at 12:02 PM
AccuraScience LB: 454 does offer longer reads, which eases assembly-based analysis. Not many people are doing 454 for RNA sequencing, but it is doable - however, 454 experiments are less cost-effective than Illumina, and there is the additional concern that Roche is discontinuing it.
If you do decide to use 454, then a quarter million reads for each pool would be reasonable.
Tue, 12/03/2013 at 1:29 PM
Customer: What would you recommend we do to generate data suitable for identifying male-specific transcripts in the fish? Which Illumina platform provides sufficient read lengths?
Tue, 12/03/2013 at 2:25 PM
AccuraScience LB: Illumina is the most popular NGS platform. Its 3 current models (GAII, HiSeq - including HiSeq 1000 through HiSeq 2500 - and MiSeq) produce very similar data that can be analyzed in almost the same way, and all are capable of generating an adequate amount of data. The three models differ in throughput, per-base cost and sequencing time. Of the three, HiSeq is the winner in multiple aspects - it has the highest throughput and the lowest per-base cost, its sequencing speed is high enough, and many centers/core facilities have a HiSeq, making shopping around easier.
For the purpose of identifying a few dozen gender-specific genes, sequencing 20-40 million reads per sample (that is, male or female) with 100nt paired-end reads should be adequate. This requires 1/4 to 1/2 of a HiSeq lane (each HiSeq lane yields 120-160 million reads). You would need to order at least 1 lane at a time, so you could either increase the read number per gender to ~60 million, or add samples from another project to the same lane using barcoding (the male and female samples would be barcoded too) and kill two birds with one stone. One HiSeq lane would cost you $3000-4000. MiSeq and GAII would work too, and MiSeq can give longer read lengths (150-200nt), but they are not as easy to find, and the sequencing experiments are more costly if you want to generate the same amount of data.
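The lane arithmetic above can be checked with a short Python sketch. It assumes two barcoded samples (one male pool, one female pool) sharing a lane; the function name lane_fraction is hypothetical. The 1/4 to 1/2 figure corresponds to the upper lane yield of 160 million reads; at the lower yield of 120 million, the same two samples would take roughly 1/3 to 2/3 of a lane.

```python
def lane_fraction(reads_per_sample_m, n_samples, lane_yield_m):
    """Fraction of one lane used, with all read counts in millions."""
    return reads_per_sample_m * n_samples / lane_yield_m


# Two samples (male and female) at 20-40 million reads each,
# on a lane yielding 160 million reads.
low = lane_fraction(20, 2, 160)
high = lane_fraction(40, 2, 160)

# Raising depth to ~60 million reads per sample fills a whole lane
# at the lower yield of 120 million reads.
full = lane_fraction(60, 2, 120)

print(low, high, full)
```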
Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.
Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.