Routine and Advanced RNA-seq and Exome Data Analysis Options (9/7/2014)

Sun, 09/07/2014

Customer asks what routine and advanced analyses of RNA-seq and exome sequencing data we can perform for researchers.

Sun, 09/07/2014 at 12:35 PM

AccuraScience LB: For RNA-seq analysis, the "routine" analysis pipeline includes (1) sequencing data quality control using FastQC, and (2) running the Cufflinks workflow, which includes (a) mapping of reads to the reference genome using TopHat (which is based on Bowtie), allowing identification of exon-exon junctions, (b) the core Cufflinks process, which estimates expression levels for each gene (or "locus" in Cufflinks' terms), transcript isoform and exon, (c) CuffDiff, which identifies differentially expressed genes (or loci), transcript isoforms and exons, and (d) CuffCompare, which identifies loci, transcript isoforms and exons that are present in one condition but absent in others; if existing gene model data (e.g., from RefSeq, Ensembl, or UCSC Known Genes) are provided, CuffCompare also allows identification of "novel" genes or transcript isoforms.
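As a rough illustration of how the routine RNA-seq steps chain together, the sketch below only builds the command lines and does not run anything. The function name `tuxedo_commands`, the directory layout, and the choice of single-end reads are assumptions made for this example, and passing the reference annotation straight to CuffDiff is a simplification. It is not a description of the production pipeline.

```python
from typing import Dict, List


def tuxedo_commands(
    samples: Dict[str, List[str]],
    bowtie_index: str,
    annotation_gtf: str,
    out_dir: str = "rnaseq_out",
) -> List[List[str]]:
    """Build (but do not run) command lines for the routine RNA-seq steps.

    `samples` maps a condition label to FASTQ paths, one file per replicate.
    Single-end reads are assumed to keep the sketch short.
    """
    commands: List[List[str]] = []
    bams_by_condition: Dict[str, List[str]] = {}
    assemblies: List[str] = []

    for condition, fastqs in samples.items():
        for i, fastq in enumerate(fastqs, start=1):
            run = f"{out_dir}/{condition}_rep{i}"
            # (1) Read-level quality control.
            commands.append(["fastqc", "-o", out_dir, fastq])
            # (2a) Spliced alignment; TopHat writes accepted_hits.bam to its -o dir.
            commands.append(
                ["tophat", "-o", f"{run}_tophat", "-G", annotation_gtf,
                 bowtie_index, fastq]
            )
            bam = f"{run}_tophat/accepted_hits.bam"
            # (2b) Annotation-guided assembly and abundance estimation.
            commands.append(
                ["cufflinks", "-o", f"{run}_cufflinks", "-g", annotation_gtf, bam]
            )
            bams_by_condition.setdefault(condition, []).append(bam)
            assemblies.append(f"{run}_cufflinks/transcripts.gtf")

    # (2d) Compare assembled transcripts with the reference annotation.
    commands.append(
        ["cuffcompare", "-r", annotation_gtf, "-o", f"{out_dir}/cuffcmp", *assemblies]
    )
    # (2c) Differential expression; replicates of one condition are comma-joined.
    labels = list(bams_by_condition)
    commands.append(
        ["cuffdiff", "-o", f"{out_dir}/cuffdiff", "-L", ",".join(labels),
         annotation_gtf, *[",".join(bams_by_condition[c]) for c in labels]]
    )
    return commands
```

Each returned list could be handed to `subprocess.run` once the tools are installed and the output directory exists; keeping command construction separate from execution makes the plan easy to inspect first.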

Some "additional" analyses for RNA-seq data that are not included in the routine analysis pipeline include (1) functional annotation of novel genes/transcript isoforms identified, (2) informatics analysis for the purpose of annotating long non-coding RNAs, (3) coding potential calculation for newly identified genes/transcript isoforms, for characterization of long non-coding RNAs, (4) pathway analysis for differentially expressed genes, based on GO term enrichment or Gene Set Enrichment Analysis (GSEA)-based methodologies, and (5) identification of gene fusion events, which is very relevant for some cancer-related transcriptome analysis projects.
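To make the pathway analysis item more concrete, here is a minimal sketch of a GO-style over-representation test using the hypergeometric distribution. This is the simplest form of enrichment testing, not GSEA itself (which works on ranked gene lists), and the function name and arguments are hypothetical choices for this example.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int,
                       selected: int, overlap: int) -> float:
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of genes tested in the experiment
    annotated:  genes in that population belonging to the pathway / GO term
    selected:   differentially expressed genes
    overlap:    differentially expressed genes that belong to the pathway
    """
    if not (0 <= annotated <= population and 0 <= selected <= population):
        raise ValueError("annotated and selected must lie within the population")
    if not 0 <= overlap <= min(annotated, selected):
        raise ValueError("overlap cannot exceed annotated or selected")

    total = comb(population, selected)
    # Sum the probability of seeing `overlap` or more pathway genes by chance.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, min(annotated, selected) + 1)
    )
    return tail / total
```

In practice the p-values for all tested pathways would still need multiple-testing correction, which this sketch leaves out.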

For exome or other targeted sequencing data (including cancer panel-based resequencing data), the "routine" analysis pipeline includes (1) sequencing data quality control using FastQC, and (2) running the GATK pipeline, which includes (a) mapping of sequencing reads to the reference genome using BWA, (b) base quality score recalibration, (c) Smith-Waterman algorithm-based sequence realignment, for more reliable identification of indels, (d) SNV and indel calling, and (e) (optionally) population-level SNV/indel calling refinement, including data imputation (if needed) and population genetics model-based calling refinement.
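The idea behind quality score recalibration (step b above) can be shown with a small worked example: a Phred score encodes an error probability, and recalibration compares the reported score with the error rate actually observed at sites believed to be non-variant. The sketch below is a simplified illustration; GATK's real model also conditions on covariates such as read group and sequence context, and the function names and the add-one smoothing are choices made here.

```python
from math import log10


def phred_to_error_prob(quality: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-quality / 10)


def empirical_quality(mismatches: int, bases: int) -> float:
    """Phred-scaled quality implied by the observed mismatch rate.

    Add-one smoothing (an assumption of this sketch) avoids an infinite
    score when no mismatches are observed.
    """
    if bases < 0 or not 0 <= mismatches <= bases:
        raise ValueError("need 0 <= mismatches <= bases")
    return -10 * log10((mismatches + 1) / (bases + 2))


# Bases reported as Q30 claim roughly 1 error in 1,000. If about 1 in 100
# of them mismatches the reference, the data support only about Q20.
reported_error = phred_to_error_prob(30)
observed_quality = empirical_quality(mismatches=99, bases=9998)
```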

Some "additional" analyses for exome sequencing data that go beyond "routine" analysis include (1) characterization/prediction of functional consequences of the variants identified, using tools such as ANNOVAR, SIFT, PolyPhen2, LRT, MutationTaster (and several others), (2) pathway analysis for recurrent variants, which we are frequently asked to do for cancer-related projects, (3) for cancer-related projects involving data at high depth of coverage (>500X), calculation of mutant allele frequency and, based on it, the confidence of the identified variants, (4) identification of copy number variations and/or structural variants, for which there are several established strategies based on (i) read depth, (ii) split reads, (iii) abnormal distance between the two ends of paired-end reads and (iv) assembly of reads. (A word of caution: despite the soundness of these strategies, these methods do not work well in practice: typically, two tools applied to identify structural variations on the same sequencing dataset produce results that have only ~20% agreement between them), (5) with high-quality, deep-coverage tumor resequencing data, "advanced" methods are available to characterize the subclonal structure of the cells and even derive the evolutionary history of how that subclonal structure arose, and (6) when other data (e.g., GWAS or transcriptome data) are available, it is possible to develop more sophisticated network analysis strategies for identifying critical clues hidden in the data that are important for the objectives of the study, e.g., which mutations are drivers and how they led to the subclonal organization observed.
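For item (3), the reason depth matters can be illustrated with a short calculation: the mutant allele frequency is the fraction of reads carrying the variant, and the uncertainty around it shrinks as coverage grows. The sketch below uses a Wilson score interval, which is one reasonable choice and not necessarily the method used in any given project; the function name and the 95% default are assumptions.

```python
from math import sqrt
from typing import Tuple


def allele_frequency_interval(alt_reads: int, depth: int,
                              z: float = 1.96) -> Tuple[float, float, float]:
    """Return (frequency, lower, upper) for a variant at one site.

    alt_reads: reads supporting the mutant allele
    depth:     total reads covering the site
    z:         normal quantile; 1.96 gives an approximate 95% interval
    """
    if depth <= 0 or not 0 <= alt_reads <= depth:
        raise ValueError("need depth > 0 and 0 <= alt_reads <= depth")

    p = alt_reads / depth
    denom = 1 + z * z / depth
    centre = (p + z * z / (2 * depth)) / denom
    half = z * sqrt(p * (1 - p) / depth + z * z / (4 * depth * depth)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)
```

For example, 25 variant reads out of 500 gives a frequency of 5% with an interval of roughly 3.4% to 7.3%, whereas the same 5% frequency at much lower depth would be compatible with a far wider range.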

Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.
