Annotation of a Higher Eukaryotic Genome (9/19/2014)


Fri, 09/19/2014

The customer has had a non-model organism's genome sequenced and assembled at another facility, and asks how we would annotate the genome. He also has RNA-seq data from multiple tissues of the same species, and asks about a back-up plan in case the other facility's assembly work does not produce assemblies of sufficient quality.

Fri, 09/19/2014 at 5:49 PM

AccuraScience LB: The work we would propose includes the following:

(1) Evaluating the quality of the assembled genome; if the quality is not acceptable, proceeding to (5).

(2) Carrying out the "computational phase" of the annotation procedure, including: (a) repeat identification and masking with RepeatMasker; (b) evidence alignment, that is, aligning the RNA-seq data provided to the assembly using splice-aware aligners such as Splign and Spidey; (c) ab initio gene prediction, using Augustus, SNAP, and/or GeneMark-ES; (d) evidence-driven gene prediction, incorporating the RNA-seq data, using TwinScan, FGENESH and/or GAZE.

(3) Carrying out the "annotation phase" of the annotation procedure, which involves running "chooser" programs such as JIGSAW and EVidenceModeler. These programs evaluate and integrate gene predictions from multiple sources, which may conflict with one another, and produce more reliable gene model annotations.

(4) Performing quality control, and producing annotation results in formats suitable for visualization (GenBank, GFF3 or GTF).
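
As an illustration of the kind of check involved in task (1), the sketch below computes basic contiguity statistics (sequence count, total length, longest sequence, N50) from a FASTA assembly. It is a minimal example under stated assumptions, not the procedure actually used: the function names are hypothetical, and contiguity alone does not establish that an assembly is adequate for annotation (gene-space completeness and misassembly checks would also be needed).

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def assembly_summary(lines: Iterable[str]) -> Dict[str, int]:
    """Collect simple contiguity statistics for one assembly (hypothetical helper)."""
    lengths = read_fasta_lengths(lines)
    return {
        "n_sequences": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

A typical call would be `assembly_summary(open("assembly.fasta"))`, where the file name is a placeholder. Whether a given N50 is "good enough" depends on the genome size and gene structure of the species, so no threshold is hard-coded here.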

If the quality of the genome assembly is judged inadequate for annotation, the work following task (1) would instead consist of:

(5) Performing de novo assembly of the transcriptome, using the RNA-seq data provided.

(6) Carrying out BLAST-based functional annotation of the transcriptome (expressed genes).
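
To make task (6) more concrete, the sketch below picks the best-scoring database hit for each assembled transcript from BLAST tabular output. It assumes the standard 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the `Hit` type, and the 1e-5 e-value cutoff are illustrative assumptions, not values taken from the proposal.

```python
import csv
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str      # transcript ID (qseqid)
    subject: str    # database sequence ID (sseqid)
    pident: float   # percent identity
    evalue: float
    bitscore: float


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Keep, for each query, the hit with the highest bit score that passes the cutoff."""
    best: Dict[str, Hit] = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comment lines, and rows without all 12 columns.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        hit = Hit(row[0], row[1], float(row[2]), float(row[10]), float(row[11]))
        if hit.evalue > max_evalue:
            continue  # assumed significance cutoff; adjust per project
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best
```

In practice the subject IDs returned here would then be mapped to protein names or functional terms from the reference database. Transcripts with no hit passing the cutoff are simply absent from the result and would be reported as unannotated.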

Additional note: Although this is an engineering-oriented project, parts of the work involve trial-and-error testing and comparison of multiple tools with similar functionality, in order to determine the procedure best suited to the particular datasets. The task descriptions above should therefore be read as a general guideline that defines the scope of the work, not as procedures to be followed strictly. Over the course of the project, we may substitute alternative tools or methods that, according to our testing, work better than those listed, while pursuing the same general goals defined in the task descriptions.


Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited from genuine communications that took place between a customer and the AccuraScience data analysis team on the specified dates and times. The editing was done to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.