Gene and Pathway Analysis for Lung Cancer Samples (10/4/2014)


10/01/14

Wed, 10/01/2014 at 4:10 PM

Customer: The Director’s Challenge Lung Study (DCLS) is a publicly available dataset hosted by the NCICB caIntegrator portal at https://caintegrator.nci.nih.gov/caintegrator/. The study comprises gene expression profiles acquired on Affymetrix microarray chips from more than 400 specimens of early-stage lung cancer, with associated clinical and pathological annotation. Our lab is interested in a transcription factor gene named T1. We have experimental data suggesting a connection between T1 and the C1 pathway. Could a cluster analysis be done on the DCLS lung tumor gene expression data using the genes belonging to the C1 pathway? We would like to see whether T1 expression status differs significantly among the resulting clusters. We are also interested in detecting statistically significant correlation of a particular cluster with the clinical parameters of DCLS, such as survival.

In a separate project, we have detected experimentally that T1 may upregulate many genes involved in the A1 pathway. We are wondering whether we could first stratify the DCLS dataset into two groups and then compare the overall expression patterns of A1-related genes between the two groups. The basic question is: can we detect evidence supporting positive regulation of overall A1 gene expression by T1 in the DCLS dataset?

Fri, 10/03/2014 at 11:07 AM

AccuraScience LB: Here is what we would propose to do: (1) Obtain the DCLS data and, if needed, process it into formats that can be used for the following work (since these are microarray datasets, I suspect normalization will be needed; if so, quantile normalization will be performed so that the expression data are directly comparable across samples).
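As an illustration of the quantile normalization mentioned in step (1), here is a minimal Python sketch. It is not the code used in the project: the function name `quantile_normalize` and the toy matrix are hypothetical, the matrix is assumed to be laid out as genes (rows) by samples (columns), and ties are broken by order of appearance rather than averaged.

```python
import numpy as np

def quantile_normalize(expr):
    """Give every sample (column) of a genes x samples matrix the same distribution."""
    expr = np.asarray(expr, dtype=float)
    # Sort each column: row i then holds the i-th smallest value of every sample.
    sorted_cols = np.sort(expr, axis=0)
    # Reference distribution: the mean across samples at each rank.
    reference = sorted_cols.mean(axis=1)
    # 0-based rank of each value within its own column (ties broken by position).
    ranks = np.argsort(np.argsort(expr, axis=0, kind="stable"), axis=0, kind="stable")
    # Replace each value with the reference value at its rank.
    return reference[ranks]

# Hypothetical toy matrix: 4 genes x 3 samples.
toy = np.array([[5.0, 4.0, 3.0],
                [2.0, 1.0, 4.0],
                [3.0, 4.0, 6.0],
                [4.0, 2.0, 8.0]])
normalized = quantile_normalize(toy)
```

After normalization, every column contains the same set of values, so per-sample intensity distributions are identical and only the ranking of genes within a sample differs.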

(2) Split the samples into two subsets, namely T1-high and T1-low, according to the normalized expression of this gene; perform differential expression analysis of all genes between the two subsets; and perform GO-based pathway analysis on both the up- and down-regulated genes from the differential expression results. It is expected that the C1 and A1 pathways would show up among the most significantly enriched pathways. Other significantly enriched pathways might provide mechanistic insights for your future work.
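Step (2) can be sketched as follows, under several assumptions that are not part of the proposal itself: a median split defines T1-high and T1-low, a per-gene Welch t-test stands in for the differential expression method, Benjamini-Hochberg adjustment controls the false discovery rate, and a one-sided hypergeometric test stands in for the GO enrichment step. All function names and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(pvalues):
    """Convert raw p-values to Benjamini-Hochberg adjusted q-values."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(adjusted, 0.0, 1.0)
    return q

def differential_expression(expr, high_mask):
    """Welch t-test per gene (row) between T1-high and T1-low samples (columns)."""
    t, p = stats.ttest_ind(expr[:, high_mask], expr[:, ~high_mask],
                           axis=1, equal_var=False)
    return t, benjamini_hochberg(p)

def enrichment_pvalue(hits, n_selected, n_in_pathway, n_genes):
    """One-sided hypergeometric p-value for over-representation of a gene set."""
    return stats.hypergeom.sf(hits - 1, n_genes, n_in_pathway, n_selected)

# Simulated data (an assumption for illustration): 200 genes x 60 samples.
rng = np.random.default_rng(0)
expr = rng.normal(8.0, 1.0, size=(200, 60))
t1_row = 0                                   # hypothetical row index of T1
expr[1] = expr[t1_row] + rng.normal(0.0, 0.2, size=60)  # gene 1 tracks T1

high_mask = expr[t1_row] > np.median(expr[t1_row])      # median split
t_stat, q_values = differential_expression(expr, high_mask)
up_genes = np.flatnonzero((q_values < 0.05) & (t_stat > 0))
```

T1 itself is trivially significant because it defines the split, so it should be excluded before the enrichment step. For a pathway with `n_in_pathway` annotated genes, `enrichment_pvalue` would be called with the number of those genes found in `up_genes`.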

(3) Take the reverse approach to (2): cluster the array data based on the expression of (a) all genes belonging to the C1 pathway, and (b) all genes belonging to the A1 pathway. At least two clustering methods will be applied (e.g., hierarchical clustering, K-means clustering and self-organizing maps (SOMs)), and the results will be examined for consistency. It is expected that 2-4 interpretable clusters will result (e.g., we can interpret them as low-, intermediate-, and high-C1 (or A1) activity).
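A minimal sketch of step (3), comparing two of the methods named above (hierarchical clustering and K-means) on simulated pathway data. The assumptions here are mine: genes are z-scored before clustering, average linkage with Euclidean distance is used, and K-means is seeded deterministically from samples spread along the mean pathway score so that the example is reproducible. The function names and data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

def zscore_genes(expr):
    """Standardize each gene (row) to mean 0 and standard deviation 1."""
    return (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

def cross_tabulate(labels_a, labels_b):
    """Contingency table counting samples for each pair of cluster labels."""
    _, ia = np.unique(labels_a, return_inverse=True)
    _, ib = np.unique(labels_b, return_inverse=True)
    ia, ib = np.ravel(ia), np.ravel(ib)
    table = np.zeros((ia.max() + 1, ib.max() + 1), dtype=int)
    np.add.at(table, (ia, ib), 1)
    return table

# Simulated pathway matrix (an assumption): 15 genes x 60 samples,
# with three groups of 20 samples at low, intermediate and high activity.
rng = np.random.default_rng(1)
activity = np.repeat([-2.0, 0.0, 2.0], 20)
pathway_expr = activity[None, :] + rng.normal(0.0, 0.5, size=(15, 60))

samples = zscore_genes(pathway_expr).T       # samples as rows for clustering
k = 3

# Method 1: average-linkage hierarchical clustering cut into k clusters.
tree = linkage(samples, method="average", metric="euclidean")
hier_labels = fcluster(tree, t=k, criterion="maxclust")

# Method 2: K-means, initialized from samples at evenly spaced positions
# along the mean pathway score (a deterministic choice for reproducibility).
order = np.argsort(samples.mean(axis=1))
positions = ((np.arange(k) + 0.5) * len(order) / k).astype(int)
_, kmeans_labels = kmeans2(samples, samples[order[positions]], minit="matrix")

agreement = cross_tabulate(hier_labels, kmeans_labels)
```

If the two methods agree, each row of `agreement` has a single non-zero entry; samples spread across several columns indicate clusters that are not stable across methods.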

(4) Perform statistical analysis (chi-square test or rank-sum test) to determine whether there is a significant association between the clusters resulting from (3) and (a) T1 expression, and (b) any clinical and pathological annotation, e.g., survival time.
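The tests in step (4) could look like the sketch below, which uses a chi-square test for a categorical variable (T1-high vs T1-low) and a rank-based test for a continuous one (Mann-Whitney for two clusters, Kruskal-Wallis for more). The function names and simulated values are hypothetical. Censored survival times would normally call for a dedicated survival method such as a log-rank test, which is not shown here.

```python
import numpy as np
from scipy import stats

def cluster_vs_category(cluster_labels, category_labels):
    """Chi-square test of association between clusters and a categorical variable."""
    clusters, ci = np.unique(cluster_labels, return_inverse=True)
    categories, ki = np.unique(category_labels, return_inverse=True)
    table = np.zeros((clusters.size, categories.size), dtype=int)
    np.add.at(table, (np.ravel(ci), np.ravel(ki)), 1)
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    return table, p_value

def cluster_vs_continuous(cluster_labels, values):
    """Rank-based test of a continuous variable across clusters."""
    groups = [values[cluster_labels == c] for c in np.unique(cluster_labels)]
    if len(groups) == 2:
        # Two clusters: Wilcoxon rank-sum (Mann-Whitney U) test.
        return stats.mannwhitneyu(groups[0], groups[1], alternative="two-sided").pvalue
    # More than two clusters: Kruskal-Wallis generalization.
    return stats.kruskal(*groups).pvalue

# Simulated example (an assumption): 3 clusters of 30 samples with rising T1 expression.
rng = np.random.default_rng(2)
clusters = np.repeat([1, 2, 3], 30)
t1_expr = rng.normal(loc=np.repeat([6.0, 8.0, 10.0], 30), scale=1.0)
t1_status = np.where(t1_expr > np.median(t1_expr), "high", "low")

table, p_chi2 = cluster_vs_category(clusters, t1_status)
p_rank = cluster_vs_continuous(clusters, t1_expr)
```

The same two functions apply to clinical annotation: categorical fields (e.g., stage) go through `cluster_vs_category`, and uncensored continuous fields through `cluster_vs_continuous`.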

Fri, 10/03/2014 at 12:41 PM

Customer: The DCLS dataset also contains several non-cancer tissues as controls. The data may have been quantile-normalized.

The C1 and A1 pathways may not immediately show up as significantly enriched pathways. It may be necessary to further stratify the T1-high group into subgroups with highest or medium-high… Only then might enrichment of one or both pathways be detected.

We will have to see how many clinical parameters are provided by the DCLS dataset. Patient survival is generally the factor that attracts the most attention. If the DCLS study pans out, it would be great to replicate the “productive” analysis method on the TCGA lung cancer data.

Sat, 10/04/2014 at 9:43 AM

AccuraScience LB: Noted. We will try to take advantage of the control samples in DCLS in our analysis. We will fine-tune the splitting of samples according to T1 expression level, followed by pathway enrichment analysis. I would suggest that we come back to discuss replication using TCGA data as a separate line of work, after the work with the DCLS datasets is completed.


Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.