09/30/2014
Tue, 09/30/2014 at 12:44 AM
Customer: I have a project using the Affymetrix CytoScan platform (Whole-Genome 2.6M CytoScan cytogenetics arrays) to try to identify genetic causes of differences between individuals and cell strains. The study includes over 100 cell strains, which has produced a large amount of data for GWAS analysis.
Tue, 09/30/2014 at 3:56 PM
AccuraScience LB: The processing of the CytoScan arrays would be considered Track 1 work (please visit http://www.accurascience.com/pricing.html for an explanation of the three tracks in our pricing model), and thus would be quoted by sample number. The sample number of 100 mentioned in your message seems a little small for a GWAS study...
Because the ways to conduct GWAS vary greatly, we would like to know a little more about your considerations for this part of the work. In particular, would you want us to do it in a classical (frequentist) way, or with a Bayesian strategy? Is population stratification a concern? If so, how would you like us to deal with it: using PCA-based methods, or developing a mixed-effects model? Are you interested in identifying rare variants? If so, would you consider a collapsing strategy to increase the power of the analysis? Would you consider some of the "special" GWAS techniques, e.g., pathway-based GWAS?
Tue, 10/07/2014 at 3:11 AM
Customer: The number of samples in the project is 124, but there are repeats and also before-and-after-treatment samples. So we have somewhere between 150 and 200 samples to analyze.
As for the method of analysis, I myself am not sure what to use. The attached publication gives you a glimpse of this type of work. But we also have the molecular cytogenetic aspect of the CytoScan platform.
Wed, 10/08/2014 at 3:35 PM
AccuraScience LB: I have had a chance to quickly go over the paper you forwarded, as well as 4 papers cited in it that did similar GWAS work on the same treatment. These studies did not involve most of the "advanced" methods I mentioned in my last email, such as Bayesian methodology, a collapsing strategy or pathway-based GWAS. All of them used a straightforward linear-model-type strategy or multivariate logistic regression for association analysis, but two of them also looked at the population stratification issue using Q-Q plots and PCA. Thus it would be appropriate for your project to follow a similar general scheme.
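As an illustration of the Q-Q check mentioned above, the following is a minimal Python sketch, not the method used in the cited papers. It assumes the association p-values are already available as a plain list; the function names (`genomic_inflation`, `qq_points`) are hypothetical. It computes the genomic inflation factor (lambda) and the expected-versus-observed points that a Q-Q plot would display. A lambda well above 1 is commonly read as a sign of possible stratification.

```python
import math
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def chi2_1df_from_p(p):
    """Convert a two-sided p-value to the equivalent 1-df chi-square statistic.

    Assumes 0 < p <= 1; extremely small p-values (where 1 - p/2 rounds to 1.0)
    are outside the range this simple sketch can handle.
    """
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z


def genomic_inflation(p_values):
    """Lambda: median observed chi-square divided by the null median."""
    observed = median(chi2_1df_from_p(p) for p in p_values)
    expected = _STD_NORMAL.inv_cdf(0.75) ** 2  # median of chi-square(1), about 0.455
    return observed / expected


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, sorted ascending."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))


if __name__ == "__main__":
    # Hypothetical, perfectly uniform p-values: lambda should be close to 1.
    demo = [(i + 0.5) / 1000 for i in range(1000)]
    print(round(genomic_inflation(demo), 3))
```

The pairs returned by `qq_points` can be passed to any plotting tool; points rising above the diagonal across the whole range, rather than only in the tail, would suggest inflation.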
I have some quick questions/comments: (1) Are the individuals participating in the study all Caucasian? A mixed population (e.g., with both Caucasian and African American individuals) would likely make the analysis more complicated (due to population stratification considerations), and given the relatively small sample size, would make it less likely that significant variants are identified. (2) It is sensible to use a 2-stage design, with about half of the cases and controls treated as the "discovery cohort", and the remainder reserved as the "replication cohort". This would require designing a customized SNP array based on what we find in the discovery cohort, then carrying out the array experiments and genotyping analysis with the customized array on the replication cohort. (3) You mentioned repeats and the same individuals before and after treatment, if I understand correctly. This might complicate the design: the most common design for GWAS is a case-control design, where there should be no relationships between the case group and the control group. Could you tell me a little more about how the cases and controls are defined in your study?
Generally, the work would consist of the following steps: (1) Processing/analyzing the CytoScan array data for the discovery cohort to get genotyping results, followed by quality control; (2) Imputation, using IMPUTE2 or MACH, based on 1000 Genomes Project or HapMap data; (3) Examining the population stratification issue and, if problems are found, addressing them (there are uncertainties in this step); (4) Association testing for the discovery cohort (we will try to use PLINK for this); (5) Processing/analyzing the customized SNP array data for the replication cohort, followed by imputation (if needed); and (6) Association testing for the replication cohort.
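To make the 2-stage idea in point (2) concrete, here is a small Python sketch of one way to divide individuals into discovery and replication cohorts while keeping the case/control ratio similar in both. It is only an illustration under stated assumptions: the input mapping, the function name `split_cohorts` and the 50/50 default are hypothetical. The split is done per individual, so repeats or before/after samples from one person would stay in the same cohort.

```python
import random
from collections import defaultdict


def split_cohorts(status_by_individual, discovery_fraction=0.5, seed=0):
    """Split individuals into (discovery, replication), stratified by status.

    status_by_individual maps an individual ID to a label such as
    "case" or "control" (hypothetical labels for illustration).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_status = defaultdict(list)
    for individual, status in sorted(status_by_individual.items()):
        by_status[status].append(individual)

    discovery, replication = [], []
    for status in sorted(by_status):
        members = by_status[status]
        rng.shuffle(members)
        cut = round(len(members) * discovery_fraction)
        discovery.extend(members[:cut])
        replication.extend(members[cut:])
    return discovery, replication


if __name__ == "__main__":
    # Hypothetical cohort: 62 cases and 62 controls.
    cohort = {f"case_{i}": "case" for i in range(62)}
    cohort.update({f"ctrl_{i}": "control" for i in range(62)})
    disc, repl = split_cohorts(cohort)
    print(len(disc), len(repl))
```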
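For readers unfamiliar with what the association-testing steps (4) and (6) compute, the sketch below shows a basic single-SNP allelic chi-square test in plain Python. It is an illustration only: it is not PLINK itself, the function name `allelic_chi_square` and the genotype counts are hypothetical, and a real analysis would also need covariates and multiple-testing correction.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """1-df allelic chi-square test for one biallelic SNP.

    Each argument is a tuple of genotype counts (n_AA, n_AB, n_BB).
    Returns (chi_square, p_value).
    """
    def allele_counts(genotypes):
        n_aa, n_ab, n_bb = genotypes
        # Each homozygote carries two copies, each heterozygote one of each.
        return 2 * n_aa + n_ab, 2 * n_bb + n_ab

    a_case, b_case = allele_counts(case_counts)
    a_ctrl, b_ctrl = allele_counts(control_counts)
    total = a_case + b_case + a_ctrl + b_ctrl

    denominator = (
        (a_case + b_case) * (a_ctrl + b_ctrl)
        * (a_case + a_ctrl) * (b_case + b_ctrl)
    )
    if denominator == 0:
        raise ValueError("SNP is monomorphic or a group is empty")

    # Standard 2x2 contingency-table chi-square statistic.
    chi_square = total * (a_case * b_ctrl - b_case * a_ctrl) ** 2 / denominator
    # Upper tail of chi-square with 1 df, via the complementary error function.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts for a single SNP.
    chi_square, p_value = allelic_chi_square((30, 20, 10), (10, 20, 30))
    print(chi_square, p_value)
```

With roughly 2.6 million markers on the array, each SNP's p-value would have to clear a stringent genome-wide threshold, which is one reason the sample size matters so much in this discussion.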
Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.
Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.