Human Gut Microbial Population and Disease (12/10/2013)


Sun, 12/08/2013

Customer and her graduate student are Human Microbiome Project researchers, and they have approximately 30 fecal samples pyrosequenced by 454. Basic metadata is available (age, gender, autism diagnosis, and family unit groupings). Raw data has been cleaned using mothur and has been submitted to MG-RAST, giving access to a profile of community abundance in each sample. They ask for suggestions on determining whether or not there are significant changes at either the genus or possibly even species level in individuals with autism when compared to either related or unrelated controls.

Sun, 12/08/2013 at 9:12 AM

AccuraScience LB: Based on your description, I assume that you did targeted amplicon sequencing (of 16S rRNA) rather than a whole-metagenome study. Could you confirm this? Metagenomics would have the additional advantage of providing functional clues (e.g. which genes, or which gene functions, in those microbial species are associated with autism). Could you explain why you did not take the whole-metagenome approach?

Could you confirm that you have completed the MG-RAST analysis, which provided you with the estimated taxonomic structure with a fraction assigned for each taxonomic unit - for each sample? I wonder what other output you have got from MG-RAST.

I need to think a little about the best statistical approach for your situation. Because you have multiple samples in each category (autism patient, family member of a patient, and outside control), an ANOVA-type analysis for each taxonomic unit seems appropriate. However, if the data you have are fractions, ANOVA becomes difficult. If the total read numbers were the same for every sample, we could use read numbers in the ANOVA. On top of these considerations, there is also the question of how to take the hierarchical structure of the taxonomy into account when we look at multiple taxonomic units. Some strategies used in GO (gene ontology) analysis, which rely on a hypergeometric distribution assumption, might be appropriate to apply.
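As a rough illustration of the per-taxon ANOVA idea, here is a minimal Python sketch. It converts each sample's read counts to fractions (so that samples with different read depths become comparable) and computes a one-way ANOVA F statistic for each taxonomic unit across the three categories. The sample counts, taxon names, and function names are all hypothetical. The sketch ignores family groupings and the taxonomic hierarchy, and fractions within a sample are bounded and sum to one, which is part of why a plain ANOVA on them is questionable. A p-value would additionally need the F distribution (for example via `scipy.stats.f_oneway`).

```python
from statistics import mean

def to_fractions(counts):
    """Convert one sample's read counts per taxon into fractions of its total reads."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations.

    Raises ZeroDivisionError if there is no within-group variation at all.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def per_taxon_f(samples, taxa):
    """samples: list of (group_label, counts_dict). Returns {taxon: F statistic}."""
    result = {}
    for taxon in taxa:
        by_group = {}
        for label, counts in samples:
            # A taxon absent from a sample is treated as fraction 0.
            frac = to_fractions(counts).get(taxon, 0.0)
            by_group.setdefault(label, []).append(frac)
        result[taxon] = anova_f(list(by_group.values()))
    return result

# Hypothetical genus-level read counts; note the unequal read depths.
taxa = ["Bacteroides", "Prevotella", "Clostridium"]
samples = [
    ("autism",   {"Bacteroides": 410, "Prevotella": 120, "Clostridium": 90}),
    ("autism",   {"Bacteroides": 820, "Prevotella": 310, "Clostridium": 150}),
    ("relative", {"Bacteroides": 300, "Prevotella": 260, "Clostridium": 150}),
    ("relative", {"Bacteroides": 500, "Prevotella": 380, "Clostridium": 220}),
    ("control",  {"Bacteroides": 250, "Prevotella": 400, "Clostridium": 100}),
    ("control",  {"Bacteroides": 520, "Prevotella": 700, "Clostridium": 260}),
]

if __name__ == "__main__":
    for taxon, f_stat in per_taxon_f(samples, taxa).items():
        print(f"{taxon}: F = {f_stat:.2f}")
```

Testing many taxonomic units this way would also call for a multiple-testing correction, which this sketch does not attempt.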

Sun, 12/08/2013 at 11:02 PM

Customer: Yes, you're right, we did 16S rRNA sequencing as our lab is primarily interested in the dynamics of the entire community rather than individual genes.

You're also correct that we have used MG-RAST to look at the proportions of the community at different taxonomic units, but (as you probably know) MG-RAST is a little clumsy since you need to backtrack and generate new sets of data for each level. I can send you a sample of our data as requested, but can you please clarify if you'd like a portion of data from one single fecal sample, or if you'd like data from several samples?

Reads were not consistent between samples; there was variability within each multiplexed sample, as well as variation between sequencing runs. Can you suggest how best to deal with this issue?

Tue, 12/10/2013 at 7:47 PM

AccuraScience LB: This is perhaps a little more complicated than you would like to hear. I explained the challenges of this analysis in my last email. Essentially, both the groups and the taxonomic units have a hierarchical structure. To do it "right" would be mathematically involved.

There are two ways that we might try to do this:

(1) Multi-response generalized linear mixed effect model

We can fit a mixed effect model which uses the matrix of taxonomic units (X) as response variables and uses other factors (y and Z, such as the autism indicator, sex, age, etc.) as covariates. Random effects can be added based on the family groups.

The problem with this method is that it does not account for the hierarchical structure of the taxonomic units, and thus could lead to inconsistencies, e.g., many species within a genus are significant, but the genus itself is not.
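One possible way to write down such a model, offered only as a sketch: the link function g, the per-taxon coefficients, and the family index f(i) are notation assumed here for illustration and do not come from the correspondence.

```latex
\[
  g\bigl(\mathbb{E}[X_{ij} \mid u]\bigr)
    = \mu_j + \beta_j\, y_i + \gamma_j^{\top} Z_i + u_{f(i),\,j},
  \qquad
  u_{f,j} \sim \mathcal{N}\bigl(0, \sigma_j^{2}\bigr)
\]
```

Here X_ij is the abundance of taxonomic unit j in sample i, y_i is the autism indicator, Z_i holds the remaining covariates, and u_{f(i),j} is a random intercept shared by all samples from family f(i). A truly multi-response fit would additionally let the random effects and residuals be correlated across taxonomic units. Nothing in this form ties a genus to its species, which is where the inconsistency described above comes from.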

(2) Logistic regression with lasso and neighbor smoothing

Let X_ij be the normalized abundance of taxonomic unit j in sample i. Let y_i indicate whether or not the individual who provided sample i has been diagnosed with autism. Let Z_i be the covariates other than taxonomic units (such as sex, family group, etc.). We will fit a logistic regression model like this:

logit(P(y_i = 1)) = alpha * Z_i + beta * X_i + b

Here X_i is the vector of abundances X_ij for sample i, and b is the intercept.

Unlike simple logistic regression, a regularization penalty will be placed on beta to encourage sparsity (i.e. only a few taxonomic units will be important for discriminating autism) and to smooth the coefficients of taxonomic units that are neighbors in the taxonomic tree. So this is the more "correct" way of doing it.
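To make the idea concrete, here is a minimal Python sketch of one way such a penalized fit could be set up. It is not necessarily the method the team would use. It assumes the objective is the average logistic loss plus `lam_l1 * sum(|beta_j|)` plus `(lam_smooth / 2) * sum((beta_j - beta_k)^2)` over pairs (j, k) that are neighbors in the tree, and minimizes it by proximal gradient descent. The function names, penalty weights, step size, toy data, and edge list are all hypothetical.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero, exactly zero if |v| <= t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit(X, Z, y, edges, lam_l1=0.01, lam_smooth=0.1, step=0.1, iters=2000):
    """Logistic regression with a lasso penalty and neighbor smoothing on beta.

    X: abundances (n x p), Z: other covariates (n x q), y: 0/1 labels,
    edges: (j, k) index pairs of taxonomic units that are tree neighbors.
    alpha (for Z) and the intercept b are left unpenalized.
    """
    n, p, q = len(X), len(X[0]), len(Z[0])
    beta, alpha, b = [0.0] * p, [0.0] * q, 0.0
    for _ in range(iters):
        # Residuals: predicted probability minus observed label.
        resid = []
        for i in range(n):
            eta = b
            eta += sum(a * z for a, z in zip(alpha, Z[i]))
            eta += sum(c * x for c, x in zip(beta, X[i]))
            resid.append(sigmoid(eta) - y[i])
        # Gradient of the smooth part: logistic loss plus smoothing penalty.
        grad_beta = [sum(resid[i] * X[i][j] for i in range(n)) / n for j in range(p)]
        for j, k in edges:
            d = lam_smooth * (beta[j] - beta[k])
            grad_beta[j] += d
            grad_beta[k] -= d
        grad_alpha = [sum(resid[i] * Z[i][m] for i in range(n)) / n for m in range(q)]
        grad_b = sum(resid) / n
        # Gradient step, then soft-thresholding (the lasso part) on beta only.
        beta = [soft_threshold(beta[j] - step * grad_beta[j], step * lam_l1)
                for j in range(p)]
        alpha = [alpha[m] - step * grad_alpha[m] for m in range(q)]
        b -= step * grad_b
    return beta, alpha, b

# Hypothetical toy data: column 0 is a genus, column 1 a species within that
# genus, column 2 an unrelated species. Z is a single 0/1 covariate (e.g. sex).
X = [[0.60, 0.55, 0.10], [0.55, 0.50, 0.15], [0.65, 0.60, 0.05],
     [0.20, 0.15, 0.40], [0.25, 0.20, 0.35], [0.15, 0.10, 0.45]]
Z = [[0], [1], [0], [1], [0], [1]]
y = [1, 1, 1, 0, 0, 0]
edges = [(0, 1)]  # the genus and its species are neighbors in the tree

if __name__ == "__main__":
    beta, alpha, b = fit(X, Z, y, edges)
    print("beta:", [round(c, 3) for c in beta])
```

In practice the two penalty weights would have to be chosen by something like cross-validation, which is difficult with roughly 30 samples; the values above are placeholders.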

The practical issue is whether you would consider either or both of these methods too involved. Particularly if this is your thesis project, I imagine you might not want to go too deep into the mathematics... But our assessment is that this is perhaps what it will take to do it "right".

Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.