9/22/14
Mon, 09/22/2014 at 3:09 AM
Customer: I have a dataset of serum binding on peptides (several thousand peptides) for a number of sera (between 40 and 200). I need an R script that can process the data in two ways: 1) Cluster analysis of similarity based on the data points: not on single peptides, but on the groups of peptides that together define how the sera fall into groups. 2) I will have another matrix containing data on the sera. This is a sparse matrix with just a few variables, which take discrete values. This matrix can then be used to separate the sera into groups. The analysis script must then be able to fish out which data points, or clusters of data points, from the large matrix together best explain/define the separation imposed by the sparse matrix. The script must be in R. Could you do this?
Mon, 09/22/2014 at 8:42 AM
AccuraScience LB: For the first portion of the work, multiple clustering or clustering-like methods can be used, including (but not limited to) hierarchical clustering, K-means clustering, self-organizing maps (SOMs), and even principal component analysis (PCA). Each of these methods needs some parameter tuning. For K-means clustering, for example, we would need to decide which value of K to use, and we would also need a reasonable scheme for choosing the K initial cluster centers.
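To make the options above concrete, here is a minimal base-R sketch of three of them (hierarchical clustering, K-means, and PCA) applied to the sera. It is an illustration under stated assumptions, not a finished analysis: the `binding` matrix is simulated, the two planted groups and all variable names are hypothetical, and correlation distance with average linkage is just one reasonable choice among several.

```r
set.seed(1)

# Hypothetical data: 500 peptides (rows) x 40 sera (columns).
n_pep  <- 500
n_sera <- 40
binding <- matrix(rnorm(n_pep * n_sera), nrow = n_pep,
                  dimnames = list(paste0("pep", seq_len(n_pep)),
                                  paste0("serum", seq_len(n_sera))))

# Plant two groups of sera, each defined by its own block of peptides.
group_a <- 1:20
group_b <- 21:40
binding[51:100, group_a] <- binding[51:100, group_a] + 3
binding[1:50,   group_b] <- binding[1:50,   group_b] + 3

# Hierarchical clustering of sera, using 1 - Pearson correlation
# between serum profiles as the distance.
d_sera    <- as.dist(1 - cor(binding))
hc        <- hclust(d_sera, method = "average")
hc_groups <- cutree(hc, k = 2)

# K-means on sera (rows = sera, columns = peptides). nstart repeats the
# random initialisation, which is one simple way to handle the choice
# of initial cluster centers.
sera <- t(binding)
km   <- kmeans(sera, centers = 2, nstart = 25)

# One heuristic for choosing K: total within-cluster sum of squares
# for a range of K values (look for an "elbow").
wss <- sapply(1:6, function(k) kmeans(sera, centers = k, nstart = 25)$tot.withinss)

# PCA: the first two components give a low-dimensional view of the sera.
pca    <- prcomp(sera, center = TRUE)
scores <- pca$x[, 1:2]
```

Comparing `hc_groups` with `km$cluster` (for example with `table()`) and plotting `scores` colored by cluster is a quick way to see whether the different methods agree on the grouping.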
The second portion of the work is often called "classification" in the machine learning field. It may be two-class or multi-class classification, depending on how many classes the samples are labeled with. Commonly used classification methods (or algorithms) include linear regression, logistic regression, decision trees, and support vector machines (SVMs). For this type of task, my experience is that you would want to split the data into two subsets, using one for "training" the classification model and the other for "testing", that is, evaluating the performance of the model.
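The train/test idea can be sketched in base R as follows. Everything here is an assumption made for illustration: the data are simulated, the single two-level `label` stands in for one variable of the customer's sparse matrix, and the classifier is a simple nearest-centroid rule on the peptides ranked highest by a Welch t-statistic, swapped in because it needs no extra packages. It is not one of the methods named above and not a recommendation.

```r
set.seed(2)

# Hypothetical data: 500 peptides (rows) x 60 sera (columns).
n_pep  <- 500
n_sera <- 60
binding <- matrix(rnorm(n_pep * n_sera), nrow = n_pep,
                  dimnames = list(paste0("pep", seq_len(n_pep)),
                                  paste0("serum", seq_len(n_sera))))

# Hypothetical discrete label per serum; 20 peptides carry the signal.
label       <- factor(rep(c("A", "B"), each = n_sera / 2))
informative <- 1:20
binding[informative, label == "B"] <- binding[informative, label == "B"] + 2

# Stratified split: 20 sera per class for training, the rest for testing.
train_idx <- c(sample(which(label == "A"), 20), sample(which(label == "B"), 20))
test_idx  <- setdiff(seq_len(n_sera), train_idx)
train_lab <- label[train_idx]

# Rank peptides using the training sera only, so the test set stays unseen.
t_stat <- apply(binding[, train_idx], 1, function(x) {
  unname(t.test(x[train_lab == "A"], x[train_lab == "B"])$statistic)
})
top          <- order(abs(t_stat), decreasing = TRUE)[1:20]
top_peptides <- rownames(binding)[top]

# Class centroids over the selected peptides (training sera only).
centroids <- sapply(levels(label), function(g) {
  rowMeans(binding[top, train_idx[train_lab == g], drop = FALSE])
})

# Assign each test serum to the class with the nearest centroid.
test_x <- binding[top, test_idx, drop = FALSE]
dist2  <- sapply(levels(label), function(g) colSums((test_x - centroids[, g])^2))
pred   <- factor(levels(label)[apply(dist2, 1, which.min)], levels = levels(label))

accuracy <- mean(pred == label[test_idx])
```

Here `top_peptides` is the part that addresses which peptides best explain the imposed separation, while `accuracy` on the held-out sera indicates whether that selection generalizes beyond the training set.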
Thus, the way I understand it, writing the code is a relatively minor part of this project: almost everyone here at AccuraScience can write the R code to your satisfaction once the methods are well defined. The more critical part of the work is choosing and testing comparable methods for each of the two portions and determining which methods work best. That will take more discussion between you and us, and perhaps some trial-and-error testing.
Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.
Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer’s privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.