Homology-based search vs. pattern-based search (1/5/2016)


Tue, Jan 5, 2016 at 6:42 PM

Customer: I am searching for a protein in the human proteome that would have very low sequence similarity to its relatives from other species. Based on BLAST searches and sequence similarity, the family of the protein I am looking for is broadly distributed among bacteria, plants, fungi and a few sea animals, but it is not found in mammals. We believe that this protein might be present in the human proteome, but because of low sequence similarity it does not appear in BLAST searches. The sequences in these protein families from different organisms have low similarity (some have ~10% sequence identity), which probably makes bioinformatics-based searches harder. I would like to ask whether you have previous experience using bioinformatics to search for something similar to what I described.

Wed, Jan 6, 2016 at 8:26 AM

AccuraScience LB: It would be quite unusual for a protein to exist in so many other kingdoms of life and in lower Metazoa but not in mammals. If a lower homology level is expected, you would have to loosen the BLAST search criteria, which in turn will produce a large number of candidate hits, of which all but a small handful are false positives. If this is what you have tried, and the number of candidate hits is just too large to sift through, then you have hit the limit of the homology-based approach, and would have to borrow power from other, non-homology-based strategies. For instance, would it make sense to narrow down the search to evolutionarily conserved genomic regions? Is there other knowledge about the protein that can be taken advantage of, e.g., is its DNA sequence expected to have higher GC content?

Wed, Jan 6, 2016 at 10:45 AM

Customer: I am really happy to hear that one would expect such a broadly distributed protein to be present in mammals.

We have tried to loosen the BLAST search criteria, but nothing that comes out of such a search makes sense. The hope is that there might be other methods that search with additional parameters, such as transmembrane helices and amino acid similarity in the transmembrane regions but not in the soluble domains of this protein.

The protein we are searching for (hopefully) belongs to a particular family, and the number of amino acids in different species ranges from a little over 100 up to 650. In 90% of eukaryotic organisms it has a topology of 9 transmembrane helices, where the first 4 and the last 4 have sequence homology. Helix #5 is usually a random transmembrane sequence that has no similarity between species. The motifs that have sequence similarity are mostly conserved in the last 4 helices (TM#6, TM#7, TM#8 and TM#9). The residues we would like to restrict during the search are:

In TM6:

-XXXXXXXG(A)XXXR- (G(A)XXXR motif should be at the end of helix # 6)

In TM7:

-PXGTXXXNXXXXXXX- (P should be the first amino acid of helix #7, and there should be 6 amino acids between P and N)

In TM8:

-XXXXXGXXXXLS(T)S(T)I(V,L)S(T)S(T)F- (this one might have hope for similarity-based searches because it is the most continuous region of high similarity among species. However, it should be in a transmembrane region, and BLAST-based similarity searches always return sequences where this region is not predicted to be in a TM helix)

In TM9:

XXYXXXS(T)XXXXXXX (YXXXS(T) region should be close to beginning of TM9)

As you can see, if this is all the essential information we have, and if the protein we are searching for is ~500 amino acids long, it would probably give only ~3% sequence similarity. So it just would not appear in a BLAST search.
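The ~3% figure can be checked by counting the constrained (non-X) positions in the four motifs above. This is only a small arithmetic sketch: it assumes each alternative set such as S(T) or I(V,L) counts as one constrained position, and it uses the approximate 500-residue length mentioned in the message.

```python
# Constrained (non-X) positions per motif, as listed in the message above.
# Assumption: an alternative set such as S(T) or I(V,L) counts as one position.
constrained = {
    "TM6": 2,  # G(A), R
    "TM7": 4,  # P, G, T, N
    "TM8": 8,  # G, L, S(T), S(T), I(V,L), S(T), S(T), F
    "TM9": 2,  # Y, S(T)
}
protein_length = 500  # approximate length assumed in the message

total = sum(constrained.values())
fraction = total / protein_length
print(f"{total} constrained positions / {protein_length} residues = {fraction:.1%}")
```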

Wed, Jan 6, 2016 at 2:27 PM

AccuraScience LB: Homology-based search will not work in this case. Rather, a pattern-based search would need to be defined and carried out.

We could write code to search all known human protein sequences for patterns similar to the ones you specified for TM6-9. Other criteria could be included in the pattern, e.g., if you could specify a distance range between the G(A)XXXR motif in TM6 and the PXGTXXXN motif in TM7, this would help. If not, we can at least specify that the first motif should occur upstream of the second. Any additional rules based on protein sequences can be defined. It might take a few iterations of refinement: the first set of rules may be too loose, so that too many potential hits are returned, or too stringent, so that no hits are returned. Then, once we get some hit sequences with proper annotation (name and description of each protein), it would take some manual work on your part to dig further into the structures, e.g., are those 9 TM domains in fact present?
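As an illustration only, the kind of pattern search described above might be sketched in Python as follows. This is not the code that was actually written for the project. The regular expressions are a direct transcription of the customer's four motifs (X becomes ".", alternatives such as S(T) become character classes); the function names and the `max_gap` value of 80 residues are hypothetical, and the toy sequence is synthetic. The sketch checks only motif order and spacing, not transmembrane topology, which would need a separate predictor.

```python
import re

# Motifs transcribed from the customer's description; X (any residue) is ".".
MOTIFS = [
    ("TM6", r"[GA].{3}R"),
    ("TM7", r"P.GT.{3}N"),
    ("TM8", r"G.{4}L[ST][ST][IVL][ST][ST]F"),
    ("TM9", r"Y.{3}[ST]"),
]

def motif_hits(pattern, seq):
    """Return (start, end) for every match, including overlapping ones."""
    rx = re.compile(f"(?=({pattern}))")  # lookahead so overlaps are reported
    return [(m.start(1), m.end(1)) for m in rx.finditer(seq)]

def find_ordered_hits(seq, max_gap=80):
    """Return chains of motif hits that occur in TM6 -> TM9 order.

    max_gap is a hypothetical upper bound on the number of residues between
    the end of one motif and the start of the next; it is a tunable guess.
    """
    chains = [[]]
    for name, pattern in MOTIFS:
        hits = motif_hits(pattern, seq)
        extended = []
        for chain in chains:
            prev_end = chain[-1][2] if chain else None
            for start, end in hits:
                # The first motif may sit anywhere; later ones must lie
                # downstream of the previous motif and within max_gap.
                if prev_end is None or 0 <= start - prev_end <= max_gap:
                    extended.append(chain + [(name, start, end)])
        chains = extended
        if not chains:
            return []  # one motif is missing or out of order
    return chains

# Synthetic toy sequence containing the four motifs in order (not a real protein).
toy = ("MLLL" + "GLLLR" + "LLLLL" + "PLGTLLLN" + "LLLLL"
       + "GLLLLLSTISTF" + "LLLL" + "YLLLS" + "LLL")
for chain in find_ordered_hits(toy):
    print(chain)
```

Because the TM6 and TM9 motifs are very short, they will match by chance in many proteins, so the ordering and gap rules (and any later topology check) do most of the filtering. Tightening or loosening `max_gap` is one way to carry out the iterations of refinement mentioned above.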

An alternative approach is, instead of defining sequence patterns, we could define structural patterns. The main limitation of the structural pattern-based approach is that it can only search among proteins with known structures.

Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and AccuraScience data analysis team at specified dates and times. The editing was made to protect the customer's privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.