The relationship between the amino acid sequence of a protein and its three-dimensional structure is at the very core of structural biology and bioinformatics. Although much structural data on proteins has been collected, there remains a need for a general algorithm for deducing the folding of a protein, i.e., its three-dimensional structure, from its amino acid sequence. Some successful approaches to predicting the three-dimensional structure and function of proteins have been based on the fact that the primary structures of a large number of proteins are currently known and, based on their similarities, can be organized into a smaller number of groups, or families. Proteins within the same family are presumed to share the same three-dimensional structure.
Proteins, or polypeptides, are amphiphilic polymers containing a mixture of polar and non-polar side chains. This physical property places an upper limit of approximately 300-400 amino acid residues on the size of the individual folded regions of a protein, called domains. Thus, only a few thousand unique folds within the domain regions are expected to occur naturally. Folding refers to the secondary structure of the proteins, i.e., α-helices, β-sheets and loops. Conservation of the three-dimensional structure of a protein, e.g., secondary or tertiary, typically correlates with conserved regions of the amino acid sequence defining the primary protein structure. Such conserved regions of the sequence are termed “signature” sequences, as they signify a given three-dimensional structure.
The identification of these signature sequences is often conducted using similarity search software, such as the FASTA, BLAST/PSI-BLAST, and Smith-Waterman programs. Such similarity search programs conduct direct pair-wise comparisons of a query sequence with every sequence present in a database. Alternatively, conserved sequence patterns in a set of multiple aligned sequences may be identified. If enough aligned sequences are available, they can be used to build a Markov model and a search engine suitable for searching databases for more instances of similar patterns.
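By way of illustration only, the pair-wise comparison described above may be sketched as a Smith-Waterman local alignment score computed between a query and each database sequence in turn. The scoring values (`match`, `mismatch`, `gap`), the function names, and the toy database below are assumptions made for this sketch; practical programs typically use an amino acid substitution matrix and more elaborate gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a flat match/mismatch score and a linear gap penalty
    (illustrative values, not taken from any particular program).
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # column 0 is always zero for local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_database(query, database):
    """Score the query against every database entry, best score first."""
    scored = [(name, smith_waterman_score(query, seq))
              for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical two-entry database, for illustration only.
    toy_db = {"p1": "MKAKIALGG", "p2": "WWWW"}
    for name, score in rank_database("KAKIAL", toy_db):
        print(name, score)
```

Because every database entry is scored against the query, the cost of such a search grows with the size of the database, which is one reason pattern-based alternatives are of interest.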
The pattern discovery algorithm, Teiresias, has been used to identify and build a very large collection of sequence patterns, or seqlets, by processing the GenPept database as a whole (the process is also routinely repeated at regular intervals on increasingly large installments of the SwissProt/TrEMBL database). For a discussion of the Teiresias algorithm, see, for example, Floratos, et al., U.S. Pat. No. 6,108,666, “Methods and Apparatus for Pattern Discovery in 1-Dimensional Systems”; Floratos, et al., U.S. Pat. No. 6,092,065, “Methods and Apparatus for Discovery, Clustering and Classification of Patterns in 1-Dimensional Event Streams”; Rigoutsos, I. and A. Floratos, “Combinatorial Pattern Discovery in Biological Sequences: the Teiresias Algorithm,” Bioinformatics, 14(1):55-67, 1998; and Rigoutsos, I. and A. Floratos, “Motif Discovery Without Alignment Or Enumeration,” Proceedings 2nd Annual ACM International Conference on Computational Molecular Biology, New York, NY, March 1998, the disclosures of which are incorporated by reference herein. Generally, each sequence pattern is a string of literals interspersed with zero or more “wild-cards.” The location of each literal can be occupied by either a unique amino acid or a small set of permitted amino acids, whereas the location of each wild-card can be occupied by any amino acid. Take for example the pattern [SEQ. ID. NO. 1]: {KR}.K{ILMV}{AG}L, wherein each literal that admits a set of permitted amino acids is shown bracketed, each literal that admits a unique amino acid is shown as a single letter, and each wild-card position is represented by the symbol “.”. This particular pattern describes all hexapeptides that begin with either a lysine or an arginine, followed by any one of the 20 amino acids, followed by a lysine, followed by any one of {isoleucine, leucine, methionine, valine}, followed by any one of {alanine, glycine}, and finally a leucine.
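As a minimal sketch of how a pattern in this notation may be matched against a sequence, the bracketed sets and wild-cards can be translated into a regular expression. This illustrates the pattern notation only and is not the Teiresias discovery algorithm; the helper names `pattern_to_regex` and `find_matches` and the example sequence are hypothetical.

```python
import re

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pattern_to_regex(pattern):
    """Translate a pattern such as {KR}.K{ILMV}{AG}L into a regex string."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "{":
            # Bracketed literal: any one of a small set of amino acids.
            end = pattern.index("}", i)
            parts.append("[" + pattern[i + 1:end] + "]")
            i = end + 1
        elif ch == ".":
            # Wild-card: any one of the 20 amino acids.
            parts.append("[" + AMINO_ACIDS + "]")
            i += 1
        else:
            # Literal admitting a unique amino acid.
            parts.append(re.escape(ch))
            i += 1
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (start, matched_text) for every, possibly overlapping, match."""
    # A lookahead is used so that overlapping instances are all reported.
    regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical sequence containing two instances of the pattern.
    print(find_matches("{KR}.K{ILMV}{AG}L", "MKAKIALGGRTKVGLQ"))
```

In this sketch the start positions are zero-based offsets into the sequence.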
The patterns contained in this collection, known as the Bio-Dictionary, have been found to identify structural and functional properties that cross protein family boundaries. The Bio-Dictionary pattern collection nearly completely covers the currently known sequence space of natural proteins and can thus be used in lieu of the original sequence database for applications such as similarity searching, protein annotation, and gene finding.
The transmembrane helices of polytopic proteins are common building elements of many large, biologically important structures, such as tissue-specific or ligand-specific receptors (or both) and enzymes. Non-canonical conformations occur frequently in these helices and are critical determinants of their structure and function. Unfortunately, the structural study of such proteins has been hindered by the inability of researchers to successfully crystallize samples for analysis, so observing the three-dimensional structure of these non-canonical regions has been a challenge. Focus has recently shifted to the analysis of amino acid sequences, i.e., the primary structure of these proteins, following the discovery that the non-canonical conformations and the respective sequences encoding them are often conserved. However, the sequences encoding non-canonical conformations are generally only a few amino acid residues in length. Thus, traditional approaches using sequence similarity tools or Markov models are ineffective, as are the traditional secondary structure prediction methods (e.g., some of the public prediction servers suggested a β structure in place of a helical kink). It would thus be beneficial to have a method for analyzing polytopic proteins, and specifically the non-canonical conformations within those proteins, from the primary protein structure.