Proteins fold into three-dimensional structures. The folding of a protein is determined by its amino acid sequence and by the protein's environment. Aligning protein structures is a subject of utmost relevance. It enables the study of functional relationships between proteins, and it is very important for homology and threading methods in structure prediction. Furthermore, by grouping protein structures into fold families and subsequently reconstructing trees, questions of ancestry and evolution may be unraveled. The importance of comparing protein structures can be illustrated by the DNA-binding homeodomains from two organisms separated by more than 1 billion years of evolution. The yeast α2 protein and the Drosophila engrailed protein are both regulatory proteins in the homeodomain family. Because they are identical in only 17 of their 60 amino acid residues, their relationship became certain only when their three-dimensional structures were compared.
Structure alignment amounts to matching two three-dimensional structures such that potential common substructures, e.g. α-helices, have priority. The latter is accomplished by allowing for gaps in either of the chains. Allowing sites within a chain to be permuted may also be beneficial. At first sight, the problem may appear very similar to sequence alignment, as manifested in some of the vocabulary (gap costs, etc.). However, from an algorithmic standpoint there is a major difference: the minimization problem is not trivial because of rigid body constraints. Whereas sequence alignment can be solved in polynomial time using dynamic programming methods (e.g. Needleman S. B. & Wunsch C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453), this is not the case for structure alignment, since the chains must be matched as rigid bodies. Hence, the scope of all structure alignment algorithms is limited to high quality approximate solutions.
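For contrast with the structural problem, the polynomial-time character of sequence alignment can be illustrated by a minimal sketch of global alignment by dynamic programming in the style of Needleman and Wunsch. The function name `needleman_wunsch` and the scoring values (match, mismatch and a linear gap cost) are assumptions chosen for illustration only; they are not taken from any particular method discussed here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not a standard matrix.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (n+1)(m+1) cells and each cell is filled in constant time, so the cost grows with the product of the two chain lengths. No comparable recurrence exists for structure alignment, because the score of a matched pair depends on a rigid body placement shared by all pairs.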
Existing methods for structure alignment fall into two broad categories, depending upon whether one (1) directly minimizes the inter-atomic distances between the two structures, or (2) minimizes the distance between substructures that are either pre-selected or supplied by an algorithm involving distances between atoms within each structure.
One approach in the first category is an iterative dynamic programming method (e.g. Laurents D.V. et al. (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr. Biol. 3:141-148; and Gerstein M. & Levitt M. (1996) Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures. In: Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology, Menlo Park, Calif.: AAAI Press). In this approach one first computes the distances between all pairs of atoms (e.g. Cα), one from each structure, to form a similarity matrix. Dynamic programming applied to this matrix, mimicking the sequence alignment procedure, yields an assignment matrix. One of the chains is then moved towards the other by minimizing the distance between assigned pairs, and the procedure is repeated. This method does not allow for permutations, since the internal ordering is fixed by construction. In another inter-atomic approach, the area rather than the distance between two structures is minimized (e.g. U.S. Pat. No. 5,878,373). In yet another approach, belonging to the second category, one compares the distance matrices computed within each of the two structures to be aligned, which provide information about similar substructures (e.g. Holm L. & Sander C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233:123-138; and Lu G. (2000) A new method for protein structure comparisons and similarity searches. J. Appl. Cryst. 33:176-183). The similar substructures are subsequently matched. In these methods, for instance those of Holm & Sander and of Lu, permutations can in principle be dealt with.
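The reason that distance matrices computed within each structure are informative can be shown with a small sketch: such a matrix is unchanged by rotating or translating the structure, so two structures can be compared without first superimposing them. The coordinates below are hypothetical Cα positions invented for illustration, and the helper names `distance_matrix`, `rotate_z_and_shift` and `matrix_difference` are assumptions; this is not the procedure of Holm & Sander or of Lu, only the underlying invariance.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between atoms within one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def rotate_z_and_shift(coords, angle, shift):
    """Apply a rigid body motion: rotate about the z axis, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


def matrix_difference(d1, d2):
    """Root-mean-square difference between two equally sized matrices."""
    n = len(d1)
    total = sum((d1[i][j] - d2[i][j]) ** 2 for i in range(n) for j in range(n))
    return math.sqrt(total / (n * n))


# Hypothetical Calpha coordinates (in angstroms) for a four-residue fragment.
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 1.0)]

# The same fragment after a rigid body motion.
moved = rotate_z_and_shift(fragment, angle=0.7, shift=(10.0, -4.0, 2.5))

# A genuinely different fragment: the last atom is displaced.
deformed = fragment[:3] + [(0.0, 3.8, 5.0)]

# Zero up to floating-point rounding: rigid motion preserves distances.
same = matrix_difference(distance_matrix(fragment), distance_matrix(moved))

# Clearly non-zero: the internal geometry has changed.
changed = matrix_difference(distance_matrix(fragment), distance_matrix(deformed))

if __name__ == "__main__":
    print(same, changed)
```

Because the comparison never fixes a residue ordering between the two chains, approaches built on this invariance can in principle accommodate permutations, at the price of a separate step that matches the similar substructures.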
However, both types of methodology mentioned above share certain implementation issues. One is the encoding of the structure (Cα and/or Cβ atoms of the chains). For many methods Cα appears to be sufficient, whereas in some cases Cβ is needed. The choice of distance metric is also a concern, since it must limit the influence of outliers.
Existing methods are useful for certain types of protein structure alignment problems and less useful for others. Some methods explore the space of possible alignments only partially, or cannot handle permutations efficiently. In addition, as mentioned above, the minimization problem for protein structure alignment is non-trivial because of the rigid body constraint. Accordingly, there is a need for a general method that not only provides an acceptable solution to the minimization problem but also offers high assurance in protein structure alignment and prediction, and is thereby applicable to a variety of problems.