By the end of 2003, more than 23,000 protein structures had been deposited in the Protein Data Bank (PDB), with more being added each day. This surge in structural information has led researchers to seek efficient techniques to compare protein structures, detect motifs, classify proteins into specific families, etc.
The current dogma of genetics connecting sequences to structures (sequence→structure→function) suggests that studying sequences in greater detail would suffice to detect similarity among proteins and classify them. There are, however, several instances in which different sequences yield the same structure. Hence there is a concerted effort to work directly with the three-dimensional structure of proteins.
This overflow of structural information creates new needs: identifying similar structures and mapping them to families, and fast ways to detect similarity, identify motifs, find the longest contiguous alignments, etc.
Distance matrices are used in various kinds of protein structure-related work. DALI (proposed by Holm and Sander) is a well-known structure alignment algorithm that uses distance matrices. In DALI, the three-dimensional coordinates of each protein are used to calculate residue-residue (Cα-Cα) distance matrices. The distance matrices are first decomposed into elementary contact patterns, e.g., hexapeptide-hexapeptide submatrices. Similar contact patterns in the two matrices are then paired and combined into larger consistent sets of pairs. A Monte Carlo procedure is used to optimise a similarity score defined in terms of equivalent intramolecular distances. Several alignments are optimised in parallel, leading to simultaneous detection of the best, second-best, and subsequent solutions.
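As an illustration of the first two steps described above, the following minimal sketch computes a Cα-Cα distance matrix and extracts a hexapeptide-hexapeptide submatrix from it. It is not DALI's implementation. The function names (`distance_matrix`, `submatrix`, `mean_abs_difference`) and the toy coordinates are hypothetical, and the mean absolute difference is only a simple stand-in for comparing two contact patterns, not the similarity score DALI optimises.

```python
import math


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(coords)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(coords[i], coords[j])
    return d


def submatrix(d, i, j, size=6):
    """Extract the block pairing residues i..i+size-1 with j..j+size-1.

    With size=6 this corresponds to a hexapeptide-hexapeptide submatrix.
    """
    return [row[j:j + size] for row in d[i:i + size]]


def mean_abs_difference(a, b):
    """Stand-in comparison of two equally sized blocks (assumption:
    this is NOT the DALI score, only an illustrative measure)."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


# Toy input (assumption): 12 C-alpha positions spaced 3.8 angstroms
# apart along a straight line, standing in for real PDB coordinates.
toy_coords = [(3.8 * k, 0.0, 0.0) for k in range(12)]

dm = distance_matrix(toy_coords)
block_a = submatrix(dm, 0, 6)   # residues 0-5 against residues 6-11
block_b = submatrix(dm, 1, 7)   # the same pattern shifted by one residue
difference = mean_abs_difference(block_a, block_b)
```

Because the matrix holds only intramolecular distances, it does not change when the molecule is rotated or translated, so two structures can be compared without first superimposing them.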
A need exists, however, for an improved manner of processing protein structure information.