Protein structures are sets of solved atomic coordinates that represent the three-dimensional structure of a protein. Atomic coordinates may be solved computationally or experimentally using a variety of techniques such as X-ray crystallography, electron microscopy, and nuclear magnetic resonance.
The Protein Data Bank (PDB, Berman et al., 2000, available at the website of the Research Collaboratory for Structural Bioinformatics) contains over 40,000 experimentally-solved protein structures and is growing at a rate of 500 entries per month. This rate of growth is expected to increase as the technology and methodology for experimentally solving protein structures become more accurate and more readily available.
The rapid growth of protein structure knowledge has increased the need to organize experimentally-solved protein structures into families by clustering or categorizing them. When a new protein structure is solved, determining its cluster or family of homologous protein structures facilitates the rapid characterization of the function and properties of the newly-solved protein structure. The identification of families or clusters of protein structures also defines the set of homologous proteins from which conservation of the protein structures can be characterized.
Structural bioinformatics approaches to clustering protein structures have largely been based on single metrics of similarity between protein structures. The generation of a single similarity metric is complicated by the fact that proteins have multiple structural domains as well as an overall tertiary or quaternary structure. Approximately 40% of the protein structures in the PDB have multiple domains (Redfern et al., 2005). Protein structural domains may be shared among evolutionarily unrelated structures and do not always confer functional properties. Hence, proteins with overall similarity in structure, herein referred to as global similarity, may not have good local correspondence between domains. Conversely, proteins that have a high degree of local similarity due to evolutionarily conserved domains may not always have good global similarity due to structurally variable or unstructured regions, such as loops.
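The divergence between global and local similarity described above can be illustrated with a minimal sketch. The sketch is purely illustrative and is not the method of the invention. It assumes a distance-matrix RMSD (dRMSD) as a stand-in for a single similarity metric, and it uses synthetic random coordinates rather than real PDB entries. The function names `drmsd`, `random_domain`, and `place` are hypothetical. Two toy structures share two identical domains, but the second domain is rotated about a hinge in the second structure.

```python
import math
import random


def drmsd(a, b):
    """Distance-matrix RMSD between two equal-length coordinate lists.

    Compares every intra-structure pairwise distance, so no superposition
    is needed. Lower values mean more similar structures.
    """
    n = len(a)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


def random_domain(n, rng):
    """Synthetic 'domain': n points drawn from a unit Gaussian (assumption)."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


def place(domain, rotate, shift_x):
    """Optionally rotate a domain 90 degrees about z, then shift it along x."""
    placed = []
    for x, y, z in domain:
        if rotate:
            x, y = -y, x
        placed.append((x + shift_x, y, z))
    return placed


rng = random.Random(0)
domain_a = random_domain(30, rng)
domain_b = random_domain(30, rng)

# Same two domains in both structures; only the hinge orientation differs.
structure_1 = domain_a + place(domain_b, rotate=False, shift_x=10.0)
structure_2 = domain_a + place(domain_b, rotate=True, shift_x=10.0)

global_score = drmsd(structure_1, structure_2)         # whole structure
local_a = drmsd(structure_1[:30], structure_2[:30])    # first domain only
local_b = drmsd(structure_1[30:], structure_2[30:])    # second domain only

print(f"global dRMSD: {global_score:.3f}")
print(f"local dRMSD, domain A: {local_a:.3e}")
print(f"local dRMSD, domain B: {local_b:.3e}")
```

In this toy case each domain matches its counterpart almost exactly, while the whole-structure score is clearly nonzero, so a single global number would understate the domain-level correspondence.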
Due to the complexity introduced by structural domains, structural bioinformatics approaches to protein classification that have relied upon a single metric of similarity have failed to provide accurate clustering of protein structures. In fact, the protein structures in the Structural Classification of Proteins database (SCOP, Murzin et al., 1995, available at the website of the Medical Research Council Laboratory of Molecular Biology), a classification of protein structures widely used in the art of structural bioinformatics, are classified by human curators through visual inspection of the protein structure models in conjunction with structural bioinformatics analyses.
Thus, there is a need in the art for improved methods of automatically clustering or categorizing protein structures. The present invention addresses these shortcomings of the prior art.