The present invention relates generally to classifying and identifying polypeptides having similar structure or function based on comparative amino acid sequence analysis and more specifically to determining structure-related properties of a ligand when bound to a polypeptide of known amino acid sequence.
Structure determination plays a central role in chemistry and biology due to the correlation between the structure of a molecule and its function. In particular, a three dimensional model of a therapeutic target polypeptide can be of valuable assistance in the design or discovery of therapeutic drugs. The structure of a ligand bound to a polypeptide as observed in a three dimensional model can be used as a template for identifying structural properties to be incorporated into candidate drugs. Alternatively, using computer-assisted methods, a candidate drug can be identified based on structural properties that allow docking to a binding site in the three dimensional model of the target polypeptide, much as a key fits a lock. By structure-based methods such as these, lead compounds can be identified for further development.
Although methods for structure determination are evolving, it is currently difficult, costly and time consuming to empirically determine the three dimensional structure of a polypeptide. In general, determining such structures for polypeptides complexed with ligands is even more difficult. One approach to circumventing this difficulty is theoretical modeling of polypeptide structures with or without a bound ligand based on more readily available structural and functional information. Such theoretical modeling approaches are based on the tenet that the three-dimensional structure and function of a polypeptide are imparted by its amino acid sequence and the corollary that polypeptides with similar amino acid sequences have similar structure and function.
Theoretical determination of a three dimensional model for a polypeptide by ab initio methods remains relatively undeveloped. However, another theoretical approach, referred to as homology modeling, has been used to infer structure for a particular polypeptide by threading its amino acid sequence through or overlaying the sequence upon a three-dimensional model of a homologous polypeptide. The successful application of homology modeling to determining polypeptide structure relies upon choosing a correct polypeptide template for comparison. In most cases criteria for comparison are unavailable or unreliable.
Thus, there exists a need for efficient methods to identify homologous amino acid sequences and to identify structural or functional characteristics of a polypeptide based on its amino acid sequence. A need also exists for methods to determine ligand binding properties of polypeptides based on sequence information. The present invention satisfies these needs and provides related advantages as well.
The invention provides a method for separating two or more subsets of polypeptides within a set of polypeptides. The method includes the steps of: (a) determining a sequence comparison signature for each amino acid sequence in a set of amino acid sequences, wherein the sequence comparison signature includes pairwise comparison scores for the amino acid sequence compared to each of the other amino acid sequences in the set; (b) constructing a distance arrangement including the sequence comparison signatures related according to the distance between each of the sequence comparison signatures; and (c) identifying a first and second cluster of sequence comparison signatures in the distance arrangement, wherein the first cluster includes sequence comparison signatures for polypeptides having a similar protein fold or biological function, the protein fold or function being different compared to a protein fold or function of polypeptides having sequence comparison signatures in the second cluster.
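By way of illustration only, steps (a) through (c) of the separating method can be sketched as follows. The sketch rests on several assumptions that are not part of the method as described: the pairwise comparison score is a toy similarity ratio from Python's `difflib` standing in for a real sequence alignment score, each signature includes the sequence's score against itself so that all signatures share one index, the distance between signatures is Euclidean, and the clusters are found by a simple single-linkage merge. The function names and the six short sequences are hypothetical.

```python
from difflib import SequenceMatcher
from math import dist  # Euclidean distance, Python 3.8+


def pairwise_score(a, b):
    """Toy similarity in [0, 1]; an assumed stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def signature(seq, sequence_set):
    """Step (a): scores of one sequence against every sequence in the set."""
    return [pairwise_score(seq, other) for other in sequence_set]


def distance_arrangement(signatures):
    """Step (b): matrix of distances between every pair of signatures."""
    return [[dist(s, t) for t in signatures] for s in signatures]


def two_clusters(dmat):
    """Step (c): single-linkage merging until two clusters remain."""
    clusters = [{i} for i in range(len(dmat))]
    while len(clusters) > 2:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dmat[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] |= clusters[y]
        del clusters[y]
    return clusters


# Hypothetical sequences: indices 0-2 resemble each other, as do 3-5.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRE",
    "MKTAYIVKQRQISFVKAHFSRQ",
    "GSSGSSGLLPPEEWWDDNNCC",
    "GSSGSSGLLPPEEWWDDNNCA",
    "GSAGSSGLLPPEDWWDDNNCC",
]

signatures = [signature(s, sequences) for s in sequences]
clusters = two_clusters(distance_arrangement(signatures))
print(sorted(sorted(c) for c in clusters))
```

In this sketch the clustering operates on the signatures rather than on the raw pairwise scores, so two sequences fall into the same cluster when they relate to the rest of the set in a similar way.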
The invention also provides a method for identifying a member of a polypeptide family. The method includes the steps of: (a) determining a query sequence comparison signature for an amino acid sequence, wherein the query sequence comparison signature includes pairwise comparison scores for the amino acid sequence compared to each amino acid sequence in a set; (b) comparing the distance between the query sequence comparison signature and the sequence comparison signatures for other amino acid sequences in the set, wherein the sequence comparison signatures for other amino acid sequences in the set are clustered into polypeptide families; and (c) identifying a proximal cluster having one or more sequence comparison signatures that have a closer distance to the query sequence comparison signature than the sequence comparison signatures of a distal cluster, thereby identifying the polypeptide having the query sequence comparison signature as being a member of the polypeptide family for the proximal cluster.
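By way of illustration only, steps (a) through (c) of the family identification method can be sketched as follows. As assumptions not drawn from the description above, the sketch uses a toy `difflib` similarity ratio in place of a real alignment score, Euclidean distance between signatures, and a nearest-signature rule for choosing the proximal cluster. The function names, the sequences, and the family labels `family_A` and `family_B` are hypothetical, and the set is assumed to have been clustered into families beforehand.

```python
from difflib import SequenceMatcher
from math import dist  # Euclidean distance, Python 3.8+


def pairwise_score(a, b):
    """Toy similarity in [0, 1]; an assumed stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def signature(seq, sequence_set):
    """Scores of one sequence against every sequence in the set."""
    return [pairwise_score(seq, other) for other in sequence_set]


def assign_family(query, sequence_set, families):
    """Return the proximal family and the nearest distance to each family."""
    set_signatures = [signature(s, sequence_set) for s in sequence_set]
    # Step (a): signature of the query against the same set.
    query_signature = signature(query, sequence_set)
    # Step (b): distance from the query signature to each family's members,
    # keeping the closest member of each family.
    nearest = {
        name: min(dist(query_signature, set_signatures[i]) for i in members)
        for name, members in families.items()
    }
    # Step (c): the proximal cluster is the one with the closest signature.
    proximal = min(nearest, key=nearest.get)
    return proximal, nearest


# Hypothetical set, already clustered into two families by index.
sequence_set = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRE",
    "MKTAYIVKQRQISFVKAHFSRQ",
    "GSSGSSGLLPPEEWWDDNNCC",
    "GSSGSSGLLPPEEWWDDNNCA",
    "GSAGSSGLLPPEDWWDDNNCC",
]
families = {"family_A": [0, 1, 2], "family_B": [3, 4, 5]}

proximal, nearest = assign_family("MKTAYIAKQRQLSFVKSHFSRQ", sequence_set, families)
print(proximal, nearest)
```

The nearest-signature rule follows the wording of step (c), which requires only one or more signatures of the proximal cluster to be closer than those of the distal cluster; a centroid or average-linkage rule would be an alternative design choice.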