Genes are the blueprint for polypeptides and proteins, which in turn are composed of chains of amino acids. The genetic message, stored in nucleic acid form in DNA, is transcribed into messenger RNA ("mRNA") and is further decoded through translation to generate the specific amino acid sequence in protein biosynthesis. The resulting proteins typically fold up in solution. Genes and proteins interact in complex ways and play myriad roles in organisms. The Human Genome Project, which seeks to identify all genes in the human genome, has renewed interest in sequencing, i.e., in reading the code or blueprint represented by genes. But even when the many genes within the human genome are identified, it will still be necessary to understand gene functions within cells to make full use of such knowledge.
When a set of proteins has a common evolutionary history, this history is correlated with changes in function and structure from their common ancestors. If the evolutionary history can be identified, one can construct a phylogenetic tree. However, prior art techniques for phylogenetic tree inference are mostly techniques developed for DNA sequences, in which the presence of a relatively recent common ancestor permits simplified mathematical assumptions about how such sequences evolved over time.
The starting point for phylogenetic tree estimation is generally a multiple sequence alignment ("MSA"), such as shown in FIG. 1, which displays sequence alignment from different genes or proteins in an attempt to show common structure. MSAs may be constructed using various known techniques, and are publicly available from many sources including the Internet. If similarities can be perceived from MSA data, one may gain useful information as to common function and structure (or "conservation") between proteins.
Prior art techniques such as the distance-based or neighbor-joining method, the maximum likelihood method, and the parsimony method have been used for decades to infer phylogenetic relationships among species.
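By way of illustration only, the first step of a distance-based approach can be sketched as follows. The sketch computes a simple pairwise distance (the fraction of aligned, non-gap positions at which two sequences differ) and selects the closest pair as the first candidates to join. It is not the neighbor-joining method itself; the function names and the toy alignment are hypothetical and are not taken from any prior art implementation.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned, non-gap positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # assumption: no comparable positions means maximal distance
    return sum(x != y for x, y in pairs) / len(pairs)


def closest_pair(msa):
    """Return (name1, name2, distance) for the two most similar sequences."""
    best = min(
        combinations(sorted(msa), 2),
        key=lambda pair: p_distance(msa[pair[0]], msa[pair[1]]),
    )
    return best[0], best[1], p_distance(msa[best[0]], msa[best[1]])


# Hypothetical toy alignment: three aligned sequences of equal length.
toy_msa = {
    "seqA": "MKVLDE",
    "seqB": "MKVLDQ",
    "seqC": "MRILNQ",
}

if __name__ == "__main__":
    print(closest_pair(toy_msa))
```

In this toy alignment seqA and seqB differ at one of six positions and would be joined first. Note that every position contributes equally to the distance, regardless of its functional importance.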
Evolutionary processes can include speciation and gene duplication. In speciation, a gene produces a protein having a particular function, and over time slight differentiation occurs in the protein as it evolves in different species. Thus, in one type of phylogenetic inference, species trees are constructed with a topology in which the species form leaves (representing what is observed in the data) on a tree. Attempts are made to properly form the tree to show relationships, e.g., which is closer to man, the chimpanzee or the gorilla. Leaves may be joined to nodes or internal nodes, and the common ancestor will be found at the common root or root node of the tree. A species tree will reflect the evolutionary relationship between species.
In another type of phylogenetic inference, gene duplication events are modeled, in which paralogous genes or proteins are created. For example, globins in plants, in muscles, and in blood all stem from a common globin ancestor that antedated the split in speciation between plants and animals. Genes can be copied at any point along the species tree, with one gene retaining original functionality and a copied gene obtaining a different function. In constructing an evolutionary tree for paralogous genes, a gene tree results in which genes are clustered within subtrees, and in which a common ancestor gene (that was subsequently copied) appears at the tree root.
Prior art techniques simply do not perform well in attempting to form phylogenetic inferences where gene duplication has occurred. This shortcoming appears to result because certain positions in a protein are very important to the function of the protein, e.g., residues in certain protein positions are extremely important as they determine what that protein will interact with, or the specific protein function. In gene duplication, the proteins that the genes encode have a freeing of functional and sometimes structural constraints. For example, amino acid constraints can change at some positions in the molecule, while some positions remain perfectly constrained, while other positions have subfamily-specific conservation.
In a set of sequences, one may divide conservation types into a few different variants, for example general variants (all of the proteins have the same type of conservation) or subfamily-specific variants. For these two divisions, conservation types may be perfectly preserved or may be variant. When a position is conserved within each subfamily ("sf") but can differ across subfamilies, a key exists for a kind of functional specificity of that group of proteins. Thus, if within a group of proteins or DNA a particular residue (e.g., amino acid or DNA nucleic acid) is required to be maintained, it is so required for reasons of structure or function, and not by chance. Therefore, in terms of evolutionary distance, as diversity within a group containing this conservation key or signal increases, the signal will be a more important indicator that this position is significant.
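The distinction between general and subfamily-specific conservation can be illustrated with a minimal sketch. It assumes that the subfamilies are already known and that all sequences are aligned to the same length, and it simplifies the conservation types described above to three labels. The function names, labels, and toy sequences are hypothetical.

```python
def classify_column(column_by_subfamily):
    """Label one MSA column.

    column_by_subfamily maps a subfamily name to the list of residues that
    its member sequences show at this column.
    """
    all_residues = {
        residue
        for residues in column_by_subfamily.values()
        for residue in residues
    }
    if len(all_residues) == 1:
        return "general"  # the same residue in every sequence
    if all(len(set(residues)) == 1 for residues in column_by_subfamily.values()):
        return "subfamily-specific"  # conserved within each subfamily only
    return "variable"


def classify_alignment(subfamilies):
    """subfamilies maps a subfamily name to its equal-length aligned sequences."""
    length = len(next(iter(subfamilies.values()))[0])
    labels = []
    for i in range(length):
        column = {name: [seq[i] for seq in seqs] for name, seqs in subfamilies.items()}
        labels.append(classify_column(column))
    return labels


# Hypothetical toy data: two subfamilies, four aligned positions.
toy_subfamilies = {
    "sf1": ["GDKA", "GDRA", "GDTA"],
    "sf2": ["GEKV", "GESV"],
}

if __name__ == "__main__":
    for position, label in enumerate(classify_alignment(toy_subfamilies), start=1):
        print(position, label)
```

In the toy data, position 1 is conserved across all sequences, positions 2 and 4 are conserved within each subfamily but differ between subfamilies, and position 3 is variable.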
In creating a phylogenetic tree with a group of different subfamilies, it is known that subfamily-specific conservation positions will have functional, phylogenetic, or evolutionary constraints. Thus, in forming a phylogenetic tree topology, the similarity at such constrained positions should be reflected as being more important as compared with similarity at positions that are very variant. This distinction cannot be made by prior art techniques for constructing phylogenetic trees. Simply stated, prior art techniques cannot identify such subfamily-specific well conserved positions.
This shortcoming is aggravated by the reliance of prior art techniques upon substitution matrix-based methods in assessing penalties in joining two groups of proteins. Substitution matrices are used by biologists to form profiles of the amino acids expected in homologs (or relatives) of a set of protein sequences. Each position in the profile corresponds to a column in an MSA for the proteins, and reflects the amino acids expected among homologs not contained in the data in the MSA. These methods employ only the relative frequency of the amino acids at each position in the MSA, and ignore the actual number of amino acids observed. Because of this, a column containing a single D is treated identically to one containing one hundred Ds. Substitution matrices generalize every position to allow amino acids similar to the observed amino acids at a given position. Such techniques create a profile or probability distribution of the expected amino acids over all the amino acids. However, these techniques fail to identify functional or structural constraints within a phylogenetic tree under construction, and simply do not seek to identify a subfamily-specific conservation signal in determining what subtrees should be joined.
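The loss of information that results from using only relative frequencies can be seen in a short sketch. The function name is hypothetical, and the sketch shows only the frequency computation, not a substitution-matrix method.

```python
from collections import Counter


def relative_frequencies(column):
    """Relative frequency of each residue observed in one MSA column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


one_d = ["D"]            # a column with a single observed D
hundred_d = ["D"] * 100  # a column with one hundred observed Ds

if __name__ == "__main__":
    # Both columns reduce to the same frequency profile; the number of
    # observations supporting that profile is no longer visible.
    print(relative_frequencies(one_d))
    print(relative_frequencies(hundred_d))
```

Both columns yield the same profile, so a method that sees only the profile cannot distinguish weak evidence of conservation from strong evidence.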
Phylogenetic trees can be used to identify subfamilies. Phylogenetic trees describe the often complex evolutionary relationships among a set of sequences. Identifying subfamilies from such trees is challenging because the number of ways to cut a tree into subtrees is enormous, and it is computationally very expensive to examine all possible cuts. Compounding the problem is the typical lack of a priori knowledge of the correct number of subfamilies in the data. Phylogenetic trees can be cut into subtrees using known algorithms to define the subfamilies in the data. But although many phylogenetic tree construction algorithms are known, no automated methods for cutting a phylogenetic tree into subtrees to infer subfamilies in the data are found in the prior art.
Even when phylogenetic trees are employed in analysis, they are not used to help refine statistical models of subfamilies in the data. Employing all the pairwise relationships is currently computationally prohibitive, and with no automated ways to infer the subfamilies based on a phylogenetic tree, subfamily relationships are equally hard to incorporate into statistical models. While phylogenetic trees may find use in supplementing a scientific understanding, they have not found use in statistical model construction. Thus, although there is a growing need for automated sequence analysis tools for large-scale annotation of sequence databases, it is not presently known how to incorporate the information in phylogenetic trees into automated tools.
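The growth in the number of possible cuts can be illustrated with a short sketch. It assumes a rooted binary tree and defines a cut as a set of disjoint subtrees that together contain every leaf. Under that assumption, a node either is kept whole as one subtree or is split, in which case the cuts of its two children combine independently. The tree representation and function names are hypothetical.

```python
def count_cuts(tree):
    """Count the ways to cut a rooted binary tree into subtrees.

    A tree is either a leaf (a string) or a (left, right) tuple.
    """
    if isinstance(tree, str):
        return 1  # a single leaf can only be its own subtree
    left, right = tree
    # One way keeps this node whole; the rest split it into its children.
    return 1 + count_cuts(left) * count_cuts(right)


def balanced_tree(depth, prefix="s"):
    """Build a perfectly balanced binary tree with 2**depth labeled leaves."""
    if depth == 0:
        return prefix
    return (
        balanced_tree(depth - 1, prefix + "0"),
        balanced_tree(depth - 1, prefix + "1"),
    )


if __name__ == "__main__":
    for depth in range(1, 6):
        print(2 ** depth, "leaves:", count_cuts(balanced_tree(depth)), "cuts")
```

For balanced trees with 2, 4, 8, 16, and 32 leaves, the sketch counts 2, 5, 26, 677, and 458330 cuts respectively, so exhaustive examination quickly becomes impractical for alignments of realistic size.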
At best, in the past incorporating subfamily relationships into statistical models meant modeling each subfamily separately. This prior art approach was not optimal as much valuable information concerning important positions defining the common fold or function of all the proteins as a whole was simply discarded.
Thus, there is a need for an agglomerative technique to create hierarchical or phylogenetic trees from MSA data to attain a more accurate understanding of functional and/or structural similarity of evolutionary history for proteins under analysis. Preferably such a technique should be automated to recognize the importance of functional and/or structural constraints in an MSA. This information should be used to determine what subtrees should be joined together.
There is a need for a preferably automated technique to guide cutting a phylogenetic tree into subtrees, preferably as a function of encoding cost at each node in the phylogenetic tree. Further, there is a need to create statistical models from data representing subtrees formed from the decomposition of a phylogenetic tree. Such models should be able to identify and align remote homologs. Further, there is a need for a technique to provide a position-by-position analysis from MSA data starting from subfamily decomposition data in which subfamily conservation residue signals are employed.
The present invention provides such techniques as well as an automated system for their implementation.