Phylogenetics is the reconstruction of the pattern of events that has led to the distribution and diversity of life. That distribution and diversity are represented by a phylogenetic tree structure.
A phylogenetic tree is a representation of the evolution of species. FIG. 1 is a diagram of a complete phylogenetic tree structure. A phylogenetic tree is also known as a cladogram or a dendrogram. The tree structure includes vertices and edges. An individual organism or species is represented by a vertex, such as 100. Vertices are also known as nodes. Time is represented along the vertical dimension of the tree, in which lower vertices represent later generations of life. Edges, such as 102, connect vertices. When edges converge at a vertex, the point of convergence is the common ancestral species or ancestor. In phylogenetic trees in which each vertex represents a species, that vertex represents the point at which two species diverge from a common ancestral species. All inner nodes have a degree of at least three. The degree of a node is the number of edges emanating from the node.
An approximation of how closely two species or individuals are related is determined from the distance between them. Any two species or individuals have a unique most recent common ancestor in the tree. Two species or individuals are closely related if the common ancestor is recent and distantly related if the common ancestor is remote. To calculate a measure of the distance between species or individuals, the distances to their common ancestor are summed. The distance between two vertices is the smallest number of edges between them. For example, where an ancestor 104 occurs two time units in the past, the distance between the two species or individuals, 106 and 108, is four because two edges are traversed between the common ancestor 104 and each of the species or individuals. The four edges traversed between vertices 106 and 108 are 110, 112, 102, and 114, and yield a distance of 4.
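The edge-count distance described above can be sketched with a breadth-first search. This is a minimal illustration, not the method of the specification: the adjacency list below is an assumption inferred from the reference numerals in the text (the figure itself is not reproduced here), and the names TREE and edge_distance are hypothetical.

```python
from collections import deque

# Hypothetical adjacency list. Vertex labels follow the reference numerals
# in the text, but the exact wiring is an assumption, not taken from FIG. 1.
TREE = {
    104: [116, 100],
    116: [104, 106, 118],
    100: [104, 108, 120],
    106: [116],
    118: [116],
    108: [100],
    120: [100],
}

def edge_distance(tree, start, goal):
    """Return the smallest number of edges between two vertices (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, dist = queue.popleft()
        if vertex == goal:
            return dist
        for neighbor in tree[vertex]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    raise ValueError("vertices are not connected")

# Two edges from 106 up to 104, then two edges down to 108.
print(edge_distance(TREE, 106, 108))  # 4
```

Because a tree has exactly one path between any two vertices, the search simply recovers that path's length; breadth-first search is used here only because it is short to write.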
Each individual has genetic properties represented by a string of characters or letters. The string of characters is represented by data in a corresponding vertex. For example, where a genetic property is a DNA sequence or sub-sequence, the property is represented by a string of characters selected from a group comprising the characters, “A,” “C,” “G” and “T.” For example, a parent vertex 100 might have the DNA sub-sequence “GATCTT” and a child vertex 108 might have the DNA sub-sequence “GATATT.”
Small mutations exist in the genetic properties between parent and child. The mutations are represented by differences in the character strings. When the topology of a phylogenetic tree is unknown, and thus the actual distance between vertices is unknown, the only way to reconstruct the tree is from the mutations between species or individuals. The more characters two organisms share, the closer they are presumed to be evolutionarily, and the closer together they should cluster on a phylogenetic tree.
Stochastic mutation models are used to model the mutations along a phylogenetic tree. Stochastic mutation models include the Cavender, Farris and Neyman (CFN) model, the Kimura 2-parameter model, the Kimura 3-parameter model, and even more general stochastic models.
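As an illustration of the simplest of these models, the CFN model treats each character as one of two states and flips it along an edge with some probability, independently of the other characters. The sketch below simulates a single edge under that assumption; the function name mutate_cfn, the flip probability of 0.1, and the sequence length are hypothetical choices made only for the example.

```python
import random

def mutate_cfn(sequence, flip_probability, rng):
    """Copy a binary sequence along one edge, flipping each site independently."""
    return [bit ^ 1 if rng.random() < flip_probability else bit for bit in sequence]

# Hypothetical parameters: a 20-site parent and a 10% flip chance on the edge.
rng = random.Random(0)  # seeded so that repeated runs give the same child
parent = [0] * 20
child = mutate_cfn(parent, 0.1, rng)
differences = sum(p != c for p, c in zip(parent, child))
print(differences, "of", len(parent), "sites changed along the edge")
```

Applying the same function once per edge, from the root downward, would generate sequences for a whole tree; the Kimura models follow the same pattern with four states and more than one substitution probability.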
FIG. 2 is a diagram of a reconstructed phylogenetic tree structure. The entire topology of the phylogenetic tree is reconstructed. Unknown nodes 100, 116 and 104 are reconstructed from known nodes 106, 118, 108 and 120. Reconstructing the entire topology requires the calculation to be performed from a fairly large portion of the character sequence of each known node. Examples of sequences that are used to reconstruct the topology are sequence 122 “CGCT” in node 106, sequence 124 “ACCT” in node 118, sequence 126 “ATAT” in node 108 and sequence 128 “ATTT” in node 120.
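To make concrete how the example sequences carry information about the topology, the sketch below counts mismatched positions (the Hamming distance) between each pair of known nodes. This is only one simple dissimilarity measure, chosen here as an assumption; the text does not state which measure a reconstruction method would use. The names SEQUENCES, hamming and PAIRWISE are hypothetical.

```python
from itertools import combinations

# Sequences 122, 124, 126 and 128, keyed by the known node that holds them.
SEQUENCES = {106: "CGCT", 118: "ACCT", 108: "ATAT", 120: "ATTT"}

def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Distance for every unordered pair of known nodes.
PAIRWISE = {
    (u, v): hamming(SEQUENCES[u], SEQUENCES[v])
    for u, v in combinations(SEQUENCES, 2)
}

for pair, distance in PAIRWISE.items():
    print(pair, distance)
```

Nodes 108 and 120 differ at a single position, the smallest distance among the six pairs, which is the kind of signal a reconstruction method uses to group them together.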
Conventionally, the number of characters in the sequence that is required to reconstruct the topology is a polynomial function of the number of known nodes. For example, if the polynomial function is n³, where n is the number of known nodes, then 64 characters in each sequence are used to reconstruct the topology from 4 known nodes because 64 is the cube of 4. In another example, 1000 characters in each sequence are used to reconstruct the topology from 10 known nodes because 1000 is the cube of 10. The polynomial function of the number of known nodes often yields a large number. This is problematic because a large number of characters is often not available from all of the known nodes.
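The arithmetic in the two examples above can be checked with a short sketch. The function name and the default exponent of 3 are illustrative assumptions; the text says only that the requirement is some polynomial function of the number of known nodes.

```python
def required_sequence_length(known_nodes, exponent=3):
    """Characters needed per sequence if the requirement grows as n**exponent."""
    return known_nodes ** exponent

for n in (4, 10, 100):
    print(n, "known nodes ->", required_sequence_length(n), "characters")
```

With 100 known nodes, the cubic example would already call for one million characters per sequence, which shows how quickly the requirement outgrows the data that is typically available.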
After the entire topology of a phylogenetic tree is reconstructed, the characters or sequence of each of the reconstructed nodes are estimated or inferred from the topology. Conventional methods of estimating the characters include the parsimony method and the maximum-likelihood (ML) method. The ML method is based on choosing the value of an unknown parameter under which the probability of obtaining the known samples is highest. Often, the topology of the phylogenetic tree is not known with a high degree of accuracy. Estimating the characters using the ML method from a less than highly accurate phylogenetic tree topology yields correspondingly inaccurate characters.
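As one illustration of the parsimony method, the sketch below runs the bottom-up pass of Fitch's small-parsimony algorithm. The text does not say which parsimony algorithm is meant, so the choice of Fitch's algorithm is an assumption, as are the rooted topology (inferred from the description of FIG. 2) and the names CHILDREN, LEAVES, fitch_site and parsimony_score. The pass produces a set of candidate characters for each reconstructed node and a mutation count, not a final character assignment.

```python
# Hypothetical rooted topology: internal vertex -> (left child, right child).
CHILDREN = {104: (116, 100), 116: (106, 118), 100: (108, 120)}
# Known nodes and their example sequences from the description of FIG. 2.
LEAVES = {106: "CGCT", 118: "ACCT", 108: "ATAT", 120: "ATTT"}

def fitch_site(node, site):
    """Return (candidate characters, mutation count) for one site at `node`."""
    if node in LEAVES:
        return {LEAVES[node][site]}, 0
    left, right = CHILDREN[node]
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:
        # The children agree on at least one character: no new mutation.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one mutation.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(root, length):
    """Sum the per-site mutation counts over the whole sequence."""
    return sum(fitch_site(root, site)[1] for site in range(length))

print(fitch_site(104, 0))       # candidate set {'A'} with 1 mutation
print(parsimony_score(104, 4))  # 5
```

If the assumed topology were wrong, the candidate sets and the score would change accordingly, which is the dependence on topology accuracy that the paragraph above describes.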
For the reasons stated above, and for other reasons stated below which will become apparent to those skilled in the art upon reading and understanding the present specification, there is a need in the art for reconstructing a tree topology from a smaller portion of the data of known nodes. There is also a need for improved accuracy in estimating data of the nodes of a phylogenetic tree from a less than highly accurate phylogenetic tree topology.