Advances in molecular biology have made possible the comparative study of proteins and the nucleic acids (DNA and RNA), which are repositories of hereditary (evolutionary and developmental) information. Nucleic acids and proteins are linear molecules made up of sequences of units (nucleotides in the case of nucleic acids, amino acids in the case of proteins), and these sequences retain considerable amounts of evolutionary information. By comparing two nucleic acids or two proteins with one another, the number of units at which they differ can be established. For example, the differences could be DNA base changes (e.g., adenine to guanine) or the insertion and/or deletion of nucleotides. The number and type of differences are indications of the recency of common ancestry. Because homologous nucleic acid or protein sequences occur in very different sorts of organisms, such comparisons can be made between them. For example, organisms as diverse as yeasts, pine trees, and human beings can be compared, since homologous nucleic acids are found in all three.
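As an illustration of counting the units that differ between two sequences, the following minimal sketch computes an edit distance in which base changes and insertions/deletions each contribute to the total. The function name `edit_distance` and the unit costs (1 per substitution, 1 per inserted or deleted unit) are assumptions made for this example; the text above does not prescribe a particular cost scheme.

```python
def edit_distance(a: str, b: str, sub_cost: int = 1, indel_cost: int = 1) -> int:
    """Minimum total cost of substitutions and insertions/deletions turning a into b.

    The default costs are illustrative assumptions, not values from the text.
    """
    # prev[j] holds the cost of converting the first i-1 units of a
    # into the first j units of b.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, unit_a in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, unit_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if unit_a == unit_b else sub_cost)
            delete = prev[j] + indel_cost
            insert = curr[j - 1] + indel_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# Made-up sequences: one base change (C to G) and one deleted nucleotide.
one_change = edit_distance("ACGT", "AGGT")
one_deletion = edit_distance("ACGT", "ACT")
```

Under these assumed costs, a smaller distance between two homologous sequences would be read as an indication of more recent common ancestry.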
The comparisons can be used to provide information not only about the topology of evolutionary history (cladogenesis), but also about the amount of genetic change that has occurred in any given lineage (anagenesis). For example, cytochrome c (a protein molecule) of humans and chimpanzees consists of the same 104 amino acids in exactly the same order, but it differs from that of rhesus monkeys by 1 amino acid, from that of horses by 11 additional amino acids, and from that of tuna by 21 additional amino acids. The degree of similarity is believed to reflect the recency of common ancestry. Thus, the inferences from comparative anatomy and other disciplines concerning evolutionary history can be tested in molecular studies of DNA and proteins by examining their sequences of nucleotides and/or amino acids.
A cladogram based on parsimony can be constructed from the relationships found between the sequences (of nucleic acids or proteins) of different organisms. Such a cladogram is a branching diagram representing the distribution of derived characters within a set of taxa (the units used in the science of biological classification), arranged such that the total number of evolutionary events is minimized. In the cladogram, the changes separating one taxon from another indicate the degree of relationship; i.e., closely related groups are located on branches close to one another.
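To make the idea of minimizing the total number of evolutionary events concrete, the sketch below counts the minimum number of state changes on a given rooted binary cladogram for sequences that are already aligned. It uses Fitch's set-based counting procedure, which is a technique chosen for this example and is not named in the text. The tree encoding (nested tuples of taxon names), the taxon names `t1` to `t4`, and the three-site sequences are all made up.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one aligned site on a rooted binary tree.

    tree   -- a taxon name (leaf) or a pair (left_subtree, right_subtree)
    states -- dict mapping taxon name to the unit observed at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state, so no extra change is needed.
        return shared, left_changes + right_changes
    # No state in common: one change must occur on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, aligned):
    """Total number of changes over all sites of equal-length aligned sequences."""
    n_sites = len(next(iter(aligned.values())))
    total = 0
    for site in range(n_sites):
        site_states = {name: seq[site] for name, seq in aligned.items()}
        total += fitch(tree, site_states)[1]
    return total


# Hypothetical aligned data for four taxa.
aligned = {"t1": "AAG", "t2": "AAG", "t3": "ACG", "t4": "TCG"}
tree_a = (("t1", "t2"), ("t3", "t4"))
tree_b = (("t1", "t3"), ("t2", "t4"))
length_a = tree_length(tree_a, aligned)  # 2 changes
length_b = tree_length(tree_b, aligned)  # 3 changes
```

With this toy data, the first arrangement requires fewer changes than the second, so a parsimony criterion would prefer it. The sketch assumes the sequences have equal length; handling insertions and deletions is the harder problem discussed in the remainder of this section.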
Determining the minimum length (cost) of a cladogram is known to be NP-complete (NP standing for nondeterministic polynomial time) when the internal nodal sequences are unknown and must be chosen so that the overall cladogram length is minimized. The difficulty becomes apparent when one considers how rapidly the number of possible internal sequences grows as the number of observed sequences increases. In principle, any sequence of length 1 up to the sum of the lengths of all terminal sequences, with any combination of nucleotides, may occur. Wang and Jiang (L. Wang & T. Jiang, On the complexity of multiple sequence alignment, Journal of Computational Biology 1:337–348 (1994)) discussed this in their proof of NP-completeness, and Wheeler (W. C. Wheeler, Alignment Characters, Dynamic Programming, and Heuristic Solutions, in Molecular Approaches to Ecology and Evolution, 2nd Edition, 243–251 (R. DeSalle & B. Schierwater, eds., Birkhäuser Verlag, Basel, Switzerland, 1998)) discussed it in terms of optimization alignment (W. C. Wheeler, Optimization Alignment: the end of multiple sequence alignment in phylogenetics, Cladistics 12:1–9 (1996)).
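The growth in the number of candidate sequences can be made concrete with a short calculation. The sketch below counts every sequence of length 1 up to the summed length of the terminals; the function name `candidate_count` and the default alphabet size of 4 (the four nucleotides) are assumptions made for this example.

```python
def candidate_count(terminal_lengths, alphabet_size=4):
    """Number of distinct sequences of length 1 up to the sum of the terminal lengths.

    alphabet_size=4 assumes nucleotide data; a protein alphabet would be larger.
    """
    max_len = sum(terminal_lengths)
    return sum(alphabet_size ** k for k in range(1, max_len + 1))


# Two hypothetical terminals of lengths 2 and 3: candidate lengths run from 1 to 5.
small_case = candidate_count([2, 3])  # 4 + 16 + 64 + 256 + 1024 = 1364
```

Each additional unit of length multiplies the largest term by the alphabet size, so the count grows exponentially with the total length of the observed sequences, and this count applies to every internal node of the cladogram.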
To find an exact solution, all possible sequences can be enumerated and tried at each internal node or, for example, a branch and bound method could be used. Such an approach would guarantee the optimal solution, but it is time-consuming and requires large amounts of computational resources.
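A minimal sketch of the exhaustive approach is given below for the simplest possible case: a single internal node joined directly to a handful of terminals. Every candidate sequence is enumerated and scored by its summed edit distance to the terminals. The function names, the unit substitution and insertion/deletion costs, and the toy terminals are assumptions made for this example, and a real cladogram would require the enumeration to be repeated jointly over all internal nodes.

```python
from itertools import product


def edit_distance(a, b):
    """Unit-cost edit distance (substitutions and insertions/deletions)."""
    prev = list(range(len(b) + 1))
    for i, unit_a in enumerate(a, start=1):
        curr = [i]
        for j, unit_b in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (unit_a != unit_b),  # substitute or match
                            prev[j] + 1,                        # delete from a
                            curr[j - 1] + 1))                   # insert into a
        prev = curr
    return prev[-1]


def best_internal_sequence(terminals, alphabet="ACGT"):
    """Try every sequence of length 1..sum(len(t)) and keep the cheapest one."""
    max_len = sum(len(t) for t in terminals)
    best_seq, best_cost = None, None
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            cost = sum(edit_distance(candidate, t) for t in terminals)
            if best_cost is None or cost < best_cost:
                best_seq, best_cost = candidate, cost
    return best_seq, best_cost


# Hypothetical terminals; 5460 candidates are scored even for this tiny case.
result = best_internal_sequence(["AC", "AG", "AC"])  # ("AC", 1)
```

The search is exact because no candidate is skipped, but the loop body runs once per candidate sequence, which is why the approach quickly becomes impractical for sequences of realistic length.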
Another approach to the estimation of cladogram lengths has been to construct a point estimate of the hypothetical ancestral sequences and then to use this estimate to determine an upper bound on cladogram length. For example, the coupled processes of multiple sequence alignment and separate phylogenetic reconstruction do this by establishing global, static homologies to deal with length variation and then using standard optimization techniques to estimate the internal node character states. Optimization alignment (W. C. Wheeler, Optimization Alignment: the end of multiple sequence alignment in phylogenetics, Cladistics 12:1–9 (1996)) takes a more direct (and explicit) approach to this estimation by establishing cladogram-specific homology schemes in a preliminary pass and constructing ancestral sequences in a second, up-pass. Optimization alignment usually yields better upper bounds on minimum tree (e.g., cladogram) length than multiple alignment methods. Another method is Fixed-Character State optimization, which estimates internal nodal sequences by requiring that they be drawn from the set defined by the terminals. In general, this yields less satisfactory cladogram lengths. In each of these procedures, a single hypothetical ancestral sequence is generated for each internal node of the cladogram. However, the procedures require large amounts of computational resources.
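The idea of restricting internal nodal sequences to the set defined by the terminals can be sketched as a dynamic program over the cladogram. The code below is an illustration in the spirit of that restriction, not a reproduction of any published implementation: it assumes a unit-cost edit distance between sequences, a tree encoded as nested tuples of taxon names, and made-up terminal sequences, and all function names are hypothetical.

```python
import math


def edit_distance(a, b):
    """Unit-cost edit distance (substitutions and insertions/deletions)."""
    prev = list(range(len(b) + 1))
    for i, unit_a in enumerate(a, start=1):
        curr = [i]
        for j, unit_b in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (unit_a != unit_b),
                            prev[j] + 1,
                            curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def fixed_state_length(tree, seqs):
    """Upper bound on tree length when every internal node must take a terminal sequence."""
    states = sorted(set(seqs.values()))
    dist = {(s, t): edit_distance(s, t) for s in states for t in states}

    def cost_vector(node):
        # cost_vector(node)[s] = cheapest cost of the subtree if node is assigned s.
        if isinstance(node, str):
            # A terminal is fixed to its own observed sequence.
            return {s: (0 if s == seqs[node] else math.inf) for s in states}
        child_vectors = [cost_vector(child) for child in node]
        return {
            s: sum(min(vec[t] + dist[s, t] for t in states) for vec in child_vectors)
            for s in states
        }

    return min(cost_vector(tree).values())


# Hypothetical terminals with length variation.
seqs = {"t1": "ACGT", "t2": "ACGT", "t3": "AGT", "t4": "AGT"}
length_a = fixed_state_length((("t1", "t2"), ("t3", "t4")), seqs)  # 1
length_b = fixed_state_length((("t1", "t3"), ("t2", "t4")), seqs)  # 2
```

Because only the observed sequences are tried at each internal node, the search space is small, but the resulting length can only equal or exceed the true minimum, which is consistent with the less satisfactory cladogram lengths noted above.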