Methods for phylogenetic analysis using biological sequences can be divided into two groups. The algorithms in the first group calculate a matrix representing the distance between each pair of sequences and then transform this matrix into a tree. The algorithms in the second group do not construct a tree directly; instead, they evaluate the fitness of different topologies to find the tree that best explains the observed sequences under the evolutionary assumption.
Some of the approaches in the first category utilize various distance measures which use different models of nucleotide substitution or amino acid replacement. [2] [28] [30] [32] [34] The second category can further be divided into two groups based on the optimality criterion used in tree evaluation: parsimony [8] [11] [13] [19] and maximum likelihood methods [15] [16] [18].
All of these methods require a multiple alignment of the sequences and assume some sort of evolutionary model. In addition to the problems in multiple alignment (computational complexity and the inherent ambiguity of the alignment cost criteria) and in evolutionary models (which are usually controversial), these methods become insufficient for phylogenies using complete genomes. Multiple alignment becomes misleading due to gene rearrangements, inversion, transposition and translocation at the substring level, unequal length of sequences, etc., and statistical evolutionary models are yet to be suggested for complete genomes. On the other hand, whole genome-based phylogenetic analyses are appealing because single gene sequences generally do not possess enough information to construct an evolutionary history of organisms. Factors such as different rates of evolution and horizontal gene transfer make phylogenetic analysis of species using single gene sequences difficult.
To overcome these problems, Sankoff et al. (1992) [51] defined an evolutionary edit distance as the number of inversions, transpositions and deletions or insertions required to change the gene order of one genome into another. Similar distance measures using rearrangement, recombination, breakpoint, comparative mapping and gene order have been extensively studied for applications to genome-based phylogeny. [6] [7] [23] [24] [29] [30] [31] [48] [49] [50] However, these approaches are computationally expensive and do not produce correct results on events such as non-contiguous copies of a gene on the genome or non-decisive gene order (as in mammalian mtDNA where genes are in the same order).
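As an illustration of a gene-order measure of the kind mentioned above, the following minimal Python sketch counts breakpoints between two gene orders. It is a simplified, assumed formulation (unsigned genes, linear chromosomes, identical gene sets, no duplicated genes) and is not the specific algorithm of any cited reference; the function name `breakpoint_distance` is hypothetical.

```python
from typing import Hashable, Sequence


def breakpoint_distance(order_a: Sequence[Hashable], order_b: Sequence[Hashable]) -> int:
    """Count adjacencies of order_a that are absent from order_b.

    Assumptions (illustrative only): genes are unsigned, each gene occurs
    exactly once, and both genomes are linear and contain the same genes.
    """
    # Collect the unordered neighbouring pairs of the second genome.
    adjacencies_b = {frozenset(pair) for pair in zip(order_b, order_b[1:])}
    # A breakpoint is a neighbouring pair in the first genome not found in the second.
    return sum(
        1
        for pair in zip(order_a, order_a[1:])
        if frozenset(pair) not in adjacencies_b
    )


if __name__ == "__main__":
    reference = ["g1", "g2", "g3", "g4", "g5"]
    inverted = ["g1", "g4", "g3", "g2", "g5"]  # segment g2..g4 inverted
    print(breakpoint_distance(reference, inverted))
    print(breakpoint_distance(reference, reference))
```

When two genomes carry their genes in the same order, as in the mammalian mtDNA case noted above, such a measure is zero for every pair and cannot resolve the phylogeny.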
Gene content was proposed by Snel et al. (1999) [52] as a distance measure in genome phylogeny where the similarity between two species is defined as the number of genes they have in common divided by their total number of genes. The general idea is further extended to identify evolutionary history and protein functionality. [20] [27] [38] [53] [54] Lin and Gerstein (2000) [38] constructed phylogenetic trees based on the occurrence of particular molecular features: presence or absence of either folds or orthologs throughout the whole genome. Tekaia et al. (1999) [55] used whole proteome comparisons in deriving genome phylogeny, taking into account the overall similarity and the predicted gene product content of each organism. However, such methods fail to work when the gene content of the organisms is very similar (again as is the case with mammalian mtDNA where the genomes contain exactly the same genes).
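The gene content similarity described above can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation: the function names are hypothetical, genes are represented as plain identifiers, and the "total number of genes" is assumed here to mean the number of distinct genes across both genomes; other normalizations could be substituted.

```python
from typing import Set


def gene_content_similarity(genes_a: Set[str], genes_b: Set[str]) -> float:
    """Shared genes divided by total distinct genes (assumed normalization)."""
    total = len(genes_a | genes_b)
    if total == 0:
        return 0.0  # two empty genomes: similarity is undefined, reported as 0
    return len(genes_a & genes_b) / total


def gene_content_distance(genes_a: Set[str], genes_b: Set[str]) -> float:
    """Turn the similarity into a distance in the range [0, 1]."""
    return 1.0 - gene_content_similarity(genes_a, genes_b)


if __name__ == "__main__":
    genome_x = {"geneA", "geneB", "geneC"}
    genome_y = {"geneB", "geneC", "geneD"}
    print(gene_content_distance(genome_x, genome_y))
    # Identical gene sets give a distance of zero regardless of sequence divergence.
    print(gene_content_distance(genome_x, set(genome_x)))
```

The second call shows the limitation noted above: genomes with exactly the same genes are indistinguishable under this measure.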
In the early 1990s, various data compression approaches were applied to the analysis of genetic sequences. [14] [21] [22] [41] [45] Data compression algorithms function by identifying the regularities in a given sequence, and in the case of DNA sequences, these regularities would have biological implications. Grumbach and Tahi (1993, 1994) [21] [22] coded exact repeats and palindromes in DNA along the lines of the Lempel-Ziv (LZ) compression scheme [59] and used an arithmetic coder of order 2 when such structures were lacking. Rivals et al. (1994, 1996) [44] [45] compressed those repeats that yielded a significant compression gain and introduced a second compressor that made use of approximate tandem repeats. Rivals et al. (1997) [46] also introduced a compression algorithm that locates and utilizes approximate tandem repeats of short motifs. Later approaches include those of Loewenstern and Yianilos (1999), Lanctot et al. (2000), and Apostolico and Lonardi (2000). [1] [35] [39] Grumbach and Tahi (1994) noted that the compression rate obtained by compressing sequence S using sequence Q would hint at some sort of a distance between the two sequences. [22] Although the proposed distance was not mathematically valid and had some other problems, it applied data compression to phylogeny construction.
Varre et al. (1999) [57] defined a transformation distance where sequence S is built from sequence Q by segment-copy, -reverse-copy and -insertion. The total distance is the Minimum Description Length among all possible sequences of operations that convert Q into S. This distance, like the one provided by Grumbach and Tahi (1994) [22], is asymmetric. Chen et al. (2000) [12] described a compression algorithm (GenCompress) based on approximate repeats in DNA sequences. The program is then used to approximate the distance proposed therein and the distance proposed by Li et al. (2001). [36] Ziv and Merhav (1993) [4] and Bennett et al. (1998) [6] provide a detailed analysis of information distance in statistical and algorithmic settings.
The distance proposed by Chen et al. (2000) and Li et al. (2001) is 1−[K(S)−K(S|Q)]/K(SQ), where K(S) is the Kolmogorov complexity of S, K(S|Q) is the conditional Kolmogorov complexity of S given Q and K(SQ) is the Kolmogorov complexity of the sequence S concatenated with Q. K(S|Q) is the length of the shortest program that outputs S when the input is Q on a universal computer and K(S) is K(S|_), where _ is the empty string. [12] [33] Kolmogorov complexity is an algorithmic measure of information (Li and Vitanyi, 1997) but it is a theoretical limit and generally can only be approximated. [37] In calculating the aforementioned distance, K(S|Q) is approximated by the length of the compressed result of S (using the program GenCompress) given Q.
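To make the approximation concrete, the following Python sketch evaluates the expression 1−[K(S)−K(S|Q)]/K(SQ) with compressed lengths substituted for the Kolmogorov complexities. It uses the general-purpose zlib compressor from the Python standard library rather than GenCompress, and it estimates K(S|Q) as the extra compressed bytes needed for S once Q has been seen; both choices are assumptions made for illustration and do not reproduce the cited programs. The function names are hypothetical.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a stand-in for K)."""
    return len(zlib.compress(data, 9))


def compression_distance(s: bytes, q: bytes) -> float:
    """Approximate 1 - [K(S) - K(S|Q)] / K(SQ) using compressed lengths."""
    k_s = compressed_length(s)
    k_sq = compressed_length(s + q)
    # Assumption: K(S|Q) is estimated as the additional bytes needed to
    # encode S after the compressor has already processed Q.
    k_s_given_q = compressed_length(q + s) - compressed_length(q)
    return 1.0 - (k_s - k_s_given_q) / k_sq


if __name__ == "__main__":
    rng = random.Random(0)
    s = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
    unrelated = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
    print(compression_distance(s, s))          # small: S is fully explained by itself
    print(compression_distance(s, unrelated))  # close to 1: nothing is shared
```

The value obtained depends on the compressor as much as on the sequences; for example, zlib only finds repeats within a 32 KB window, so longer inputs would need a different compressor.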
Benedetto et al. (2002) [3] used a similar idea where the relative complexity between sequences S and Q is approximated as it is done by Chen et al. (2000) [12], this time using gzip. However, methods that rely on the compressibility of a sequence using a compression package have an inherent flaw: both gzip and GenCompress are complicated programs, composed of multiple complex steps (algorithms to reduce search space, find exact/approximate matches, perform entropy coding, etc.), which affect the final complexity estimates in an ambiguous way. Therefore the properties of distance measures based on Kolmogorov complexity (implicitly or explicitly) do not necessarily hold for these approximations, and the resulting distance may be misleading depending on the performance of the compression algorithms on certain sequences.
Traditional methods for identifying organisms are based on phenotypic identification following the use of culture techniques. Clinical microbiology is currently undergoing a major transition to the use of molecular approaches. However, molecular approaches require the operator to select from among a list of probes or amplification primers for the identification process to proceed. In other words, the operator must have some predetermined idea as to the name or nature of the organism to be identified.
What would be beneficial is a system and method for phylogeny construction that does not require multiple alignment and is fully automatic. It would also be beneficial not to have to use approximations and assumptions in calculating the distance between sequences.