In this section, we first survey the utility of comparative DNA sequence analysis and then survey the comparison methods in the prior art, pointing out the problems inherent in them.
Utility of Large-Scale DNA Sequence Comparisons
The utility of large-scale DNA sequencing and comparative analysis is such that the scientific debate is turning away from the question of whether to sequence genomes of additional organisms and toward the question of which organism should be selected next for sequencing (1, 2, 3, 4, 5). Cross-species comparisons of large genomic regions (8, 9, 10, 11, 12, 13), chromosomes (14), and whole genomes (15) clearly demonstrate the power of genome comparison tools for detecting coding exons in genomic DNA sequences of higher organisms and for comparative sequence assembly (16). Recent discoveries (17, 18) further validate early proposals (19, 3) to use the comparative approach for the discovery of upstream regulatory elements.
Comparative studies at different evolutionary distances reveal biological systems of different degrees of evolutionary age and conservation. For example, intra-mammalian comparisons such as that between human and mouse reveal the evolution and structure of the olfactory system (14) and of imprinting (20), and are generally considered to be one of the main sources of biomedical discoveries in the post-genome era (21). Intra-vertebrate comparisons help us understand the system of chromosomal sex determination (22), while the comparisons across a variety of eukaryotes, such as that between D. melanogaster, C. elegans, and S. cerevisiae, shed light on mechanisms involved in even more basic cellular and developmental processes (15). These comparative studies reveal that a significant number of human genes involved in disease have homologues even in unicellular eukaryotes such as S. cerevisiae (15), thus underscoring the power of the comparative method even when it is applied across large evolutionary distances. Comparative sequence analyses have also proven very fruitful in the understanding of microbes. Recent comparative analyses further reinforce the picture of very plastic genomes that readily take up foreign DNA thus acquiring properties such as antibiotic resistance and virulence (23).
Comparative studies include not only sequence-level comparisons but also the mapping of large-scale genomic rearrangements. Such comparisons use not only finished genomic sequences but also a variety of other comparative mapping techniques (24, 25, 26). Recent sequencing of the human genome and comparative analyses of whole chromosomal sequences (14, 27) reveal a rich structure of intra- and inter-chromosomal and repetitive rearrangements. Complex repetitive structures appear to play an important role in genomic rearrangements of significant clinical relevance (28).
Large-scale sequence comparisons are used in the overlap-detection stage of the standard sequence assembly method (29). Moreover, comparison of unassembled reads from one species (e.g., rat) against an assembled genome of another species (e.g., human) can facilitate comparative assembly, thus significantly reducing the total number of sequencing reads and reducing the time to cross-species comparison (16). The computational resources required for comparative assembly (i.e., the multi-CPU-year time required for comparison of multiple mammalian genomes by present methods) have been the main obstacle to implementing this advanced method.
New methods for exchanging annotation and comparative information are based on the Distributed Annotation System (DAS) (30) and are increasingly used. A major emerging problem is that annotations are typically made against different assemblies of the genomic sequence of the same organism; such assemblies must be compared in order to integrate the annotation information around the assembly of highest quality. Thus, effective methods for large-scale sequence comparison are becoming a bottleneck in the process of utilizing and sharing annotation information.
Problems With Current DNA Sequence Comparison Methods
While programs such as BLAST (31) and FASTA are well suited to querying large databases with a limited number of query sequences, they are too slow to handle whole-genome comparisons such as that between the complete genomic sequences of human, mouse, and rat. They are even less suited to comparing assembled genomes against a rapidly growing database of unassembled reads in publicly available trace archives.
To speed up the standard methods, computer clusters and computer farms, often comprising thousands of CPUs, have been employed. Among the most widely used are farms of Intel machines running the Linux operating system and using the PBS job scheduling system.
The mere addition of hardware does not, however, solve the crucial problem of standard methods: the time requirement that grows quadratically with the number of sequences that are compared in an all-against-all fashion. By multiplying CPUs, the quadratic time is simply divided by a larger number of machines, thus only partially ameliorating the problem at a significant computer hardware cost.
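The scaling argument above can be illustrated with a small calculation. The following is only a sketch under simplifying assumptions that are not part of the methods described: every pairwise comparison is taken to cost one unit of time, and the work is assumed to divide perfectly across CPUs. The function names are hypothetical.

```python
def pairwise_comparisons(n_sequences):
    """Number of unordered pairs in an all-against-all comparison."""
    return n_sequences * (n_sequences - 1) // 2


def wall_clock_units(n_sequences, n_cpus):
    """Idealized elapsed time: unit cost per pair, perfect load balancing (assumed)."""
    return pairwise_comparisons(n_sequences) / n_cpus


if __name__ == "__main__":
    # Doubling the number of sequences roughly quadruples the work,
    # whatever fixed number of CPUs the work is divided among.
    for n in (1_000, 2_000, 4_000):
        print(n, pairwise_comparisons(n), wall_clock_units(n, n_cpus=1_000))
```

Under these assumptions, keeping the elapsed time constant while the data set doubles would require roughly four times as many CPUs.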
Another approach to the problem is pursued by projects such as SSAHA at EBI/Sanger Centre, BLAT at UCSC, and ATLAS at Baylor College of Medicine. These approaches are all based on constructing an in-memory hash table that serves as a quick lookup index for constant-time similarity search. The main problem with such hashing-based approaches is that the whole hash table must reside in the Random Access Memory (RAM) of a single computer. This implies a lack of parallelism and a requirement for very large RAM.
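The general idea of a hash-table lookup index can be sketched as follows. This is a generic illustration of indexing fixed-length words (k-mers), not the actual implementation of SSAHA, BLAT, or ATLAS; the function names, the word length, and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every length-k word to the (sequence name, position) pairs where it occurs.

    The whole index is held in memory, which is the constraint discussed above.
    """
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return dict(index)


def find_seed_hits(index, query, k):
    """Look up each length-k word of the query; each lookup is one hash-table access."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, dpos))
    return hits


if __name__ == "__main__":
    database = {"chrA": "ACGTACGT", "chrB": "TTACGA"}  # toy data, hypothetical
    index = build_kmer_index(database, k=4)
    print(find_seed_hits(index, "TACGA", k=4))
```

The hits returned here are only exact word matches; a real tool would extend or chain them into alignments, a step this sketch omits.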
Two solutions to this problem are typically attempted. The first is to simply employ a single machine with large RAM (>20 GB). The main problem with this solution is that it does not utilize parallelism available on computer clusters, and thus does not scale well to multi-genome comparison problems.
A second solution that is often attempted is to break the sequence database into a number of subsets and to construct hash tables for each subset. The problem with this solution is a time requirement that grows quadratically with the size of the data set (albeit with a smaller constant factor than that encountered by the standard approaches such as BLAST and FASTA).
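Why partitioning leaves the growth quadratic can be seen from a rough count. The sketch below assumes a hypothetical cost model that does not come from any of the tools named above: each hash table holds at most `table_capacity` bases, every subset must be streamed against every subset's table, and the cost of one pass is the number of bases streamed.

```python
import math


def partitioned_work(total_bases, table_capacity):
    """Bases streamed through hash tables when the data set is split into subsets.

    Assumed cost model: one pass per (subset, table) pair, each pass streaming
    about one subset's worth of bases.
    """
    n_subsets = math.ceil(total_bases / table_capacity)
    passes = n_subsets * n_subsets
    return passes * min(table_capacity, total_bases)


if __name__ == "__main__":
    # Doubling the data quadruples the work; a larger table only shrinks the constant.
    for total in (8, 16, 32):
        print(total, partitioned_work(total, table_capacity=2),
              partitioned_work(total, table_capacity=4))
```

In this model the total is about the square of the data size divided by the table capacity, so more RAM per machine lowers the constant factor but not the quadratic growth.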
Thus, despite considerable effort being focused on this problem, there remains a significant need in the art to provide more efficient methods of data manipulation.