Researchers in the biotechnology industry increasingly work with very large DNA databases. For example, the human genome is approximately 3 gigabases. Searching these databases has traditionally been done on dedicated servers because the search algorithms require substantial computing resources. Desktop computers are often a more versatile and convenient alternative, and they are now routinely equipped with hundreds of gigabytes (GB) of hard disk space and several GB of random access memory (RAM). The challenge is to harness this capacity by creating DNA analysis software that works efficiently in a multipurpose desktop environment.
A basic requirement for DNA analysis software is rapid searching of a DNA database to find all exact matches for a query sequence. The desired search speeds can only be achieved by indexing the database. One well-characterized indexing strategy is to generate a suffix tree (Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, 1997, incorporated herein by reference). Although suffix trees have been used productively for some molecular biology applications, such as aligning whole genomes (Kurtz, S., A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. L. Salzberg, Versatile and open software for comparing large genomes. Genome Biol. 5: R12, 2004, incorporated herein by reference), they consume large amounts of memory, up to 15 bytes or more per base. Suffix arrays are more compact than suffix trees and can provide similar search capabilities while requiring only 4-8 bytes per base (Abouelhoda, M. I., S. Kurtz, and E. Ohlebusch, Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2: 53-86, 2004, incorporated herein by reference).
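The exact-match capability of a suffix-based index can be illustrated with a minimal Python sketch. It is not the enhanced suffix array of the cited work: the construction is a naive sort chosen for brevity, and the function names `build_suffix_array` and `find_all` are hypothetical. The sketch only shows why a sorted list of suffix start positions lets every occurrence of a query of any length be found by binary search.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction (sorts whole suffixes); suitable for illustration only,
    not for genome-scale input.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, query):
    """Return every position where query occurs exactly in text."""
    m = len(query)

    # Lower bound: first suffix whose length-m prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with query.
    return sorted(sa[start:lo])


text = "GATTACAGATTACA"
sa = build_suffix_array(text)
print(find_all(text, sa, "ATTA"))  # [1, 8]
print(find_all(text, sa, "GGG"))   # []
```

Storing one integer position per base is what gives rise to a per-base memory cost of several bytes in an array of this kind.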
Non-suffix-based indexing strategies are currently in more widespread use for DNA databases. The SSAHA algorithm divides a DNA sequence into nonoverlapping K-mers (a K-mer is an oligonucleotide of K consecutive bases) and stores the positions of these K-mers in a hash table (Ning, Z., A. J. Cox, and J. C. Mullikin, SSAHA: a fast search method for large DNA databases. Genome Res. 11: 1725-1729, 2001, incorporated by reference). A similar indexing method is used by the BLAT algorithm (Kent, W. J., BLAT—The BLAST-like alignment tool. Genome Res. 12: 656-664, 2002, incorporated by reference). Both SSAHA and BLAT generate small indexes, on the order of 1 byte or less per base, and they can be orders of magnitude faster than BLAST or FASTA, which index the query sequence rather than the database (Pearson, W. R. and D. J. Lipman, Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85: 2444-2448, 1988; Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, Basic local alignment search tool. J. Mol. Biol. 215: 403-410, 1990; Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402, 1997, each reference incorporated by reference). SSAHA and BLAT have proven to be powerful for applications such as mapping sequence reads to a genome, or aligning mRNA sequences with the corresponding genomic DNA sequences. However, SSAHA and BLAT have limitations. Unlike suffix-based algorithms, which can identify all matches to any query sequence, SSAHA cannot detect a match of fewer than K bases, and it requires 2K−1 consecutive matching bases to guarantee that a match will be registered, because a shorter match need not contain any complete K-mer of the index. Because SSAHA sorts the search results, it achieves efficient searching by ignoring the K-mers that occur most frequently in the database. Similarly, BLAT sacrifices completeness for speed.
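The 2K−1 requirement can be seen in a small Python sketch of a nonoverlapping K-mer hash index. This is an illustration in the spirit of the indexing described above, not the published SSAHA or BLAT code; the function names `build_kmer_index` and `seed_hits` and the example sequence are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(db, k):
    """Map each nonoverlapping K-mer of db (offsets 0, k, 2k, ...) to its positions."""
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, k):
        index[db[pos:pos + k]].append(pos)
    return index


def seed_hits(index, query, k):
    """Look up every overlapping K-mer of the query in the index.

    Returns (query_offset, database_position) pairs for each K-mer found.
    """
    hits = []
    for q in range(len(query) - k + 1):
        for pos in index.get(query[q:q + k], ()):
            hits.append((q, pos))
    return hits


K = 4
db = "ACGTTGCAAGCTTAGC"          # indexed K-mers: ACGT, TGCA, AGCT, TAGC
index = build_kmer_index(db, K)

# A 6-base exact match (2K-2) that straddles two indexed K-mers is missed.
print(seed_hits(index, db[1:7], K))   # []

# A 7-base exact match (2K-1) must contain one complete indexed K-mer.
print(seed_hits(index, db[1:8], K))   # [(3, 4)]
```

Because only every K-th position of the database is indexed, the hash table holds roughly one entry per K bases, which is consistent with the small index sizes noted above; the cost is that matches shorter than 2K−1 bases can go undetected.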
These algorithms have generally been designed on the assumption that the complete index of a DNA database will be stored in main memory. Such algorithms are inconvenient for desktop applications, because the index might occupy much or all of the memory of a typical personal computer.
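As a rough worked example of this memory pressure, the per-base figures quoted above can be applied to a database of approximately 3 gigabases. The Python sketch below is back-of-the-envelope arithmetic only: it treats the quoted figures as fixed constants (the text gives them as approximate bounds), takes 1 GB as 10^9 bytes, and uses a hypothetical helper name `index_size_gb`.

```python
GENOME_BASES = 3_000_000_000  # approximately 3 gigabases

# Approximate bytes per base, taken from the figures quoted in the text.
BYTES_PER_BASE = {
    "suffix tree (up to 15 or more)": 15,
    "suffix array (low end)": 4,
    "suffix array (high end)": 8,
    "K-mer hash index (about 1 or less)": 1,
}


def index_size_gb(bases, bytes_per_base):
    """Estimated index size in GB, with 1 GB taken as 10**9 bytes."""
    return bases * bytes_per_base / 1e9


for name, bpb in BYTES_PER_BASE.items():
    print(f"{name}: about {index_size_gb(GENOME_BASES, bpb):.0f} GB")
```

Under these assumptions the suffix-based indexes come to roughly 12 to 45 GB and the K-mer hash index to roughly 3 GB, compared with the several GB of RAM found in a typical desktop computer.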