The field of bioinformatics lies at the intersection of computer science and molecular biology. Among other things, it deals with methods of processing and analysing genomic and proteomic information.
For the first time in our natural history, we have access to complete genomic sequences of H. sapiens, C. elegans, A. thaliana, D. melanogaster, M. musculus, S. pombe, S. cerevisiae, rice, dozens of prokaryote genomes, and hundreds of virus genomes (the initial sequences of the human genome, for example, may be found at the following references: International Human Genome Sequencing Consortium, Initial sequencing and analysis of the human genome, Nature 409, pp. 860-921, 2001, and J. C. Venter et al., The sequence of the human genome, Science 291, p. 1304, 2001). However, the potential of this enormous and exponentially growing wealth of information will be wasted if proper tools to mine it are not developed.
One class of crucial tools is homology search programs for finding similar regions within one or between two DNA sequences. Genomics studies routinely depend on such homology search tools. It is not surprising that many algorithms and programs have therefore been developed for the task, including the following:
FASTA; see D. J. Lipman, W. R. Pearson, Rapid and sensitive protein similarity searches, Science 227, pp. 1435-1441 (1985);
SIM; see X. Huang and W. Miller, A Time-efficient, Linear-Space Local Similarity Algorithm, Advances in Applied Mathematics, 12, pp. 337-357 (1991);
the Blast (Basic Local Alignment Search Tool) described by S. F. Altschul, W. Gish, W. Miller, E. Myers and D. J. Lipman, Basic local alignment search tool, J. Mol. Biol., 215, 403-410 (1990); and the family of related tools that it spawned, including WU-Blast, Psi-Blast, MegaBlast and BL2SEQ;
SENSEI; see a description by D. States on the SENSEI Web site at: http://stateslab.wustl.edu/software/sensei/;
MUMmer; see A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg, Alignment of whole genomes, Nucleic Acids Research, 27:11, 2369-2376 (1999);
QUASAR; see S. Burkhardt, A. Crauser, H-P. Lenhof, E. Rivals, P. Ferragina and M. Vingron, q-gram based database searching using a suffix array, 3rd Ann. International Conference on Computational Molecular Biology, Lyon 11-14, April 1999; and
REPuter; see S. Kurtz and C. Schleiermacher, REPuter: Fast computation of maximal repeats in complete genomes, Bioinformatics, 15:5, 426-427 (1999).
These existing search tools are far from adequate for the volume of biological sequence data currently available. For example, the best program currently available (Blast) would take almost 19 CPU-years to compare the human genome and the mouse genome on a modern personal computer. Other examples of the excessive times these programs require to perform a search are presented in Table 1 and Table 2 included hereinafter. Despite this slowness, Blast is not particularly sensitive: it misses many similarities, for the reasons explained hereinafter.
Clearly then, more sensitive and more efficient homology search tools are urgently needed.
Given two long DNA sequences, exhaustively comparing all bases against all bases is well known to be too slow. Two approaches have been used to improve the situation. The first is exemplified by Blast, which is used routinely by thousands of scientists. In this approach, a match between two short substrings of the two long DNA sequences is called a “seed match”, or a “hit”. The approach finds all the hits and tries to extend each hit into a longer alignment. However, when comparing two very long sequences, FASTA, SIM, Blastn (BL2SEQ), WU-Blast, and Psi-Blast run very slowly and need large amounts of memory. SENSEI and MegaBlast try to improve the running speed by sacrificing quality. MegaBlast, at its large seed length of 28, outputs low-quality alignments. SENSEI does not even do gapped alignments (a gap is a series of spaces inserted into one of the two sequences; in order to obtain a good alignment, several gaps very often need to be inserted into the two sequences). Thus, it is desirable both to improve the quality of hits and to reduce the running time of an analysis.
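The hit-then-extend strategy described above can be illustrated with a minimal sketch. It is not the implementation of Blast or of any program named herein: the function names `find_hits` and `extend_hit`, the +1/−1 scoring, and the drop-off threshold `xdrop` are all assumptions made for illustration, and the extension shown is ungapped only.

```python
from collections import defaultdict


def find_hits(s, t, k):
    """Return all (i, j) such that s[i:i+k] == t[j:j+k] (exact seed matches)."""
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)          # index every k-mer of t
    hits = []
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], ()):  # look up every k-mer of s
            hits.append((i, j))
    return hits


def extend_hit(s, t, i, j, k, match=1, mismatch=-1, xdrop=2):
    """Extend a hit without gaps in both directions.

    Extension in a direction stops once the running score has fallen more
    than `xdrop` below the best score seen (hypothetical scoring scheme).
    Returns (start_in_s, start_in_t, length, score) of the best extension.
    """
    best = score = k * match
    right = steps = 0
    a, b = i + k, j + k
    while a < len(s) and b < len(t) and best - score <= xdrop:
        score += match if s[a] == t[b] else mismatch
        steps += 1
        if score > best:
            best, right = score, steps
        a, b = a + 1, b + 1

    score = best
    left = steps = 0
    a, b = i - 1, j - 1
    while a >= 0 and b >= 0 and best - score <= xdrop:
        score += match if s[a] == t[b] else mismatch
        steps += 1
        if score > best:
            best, left = score, steps
        a, b = a - 1, b - 1

    return (i - left, j - left, left + k + right, best)


if __name__ == "__main__":
    s = "TTACGTACGTCC"
    t = "GGACGTACGTAA"
    for i, j in find_hits(s, t, 4):
        print((i, j), extend_hit(s, t, i, j, 4))
```

In this sketch every hit is extended independently, so overlapping hits on the same diagonal report the same extended region several times; a practical program would merge such duplicates and would follow the ungapped extension with a gapped alignment step.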
Programs that depend on the strategy of finding short seed matches and then extending them will be referred to herein as “Blast-type” programs. Blast-type programs exhibit a tradeoff between sensitivity and speed that is governed by the chosen seed size. That is, increasing the seed size reduces the time a search takes, but it also decreases sensitivity (that is, more true sequence matches are missed).
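The tradeoff can be made concrete under a simple model that is assumed here for illustration and is not taken from the references above: each position of a homologous region matches independently with probability p, and unrelated sequences consist of uniformly random bases. Under these assumptions, the sketch below computes the probability that a region contains at least one run of k consecutive matches (a proxy for sensitivity) and the expected number of chance hits between unrelated sequences (a proxy for running time). The function names are hypothetical.

```python
def hit_probability(k, length, p):
    """Probability that `length` independent positions, each matching with
    probability p, contain at least one run of k consecutive matches."""
    # state[r] = probability that no hit has occurred so far and the
    # current run of matches has length r (0 <= r < k)
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(length):
        nxt = [0.0] * k
        for r, pr in enumerate(state):
            nxt[0] += pr * (1.0 - p)      # a mismatch resets the run
            if r + 1 < k:
                nxt[r + 1] += pr * p      # a match extends the run
            # otherwise the run reaches k: a hit, which leaves these states
        state = nxt
    return 1.0 - sum(state)


def expected_random_hits(k, m, n):
    """Approximate expected number of chance k-mer matches between two
    unrelated uniformly random DNA sequences of lengths m and n."""
    return m * n * 0.25 ** k


if __name__ == "__main__":
    for k in (8, 11, 28):
        print(k, hit_probability(k, 64, 0.7), expected_random_hits(k, 10**6, 10**6))
```

Both quantities fall as k grows: fewer chance hits need to be extended, which saves time, but a genuinely similar region is also less likely to contain any hit at all.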
Another approach, exemplified by MUMmer, QUASAR and REPuter, is based on suffix trees. Suffix trees are standard data structures in computer science. A suffix tree serves as an index of a target string, so that exact matches of a query string can be found efficiently. The technique of finding sequence matches using suffix trees suffers from two major problems:
1. suffix trees are meant to deal with exact matches and are limited to comparison of highly similar sequences. They handle mismatches awkwardly because the data structure is not designed to allow for mismatches in the sequence; and
2. suffix trees have an intrinsically large space requirement.
Due to these obstacles, it is not expected that this approach will lead to practical homology software with quality comparable to Blast-type algorithms.
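Exact-match lookup through a suffix index can be sketched as follows. For brevity the sketch uses a suffix array, a closely related and more compact structure, rather than a true suffix tree, and it builds the array naively by sorting all suffixes, which is far too slow and memory-hungry for genome-scale input. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, sorted
    lexicographically (naive construction, for illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, sa, query):
    """Return the sorted positions at which `query` occurs exactly in `text`,
    using two binary searches over the suffix array `sa`."""
    m = len(query)

    # first suffix whose length-m prefix is >= query
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # first suffix whose length-m prefix is > query
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    target = "ACGTACGT"
    sa = build_suffix_array(target)
    print(find_exact(target, sa, "CGT"))
    print(find_exact(target, sa, "CGA"))
```

The second query differs from an occurring substring by a single base and finds nothing, which illustrates the first problem noted above: the index answers exact queries only, so tolerating mismatches requires additional machinery on top of it.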
In similarity searching, seeds need not be exact matches of short strings (a seed being a short match that is used to find a longer alignment). A number of techniques using other kinds of matches as seeds have been proposed, but all have serious shortcomings. For example:
1. locality-sensitive hashing (LSH), described by P. Indyk and R. Motwani in Approximate nearest neighbors: towards removing the curse of dimensionality, Proc. 30th Ann. ACM Symp. Theory Comput., 1998, Dallas, Tex., has been applied to ungapped homology search in J. Buhler, Efficient large-scale sequence comparison by locality-sensitive hashing, Bioinformatics, 17, 419-428 (2001). LSH is a random hashing/projection technique unsuitable for gapped homologies.
In Buhler, in each of hundreds of iterations, a newly chosen random hash function is applied to every region of a fixed size (of about 100), and regions mapping to the same value are fully compared. Similar overlapping regions are then merged into ungapped alignments. However, a long ungapped alignment can only be found if the regions found to be similar cover its whole length;
2. earlier than Buhler, a similar idea had been applied in Flash (see A. Califano and I. Rigoutsos, FLASH: fast look-up algorithm for string homology, Tech. Rep., IBM T. J. Watson Research Center, 1995), which used shorter regions. Both approaches focused on covering a homology entirely with hits, instead of doing hit extension in Blast style. The Flash authors tried to use many seeds to cover a region, using “randomly” generated tuples. Flash is aimed at fully covering an ungapped region, which is less efficient than using Blast-style hit extensions;
3. in two other references, the proposal is made to use periodically spaced probes in sequencing-by-hybridization studies (see F. P. Preparata, A. M. Frieze, and E. Upfal, On the power of universal bases in sequencing by hybridization, RECOMB, 1999, pp. 295-301 and F. P. Preparata and E. Upfal, Sequencing-by-hybridization at the information-theory bound: an optimal algorithm, RECOMB, 2000, pp. 245-253). However, all of their seeds have a predetermined pattern, 1^s (0^(s−1) 1)^u for some s and u (where 1^s means that s consecutive characters must match, 0^(s−1) means that s−1 consecutive characters need not match, etc.).
Clearly, these predetermined seed patterns are not optimal for general homology searches. Thus, given an arbitrary homology problem, this methodology will offer no improvement in processing speed or performance;
4. several programs, including SENSEI, Exonerate, and Blastn, may allow one mismatch within a consecutive seed match of length k. This has the same performance as the use of the k seeds with patterns 1^(i−1) 0 1^(k−i), for i = 1, ..., k. However, because these k seeds are highly dependent on one another, using them together slows down the search significantly, yet provides very limited improvement in sensitivity;
5. another program called BLAT, developed by Jim Kent, uses seeds with predetermined patterns such as 110110110, with 0's in every third position (i.e., whether there is a match at the 0 positions is ignored). This particular pattern is used in coding region analysis, where the third position is simply not as important as the first two.
In other words, Kent is merely teaching that the seed be designed to search for what the user wishes to find. Kent's teachings are therefore of no assistance in solving the general search problem, where the user does not know where the mismatches will lie (and, for the sake of the general search problem, does not care where the mismatches lie). Thus, this approach is basically the same as the consecutive seed scheme of Blast and it does not optimize the probability of a hit.
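The seed patterns discussed in items 3 to 5 above can be written as strings of 1's (positions that must match) and 0's (positions that are ignored). The following minimal sketch, with hypothetical function names, generates the periodic pattern of s ones followed by u repeats of (s−1 zeros and a one), generates the k one-mismatch variants of a consecutive seed of length k, and tests whether two sequences produce a hit under a given pattern. It illustrates the notation only and is not the implementation of any of the cited programs.

```python
def periodic_pattern(s, u):
    """Pattern of s ones followed by u repeats of (s-1 zeros, then a one)."""
    return "1" * s + ("0" * (s - 1) + "1") * u


def one_mismatch_patterns(k):
    """The k patterns of length k that have a single 0 at position i = 1..k."""
    return ["1" * (i - 1) + "0" + "1" * (k - i) for i in range(1, k + 1)]


def is_hit(a, b, i, j, pattern):
    """True if a[i:] and b[j:] agree at every '1' position of `pattern`.

    Positions marked '0' are not compared, so they may match or mismatch.
    """
    if i + len(pattern) > len(a) or j + len(pattern) > len(b):
        return False
    return all(c == "0" or a[i + p] == b[j + p] for p, c in enumerate(pattern))


if __name__ == "__main__":
    print(periodic_pattern(3, 2))
    print(one_mismatch_patterns(3))
    a = "ACGACGACG"
    b = "ACTACAACC"   # differs from a at every third position
    print(is_hit(a, b, 0, 0, "110110110"))
    print(is_hit(a, b, 0, 0, "111111111"))
```

In the usage example the two sequences differ only at every third position, so they produce a hit under the pattern 110110110 but not under nine consecutive 1's.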
Thus, all of the above attempts at handling local gapped alignments employ random hash functions, multiple predetermined patterns, or both. As explained above, none of them can improve both the sensitivity and the speed of the general homology search.
There is therefore a need for means of improving homology searching, provided with consideration for the problems outlined above.