Comparison of unidentified sequences present in a sample (e.g., biological sample) against known sequences can provide useful information about the nature of the sequence. For instance, unidentified nucleic acids and/or proteins present in a biological sample can be sequenced and compared against known sequences to determine the identity and function of the nucleic acid and/or protein, as well as its source (i.e., organism). In this way, a sample containing an unknown mixture of nucleic acids and/or proteins obtained from an individual can be analyzed to determine whether any of the nucleic acids and/or proteins in the sample is foreign.
Existing sequence databases are already vast and continue to grow at an astonishing pace. For example, the Human Genome Project and other similar sequencing initiatives have resulted in an enormous amount of sequence information available in both private and public databases. Sequence similarity searching is used both to compare a single sequence against the sequences in a single database and to compare many new sequences against multiple databases. Furthermore, sequence alignment and database searches are performed worldwide thousands of times each day. Therefore, the importance of rapidly and accurately comparing new sequence data against such sequence databases is increasing.
Various programs and algorithms are available for performing database sequence similarity searching. For a basic discussion of bioinformatics and sequence similarity searching, see Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Baxevanis and Ouellette eds., Wiley-Interscience (1998) and Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin et al., Cambridge University Press (1998). The FASTA program was among the first algorithms used for conducting sequence alignment searching. (Lipman and Pearson, “Rapid and sensitive protein similarity searches,” Science, Vol. 227, pp. 1435-1441 (1985); Pearson and Lipman, “Improved tools for biological sequence comparison,” Proc. Natl. Acad. Sci., Vol. 85, pp. 2444-2448 (1988)). The FASTA program conducts optimized searches for local alignments using a substitution matrix. The program uses “word hits” to identify possible matches and then performs the more time-consuming optimization search. Another popular algorithm for sequence similarity searching is the BLAST (Basic Local Alignment Search Tool) algorithm, which is employed in programs such as blastp, blastn, blastx, tblastn, and tblastx. (Altschul et al., “Local alignment statistics,” Methods Enzymol., Vol. 266, pp. 460-480 (1996); Altschul et al., “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucl. Acids Res., Vol. 25, pp. 3389-3402 (1997); Karlin et al., “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes,” Proc. Natl. Acad. Sci., Vol. 87, pp. 2264-2268 (1990); Karlin et al., “Applications and statistics for multiple high-scoring segments in molecular sequences,” Proc. Natl. Acad. Sci., Vol. 90, pp. 5873-5877 (1993)).
The BLAST program identifies segments, optionally with gaps, that are similar between a query sequence and a database sequence, evaluates the statistical significance of all identified matches, and selects only those matches that satisfy a preset significance threshold.
More recently, a DNA sequence searching algorithm referred to as SSAHA has been described that reportedly conducts DNA sequence searches as much as three to four orders of magnitude faster than FASTA and BLAST (Mullikin et al., “SSAHA: A Fast Search Method for Large DNA Databases,” Genome Research, Vol. 11, pp. 1725-1729 (2001)). The SSAHA algorithm is based on organizing the DNA database into a hash table data structure using a two-bits-per-base binary representation. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is performed by obtaining from the hash table the “hits” for each k-tuple in the query sequence and then sorting the results to identify exact matches between k-tuples, which can be concatenated to identify the sequences in the database that match the query sequence.
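The hashing scheme described above can be illustrated with a minimal sketch. It is not the SSAHA implementation: the function names and the in-memory dictionary are hypothetical, and the sketch assumes that database sequences are cut into non-overlapping k-tuples while every overlapping k-tuple of the query is looked up.

```python
from collections import defaultdict

# Two-bits-per-base encoding; only the four valid bases can be represented.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_ktuple(ktuple):
    """Pack a k-tuple into an integer, two bits per base (hypothetical helper)."""
    value = 0
    for base in ktuple:
        value = (value << 2) | BASE_CODE[base]  # KeyError on any other character
    return value


def build_hash_table(database, k):
    """Map each encoded k-tuple to the (sequence index, offset) of each occurrence.

    Assumption: sequences are split into consecutive, non-overlapping k-tuples.
    """
    table = defaultdict(list)
    for seq_index, seq in enumerate(database):
        for offset in range(0, len(seq) - k + 1, k):
            table[encode_ktuple(seq[offset:offset + k])].append((seq_index, offset))
    return table


def search(query, table, k):
    """Collect the hits for every k-tuple of the query and sort them.

    Each hit is (sequence index, shift, offset), where shift is the database
    offset minus the query position. After sorting, hits that share a sequence
    index and shift lie next to each other and could be joined into longer
    exact matches.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        key = encode_ktuple(query[qpos:qpos + k])
        for seq_index, offset in table.get(key, []):
            hits.append((seq_index, offset - qpos, offset))
    hits.sort()
    return hits
```

For example, with the database `["ACGTACGTAC", "TTTTACGTGG"]` and k = 4, the query `"ACGT"` produces two hits in the first sequence and one in the second.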
However, SSAHA has several shortcomings. For example, although the two-bits-per-base representation employed is compact and efficient to process, it has the disadvantage that it cannot encode any characters other than the four valid bases (i.e., A, C, G, T). Therefore, when creating hash tables, SSAHA requires a user to choose between ignoring any unrecognized characters entirely or translating unrecognized characters into one of the four bases, resulting in potentially diminished sensitivity. Moreover, SSAHA is ill-equipped for searching protein sequences due to its two-bits-per-base representation. As another example, the SSAHA algorithm's sensitivity is limited by k-tuple size. That is, under no circumstances can the SSAHA algorithm detect a match of fewer than k consecutive matching bases between a query sequence and a subject sequence. In fact, SSAHA requires 2k−1 consecutive matching bases to ensure that the algorithm will register a hit in the matching region. Thus, for a k-tuple size of 15, SSAHA requires 29 consecutive matching bases to ensure that the algorithm will detect a hit in the matching region. In comparison, the default settings of FASTA and BLAST require at least 6 and 12 base pairs, respectively, to detect a match. SSAHA can be adapted to increase sensitivity by allowing one substitution between k-tuples, at a cost of an approximately 10-fold decrease in search speed. In addition, by modifying the hash table generation code so that k-tuples are hashed base-by-base, SSAHA can be adapted to guarantee that any run of k consecutive matching bases will be detected by the algorithm, at a cost of a k-fold increase in CPU time and in the size of the hit list L for a given k. In other words, for SSAHA's ideal k-tuple size of 15, increasing the sensitivity of SSAHA in this manner would result in a 15-fold increase in CPU time.
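The 2k−1 figure can be checked with a short sketch. It assumes, as an illustration rather than a statement of the SSAHA code, that database k-tuples begin only at offsets that are multiples of k, so that a matching region registers a hit only if it fully covers at least one such k-tuple. The function names are hypothetical.

```python
def contains_aligned_ktuple(start, length, k):
    """True if the region [start, start + length) fully covers a database
    k-tuple, assuming k-tuples begin only at multiples of k."""
    first_boundary = -(-start // k) * k  # smallest multiple of k >= start
    return first_boundary + k <= start + length


def always_detected(length, k):
    """True if a match of this length registers a hit at every possible
    alignment of its start relative to the k-tuple boundaries."""
    return all(contains_aligned_ktuple(start, length, k) for start in range(k))


def min_guaranteed_length(k):
    """Shortest match length that is detected regardless of alignment."""
    length = k
    while not always_detected(length, k):
        length += 1
    return length
```

Under this assumption, a match of 28 bases with k = 15 can be missed when it begins one base after a k-tuple boundary, whereas a match of 29 bases always covers a complete k-tuple, which agrees with the 2k−1 requirement stated above.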
Finally, SSAHA builds hash tables and stores them in active memory (RAM) every time the algorithm runs, which means that the preprocessing step of generating hash tables is performed every time a query is processed, resulting in reduced query processing speed.