1. Field of the Invention
This invention relates to the field of searching strings of tokens in a database. Specifically, the invention finds sequences of tokens in a database similar or identical to a predefined reference sequence of tokens.
2. Description of the Prior Art
There are many conventional techniques for finding occurrences of a particular sequence of tokens, called a reference sequence or reference string, within a database of many strings of tokens. (A token is a symbol, such as a letter, word, sound, bit pattern, or other descriptive designation, which for our purposes can appear in a sequence with other tokens.) Some of these techniques were developed to perform specific tasks, such as finding an exact or similar sequence of specific tokens, e.g., nucleotides (or amino acids), in a long string of nucleotides (or amino acids) comprising a DNA (or protein) molecule. (Two sequences are similar if they can be made identical by inserting, deleting, or modifying fewer than a preset number of tokens in one of the sequences.) These conventional matching techniques include the Needleman-Wunsch algorithm, the original Wilbur-Lipman algorithm, FASTA, FASTP, and BLAST.
The Needleman-Wunsch algorithm is a dynamic programming technique. All tokens in the two sequences to be compared are considered pairwise to compute all possible candidate alignments between the two sequences. A cost value is assigned to each deletion, insertion, and modification. The alignment that produces the smallest global cost value is then chosen. This is an expensive technique, since the amount of computation required is proportional to the product of the lengths of the two sequences being compared.
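The dynamic programming recurrence described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch`, the default unit costs `gap_cost` and `mismatch_cost`, and the assumption that each token is a single character are all choices made for the example.

```python
def needleman_wunsch(a, b, gap_cost=1, mismatch_cost=1):
    """Return (cost, aligned_a, aligned_b) for a minimal-cost global alignment.

    Assumes tokens are single characters; '-' marks an inserted gap.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost

    def sub(i, j):
        # Cost of pairing a[i-1] with b[j-1]: zero if equal, else a modification.
        return 0 if a[i - 1] == b[j - 1] else mismatch_cost

    # Every cell (i, j) is filled once, hence the len(a) * len(b) work.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub(i, j),  # match or modification
                cost[i - 1][j] + gap_cost,       # deletion from a
                cost[i][j - 1] + gap_cost,       # insertion into a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap_cost:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

With unit costs, the returned cost is the number of insertions, deletions, and modifications separating the two sequences, so it can be compared directly against a preset similarity limit.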
The Wilbur-Lipman algorithm compares contiguous tuples of small length in the original and reference strings. Tuples are matched for both sequences using a look-up table that is created from the reference string. The score for each candidate match is computed and the best score is selected. A new look-up table must therefore be created each time a new reference sequence is compared against the database. Since the entire set of original strings must be checked against the look-up table, the amount of computation required to match against a database containing a total of 2N nucleotides or amino acids will be double that required for a database with only N nucleotides or amino acids. In other words, the number of comparisons against the look-up table is at least equal to the total number of nucleotides (amino acids) present in all the original strings.
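The tuple look-up scheme described above might be sketched as follows. The names `build_lookup` and `best_diagonal`, the tuple length `k`, and the choice of scoring a candidate match by the number of tuple hits on one diagonal (a fixed offset between the two strings) are simplifying assumptions made for illustration; the published algorithm scores candidates in more detail.

```python
from collections import Counter, defaultdict


def build_lookup(reference, k):
    """Map every k-tuple of the reference string to its start positions."""
    table = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        table[reference[pos:pos + k]].append(pos)
    return table


def best_diagonal(original, table, k):
    """Scan one original string against the reference look-up table.

    Returns (hits, offset, lookups): the largest number of k-tuple hits that
    share one offset, that offset (None if there were no hits), and the
    number of table look-ups performed.
    """
    diagonals = Counter()
    lookups = 0
    for pos in range(len(original) - k + 1):
        lookups += 1  # one look-up per tuple position in the original string
        for ref_pos in table.get(original[pos:pos + k], ()):
            diagonals[pos - ref_pos] += 1
    if not diagonals:
        return 0, None, lookups
    offset, hits = max(diagonals.items(), key=lambda kv: kv[1])
    return hits, offset, lookups
```

The `lookups` counter makes the cost argument visible: it grows with the length of every original string scanned, regardless of how few of those strings actually contain a match.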
The FASTP and FASTA algorithms are refinements of the original Wilbur-Lipman technique. Increased sensitivity is achieved by means of a replaceability matrix to score the alignments. Mutations that appear frequently in evolution (deletions, insertions, and replacements of nucleotides) are given better scores, while less frequent ones are given worse scores. The nature of the approach, however, is still sequential.
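To illustrate the idea of replaceability scoring, the sketch below scores an ungapped nucleotide alignment so that the more frequently observed replacements (transitions, A/G and C/T) are penalized less than the rarer ones (transversions). The function names and the numeric scores are toy values chosen for this example; they are not the matrices used by FASTP or FASTA.

```python
# Purine/purine and pyrimidine/pyrimidine replacements (transitions).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def replace_score(x, y, match=2, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides (illustrative values only)."""
    if x == y:
        return match
    if frozenset((x, y)) in TRANSITIONS:
        return transition  # frequent replacement: milder penalty
    return transversion    # rarer replacement: harsher penalty


def score_ungapped(a, b):
    """Sum the pair scores of two equal-length, already aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(replace_score(x, y) for x, y in zip(a, b))
```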
The BLAST technique performs an in-depth comparison of the original and reference sequences only if they satisfy an initial minimal similarity test, which can be performed very quickly. This is done by heuristically determining whether the length of the MSP (maximal segment pair) is above a given threshold. The MSP is the pair of identical-length substrings of the reference string and sequence string that has the best score for mutations. If this test is successful, a more complete and costly similarity analysis is performed using FASTP-FASTA type algorithms. This reduces the amount of computation, at the risk of missing some matches that do not satisfy the initial criteria. About 20% of the similarities detected with the Needleman-Wunsch algorithm are not picked up by BLAST. Also, the approach remains inherently sequential, since some computation must be performed for each token in the set of original strings.
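The notion of a maximal segment pair can be made concrete with the sketch below. It is an exhaustive search over every diagonal, not the heuristic that BLAST uses, and the names `maximal_segment_pair` and `passes_prefilter`, the match and mismatch scores, and the choice of thresholding the segment score are assumptions made for the example.

```python
def maximal_segment_pair(reference, original, match=1, mismatch=-1):
    """Return (score, ref_start, orig_start, length) of the best ungapped
    pair of equal-length substrings, or (0, 0, 0, 0) if nothing scores > 0.
    """
    best = (0, 0, 0, 0)
    n, m = len(reference), len(original)
    # A diagonal is a fixed offset between positions in the two strings.
    for offset in range(-(n - 1), m):
        i = max(0, -offset)   # position in reference
        j = i + offset        # position in original
        run_score, run_start = 0, i
        while i < n and j < m:
            if run_score <= 0:
                # A non-positive prefix never helps: restart the segment here.
                run_score, run_start = 0, i
            run_score += match if reference[i] == original[j] else mismatch
            if run_score > best[0]:
                best = (run_score, run_start, run_start + offset,
                        i - run_start + 1)
            i += 1
            j += 1
    return best


def passes_prefilter(reference, original, threshold):
    """Quick test deciding whether the costly full comparison is worth running."""
    return maximal_segment_pair(reference, original)[0] >= threshold
```

Because the function also returns the segment length, a length-based threshold could be substituted for the score-based one in `passes_prefilter`. Either way, pairs rejected by the prefilter are never examined in depth, which is how genuine but weak similarities can be missed.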
3. Statement of Problems with the Prior Art
The prior art has been successful in efficiently comparing two token sequences (the tokens specifically being nucleotides or amino acids) on a one-to-one basis, i.e., sequentially. However, much of the prior art has difficulty in finding all or even most of the possible matches of a reference sequence of tokens in a database of original token strings without performing some computation on each or most tokens in the original sequences. Current computer technology is unable to perform these tasks on very large databases within a reasonable amount of time.
Accordingly, there has been a long-felt need for an indexed method of determining a similar or exact match between a reference string of tokens and a sequence of tokens in one or more original strings of tokens in very large databases. There has also been a need to quickly and efficiently determine the location of these similar or identical sequences on the original string of tokens and the degree of similarity these sequences have to a reference string. Specifically, in the area of genome mapping, there has been a long-felt need for a method, using current computer technology, to detect similarities among nucleotide sequences in a database containing up to 4 billion nucleotides.
The prior art fails to quickly and efficiently locate sequences of tokens on original strings in large databases when searching for a match to a reference string of tokens. This is because the prior art must scan the entire database of original strings in the matching procedure. It must do so because it fails to provide an indexing technique that quickly and accurately identifies only those original strings that contain possibly matching sequences of tokens.