It is no secret that the amount and types of information that can be accessed by data processing systems increase at a torrid rate. As the amount of available data increases, so too does the need for technologies that can recognize patterns in data. Indeed, pattern recognition is a recognized research discipline within computer science, devoted to studying the operation and design of systems that recognize patterns in data. It encompasses subdisciplines such as discriminant analysis, feature extraction, error estimation, and cluster analysis (together sometimes called statistical pattern recognition), as well as grammatical inference and parsing (sometimes called syntactical pattern recognition). Important application areas are found in image analysis, character recognition, speech analysis, man and machine diagnostics, person identification, industrial inspection, and analysis of molecular and/or biological sequences.
One common application of pattern recognition techniques is the analysis of data structures that consist of a sequence (or array) of data values, as compared to other such sequences. Sequence analysis, especially as it pertains to molecular biology, involves searching for similarities between some number of relatively small “needle” or “query” sequences and a typically much larger “haystack” or “subject” sequence. A sequence is a series of values, typically bytes, whose aggregate value has a physical basis. For example, a sequence of amino-acid identifier bytes may describe a complete protein. Likewise, a sequence of nucleic-acid identifiers may describe the DNA make-up of a chromosome or portion thereof. As another example, in the case of speech analysis, the data values in the sequence may represent the phonemes that make up a series of spoken words.
The most commonly used program for biological sequence analysis is the so-called BLAST (Basic Local Alignment Search Tool), although other similar programs exist. The core BLAST heuristic matching algorithm and a number of programs that use the algorithm are in the public domain and administered by the National Center for Biotechnology Information (NCBI) as described at http://www.ncbi.nih.gov. While the discussion of examples in this document uses the NCBI BLAST integration of biological sequence information as a principal example, it should be understood that the principles discussed herein are suitable for integration with other similar algorithms and/or for other types of data such as speech or image data. Note that the common terms in the biological community are “subject sequence” (to refer to the long sequence) and “query sequence” (to refer to the shorter sequence) rather than “haystack sequence” and “needle sequence”, respectively. This document avoids these more standard terms because the word “query”, at least when used by itself, has a different meaning in the relational database system art.
A given needle sequence can be similar to a given haystack sequence in several places. Each site of similarity is considered a “local alignment”.
Executing a BLAST program for “N” needle sequences against a haystack of “H” sequences results in a description of each of the independent areas of local similarity between every needle and every haystack sequence. Thus, the number of result descriptions can significantly exceed “N×H” values, but the number reported is usually much less because it is limited to those similarities considered statistically significant by the BLAST algorithm.
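The relationship between needles, haystack sequences, and local alignments can be illustrated with a minimal sketch. This is not the BLAST algorithm: it assumes a much simpler notion of similarity (maximal exact matches of at least `k` characters) and applies no statistical significance filter. The function names `find_local_matches` and `search_all` and the parameter `k` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_local_matches(needle, haystack, k=4):
    """Return sorted (needle_start, haystack_start, length) tuples.

    Toy stand-in for a local alignment search: reports maximal exact
    matches of length >= k. Real BLAST scores inexact alignments and
    filters them by statistical significance; none of that is done here.
    """
    # Index every k-character window of the haystack by its start offset.
    index = defaultdict(list)
    for j in range(len(haystack) - k + 1):
        index[haystack[j:j + k]].append(j)

    matches = set()
    for i in range(len(needle) - k + 1):
        for j in index.get(needle[i:i + k], ()):
            # Extend the seed to the left while the characters agree.
            n_start, h_start = i, j
            while n_start > 0 and h_start > 0 and needle[n_start - 1] == haystack[h_start - 1]:
                n_start -= 1
                h_start -= 1
            # Extend the seed to the right while the characters agree.
            n_end, h_end = i + k, j + k
            while n_end < len(needle) and h_end < len(haystack) and needle[n_end] == haystack[h_end]:
                n_end += 1
                h_end += 1
            # The set removes duplicates found from overlapping seeds.
            matches.add((n_start, h_start, n_end - n_start))
    return sorted(matches)


def search_all(needles, haystacks, k=4):
    """Compare every needle with every haystack sequence.

    Both arguments are dicts mapping an identifier to a sequence string.
    Returns (needle_id, haystack_id, needle_start, haystack_start, length).
    """
    results = []
    for n_id, needle in needles.items():
        for h_id, hay in haystacks.items():
            for n_start, h_start, length in find_local_matches(needle, hay, k):
                results.append((n_id, h_id, n_start, h_start, length))
    return results


if __name__ == "__main__":
    hits = search_all({"n1": "ACGT"}, {"h1": "TTACGTTTACGTAA"})
    for hit in hits:
        print(hit)
```

In this example one needle compared against one haystack sequence (N×H = 1) yields two separate sites of similarity, which is why the number of result descriptions is not bounded by N×H.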
It is also known that relational databases are used to store and analyze typically large amounts of information. Modern relational databases provide the user with a powerful query language, such as SQL-92 (Structured Query Language, ANSI version 92), to perform analysis and reporting of the data stored in the database system. Data analysis typically involves searching, grouping, counting, and relation-joining operations.
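As a rough illustration of the searching, grouping, counting, and relation-joining operations mentioned above, the following sketch runs a single SQL statement against an in-memory SQLite database through Python's standard `sqlite3` module. SQLite is used here only as a convenient stand-in for a relational database system. The table names `needle` and `alignment`, their columns, and the function `alignments_per_needle` are assumptions made for this example and do not come from any particular system.

```python
import sqlite3


def alignments_per_needle(needle_rows, alignment_rows, min_score):
    """Count alignments per needle and report the best score.

    needle_rows:    iterable of (id, name)
    alignment_rows: iterable of (needle_id, haystack_id, score)
    Returns a list of (name, hit_count, best_score) ordered by name,
    considering only alignments with score >= min_score.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema, for illustration only.
        conn.execute("CREATE TABLE needle (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute(
            "CREATE TABLE alignment (needle_id INTEGER, haystack_id INTEGER, score REAL)"
        )
        conn.executemany("INSERT INTO needle VALUES (?, ?)", needle_rows)
        conn.executemany("INSERT INTO alignment VALUES (?, ?, ?)", alignment_rows)

        cursor = conn.execute(
            """
            SELECT n.name, COUNT(*) AS hits, MAX(a.score) AS best
            FROM needle AS n
            JOIN alignment AS a ON a.needle_id = n.id   -- relation join
            WHERE a.score >= ?                          -- search
            GROUP BY n.name                             -- group and count
            ORDER BY n.name
            """,
            (min_score,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    needles = [(1, "n1"), (2, "n2")]
    alignments = [(1, 10, 50.0), (1, 11, 30.0), (2, 10, 20.0)]
    print(alignments_per_needle(needles, alignments, min_score=25))
```

The score threshold is passed as a bound parameter rather than formatted into the SQL text, which keeps the statement reusable for different thresholds.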
Molecular sequence analysis requires a large amount of processing resources, and the compute time, sometimes measured in hours or days, is often excessive as compared to the amount of time desired by the user. Part of this time is typically spent converting sequences from their stored format to computationally convenient formats and back, and computing other information not ultimately required by the user.