1. Field of the Invention
The invention relates generally to systems and methods for efficiently searching for desired patterns in sequential data, and more specifically to the searching of sequential data having broad application in a variety of fields, including but not limited to bioinformatics, molecular biology, pharmacogenomics, phonetic sequences, lexicographic sequences, signal analysis, game playing, law enforcement, biometrics, medical diagnosis, equipment maintenance, and micro-array data analysis.
2. Background Information
Exact and approximate string matching are two of the main techniques used in applications such as text searching, computational biology or bioinformatics, and pattern recognition.
A natural extension is multi-pattern searching, that is, searching for several patterns simultaneously in order to report all occurrences that contain no more than a limited number of differences. Similarly, the method may be extended to multi-modal pattern detection, which includes searching several different databases with a set of patterns, each of which is appropriate to the particular database. Such searching has several applications, including biometrics, virus and intrusion detection, spelling applications, text retrieval under synonym or thesaurus expansion, several problems in computational biology, and batch processing of single-pattern approximate searching.
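By way of illustration only, the notion of reporting all occurrences of several patterns with a limited number of differences may be sketched with the textbook dynamic-programming formulation of approximate matching. The sketch below is not the method of the invention; the function names `approx_find` and `multi_approx_find` and the parameter `k` (the maximum number of insertions, deletions, and substitutions allowed) are hypothetical and are introduced here solely for the example.

```python
from typing import Dict, Iterable, List, Tuple


def approx_find(pattern: str, text: str, k: int) -> List[Tuple[int, int]]:
    """Return (end_index, distance) for each text position at which some
    substring ending there is within edit distance k of the pattern."""
    m = len(pattern)
    # col[i] holds the smallest edit distance between pattern[:i] and any
    # substring of the text ending at the position processed so far.
    col = list(range(m + 1))
    hits: List[Tuple[int, int]] = []
    for j, ch in enumerate(text):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits


def multi_approx_find(patterns: Iterable[str], text: str,
                      k: int) -> Dict[str, List[Tuple[int, int]]]:
    """Naive multi-pattern search: run the single-pattern scan per pattern."""
    return {p: approx_find(p, text, k) for p in patterns}


if __name__ == "__main__":
    print(multi_approx_find(["abc", "zzz"], "xabcx", 1))
```

This sketch scans the text once per pattern, so its cost grows with the number of patterns and with the product of pattern length and text length. It is offered only to make the problem statement concrete, not as an efficient search technique.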
The field of bioinformatics includes the systematic development and application of information technologies and data processing techniques for collecting, searching, analyzing and displaying data obtained by experiments to make observations concerning biological processes. Bioinformatics is concerned with the use of computing in biological research areas such as genomics, transcriptomics, proteomics, genetics and evolution (see, for example, Goodman, Current Opinion in Biotechnology, 2002, 13:1:68-71).
High-throughput sequencing projects have generated complete genome sequences for scores of microbes and several eukaryotes, including human. The successful completion of genome projects has yielded complete genomic sequences of several species, including H. sapiens, C. elegans, A. thaliana, D. melanogaster, M. musculus, S. pombe, S. cerevisiae, rice, dozens of prokaryote genomes, and hundreds of virus genomes (the initial sequences of the human genome, for example, may be found at the following references: International Human Genome Sequencing Consortium, Initial sequencing and analysis of the human genome, Nature 409, pp. 860-921, 2001, and J. C. Venter et al., The sequence of the human genome, Science 291, p. 1304, 2001).
With this explosive growth of biological sequence data, biological information is less frequently published in the conventional way, in a scientific journal, and is instead deposited into a database. In the last two decades these databases have become essential tools for researchers in the biological sciences. Such databases are generally classified according to the type of sequence information they contain. Basic types of sequence-related databases include nucleic acid sequences, amino acid sequences representing polypeptide primary structures, and protein tertiary structures, as well as various specialized data collections.
Biologists use these comprehensive data in their attempts to discover the biological functions of genes and the proteins they encode. For many proteins, for example, it is possible to make inferences of function based simply on recognizable similarity with previously characterized sequences. Currently, between one-third and one-half of the genes in newly sequenced genomes can be annotated on the basis of recognized similarity to genes of other organisms. Furthermore, as more genes are characterized, a greater fraction of new and extant genomes can be annotated through similarity searches.
The ability to make valid inferences based on sequence similarity depends on the relationship between sequence, structure and function—all of which revolve around the imputation of common ancestral roots or homology.
Given this basis for sequence similarity, divergence between homologous sequences shows a consistent rate pattern based on the nature of evolution. Protein function mutates and evolves more slowly than protein structure, and protein structure evolves more slowly than gene and amino acid sequences.
While the usual method for detecting homology is sequence comparison, effective annotation of unknown genes requires that we have some ability to determine functional similarities. While finding sequence similarity is meaningful, finding structural similarity brings us closer to our aim, which is the accurate determination of a genetic function. Given a similarity-finding technique that provides sufficient clues about structure—and thereby function—we can use those clues to suggest experiments, form hypotheses, and thereby pursue further characterization of unknown proteins.
Because similarity-finding has such a central role in annotating existing and newly sequenced genomes, many methods have been developed including the following:
BLAST (Basic Local Alignment Search Tool) described by S. F. Altschul, W. Gish, W. Miller, E. Myers and D. J. Lipman, Basic local alignment search tool, J. Mol. Biol., 215, 403-410 (1990); and the family of related tools that it spawned, including WU-Blast, Psi-Blast, MegaBlast and BL2SEQ;
SENSEI; see a description by D. States on the SENSEI world site at the hypertext transfer protocol “stateslab.wustl.edu/software/sensei/”;
MUMmer; see A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg, Alignment of whole genomes, Nucleic Acids Research, 27:11, 2369-2376 (1999);
QUASAR; see S. Burkhardt, A. Crauser, H.-P. Lenhof, E. Rivals, P. Ferragina and M. Vingron, q-gram based database searching using a suffix array, 3rd Ann. International Conference on Computational Molecular Biology, Lyon, Apr. 11-14, 1999; and
REPuter; see S. Kurtz and C. Schleiermacher, REPuter: Fast computation of maximal repeats in complete genomes, Bioinformatics, 15:5, 426-427 (1999).
In addition to the field of bioinformatics, sequential data is becoming increasingly voluminous in other disciplines, thereby requiring efficient methods and systems for processing the data. One such example is the area of personal identification by biometric data (for example, fingerprints, hand geometry, iris, retina, signature, voiceprint, facial thermogram, hand vein, gait, ear, odor, keystroke dynamics, etc.). Homeland Security, electronic banking, e-commerce and smartcards, along with increased emphasis on the privacy and security of information stored in various databases, lead to the generation of massive amounts of sequence data as well as a need for automatic personal identification using biometric data. Accurate automatic identification is needed in applications involving the use of passports, cellular telephones, automatic teller machines and driver licenses.
The existing tools for searching and processing sequential data, including those used in the fields of bioinformatics and biometrics, are, however, limited in application due to lengthy processing times and/or lack of sensitivity. Therefore, there is a need for improved systems and methods for efficiently and accurately storing, processing, and searching large amounts of sequential data.