In many fields, large amounts of pattern data have been accumulated and stored in innumerable databases. However, the capacity to utilize the enormous amounts of data collected and stored is lacking. There is mounting interest in compact and efficient database searching techniques to locate a variety of different patterns. Such patterns may include nucleotide sequences, amino acid (e.g. peptide) sequences, geological samples, binary data, textual data, etc. In the particular field of bioinformatics, attempts are made to understand the information stored in DNA and other nucleotide sequences and its translation into the molecules of life, as well as to understand peptide sequences. In numerous applications in bioinformatics, it may be desirable to search for particular sequences of nucleotides or amino acids. Text pattern matching presents a major computational challenge because sequence databases are growing exponentially.
At times, genomes from different species are compared and analyzed by using techniques referred to as “comparative genomics”. Researchers examine different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans. Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them. Tools such as BLAST (available through NCBI) can perform such similarity searches.
As sequence data is generated, public databases are routinely scanned for similar sequences. Thereafter, sequence fragments may be collected by performing a cluster search and assembled into a larger consensus sequence. Building consensus sequences and whole genomes requires pattern searches to find and mask repeat regions, followed by clustering searches and layered meta-clustering searches. In addition, comparative genomics requires large numbers of searches of different genomes to find related molecules. Given the current volume of sequence data and the speed at which it is growing, sequence searching is often a rate-limiting step for modern genomics.
Most current searching methods look up pattern position information in a single array data structure. The index into this single array is often calculated by a function that maps the search pattern to a numeric index. The array is then examined at the location represented by the index. The array usually contains a reference to the positions of the patterns that are being searched for. For example, the SSAHA (Sequence Search and Alignment by Hashing Algorithm, available through The Sanger Institute, Cambridge, UK) method stores a single array for all possible sequence indexes. For large pattern lengths, the single-array methods will generate a large and often extremely sparse array.
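By way of illustration only, the single-array lookup described above may be sketched as follows. The sketch assumes a four-letter nucleotide alphabet and a base-4 mapping from pattern to index; the names `BASE_CODE`, `kmer_to_index`, `build_single_array_index` and `lookup` are hypothetical and are not taken from SSAHA or from any other cited method.

```python
from typing import List

# Assumed encoding of each nucleotide as a base-4 digit (illustrative only).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_index(kmer: str) -> int:
    """Map a pattern of length k to an integer in [0, 4**k) by reading it as a base-4 number."""
    index = 0
    for base in kmer:
        index = index * 4 + BASE_CODE[base]
    return index


def build_single_array_index(sequence: str, k: int) -> List[List[int]]:
    """Build one array holding a slot for every possible pattern of length k.

    Each slot lists the start positions at which that pattern occurs in the
    sequence; slots for patterns that never occur remain empty.
    """
    table: List[List[int]] = [[] for _ in range(4 ** k)]
    for pos in range(len(sequence) - k + 1):
        table[kmer_to_index(sequence[pos:pos + k])].append(pos)
    return table


def lookup(table: List[List[int]], kmer: str) -> List[int]:
    """Return the stored positions for a pattern by direct indexing into the array."""
    return table[kmer_to_index(kmer)]


if __name__ == "__main__":
    table = build_single_array_index("ACGTACGT", 3)
    occupied = sum(1 for slot in table if slot)
    print(lookup(table, "ACG"), len(table), occupied)
```

In this sketch a lookup is a single array access, but the array holds a slot for every possible pattern whether or not that pattern occurs in the indexed sequence, which is the source of the sparseness noted above.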
For large patterns the size or length of this single array data structure can become substantial. This single array will need to provide an entry or storage position for each possible unique pattern which may be searched for, but which may not necessarily be present within the database to be indexed.
This scheme allows a rapid search to be completed for any particular pattern but can be impractical for large pattern sizes. A large number of unique combinations of symbols are available to make up long patterns, which in turn places significant demands on the memory of a computer system used to facilitate such methods. Furthermore, the single large indexing array employed in prior art methods is comparatively sparsely populated with data, again resulting in a relatively inefficient use of resources. As can be appreciated by those skilled in the art, the memory resources used to implement such systems will increase exponentially with a linear increase in the length of the pattern searched for.
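The exponential growth noted above may be illustrated with a short calculation. The sketch below assumes one array slot per possible pattern and a fixed, purely hypothetical slot size of 8 bytes; the function names are likewise illustrative and are not drawn from any cited method.

```python
def single_array_slots(alphabet_size: int, pattern_length: int) -> int:
    """Number of slots needed when every possible pattern gets its own entry."""
    return alphabet_size ** pattern_length


def single_array_bytes(alphabet_size: int, pattern_length: int,
                       bytes_per_slot: int = 8) -> int:
    """Memory for the array itself, assuming a fixed (hypothetical) slot size."""
    return single_array_slots(alphabet_size, pattern_length) * bytes_per_slot


def max_occupancy(alphabet_size: int, pattern_length: int,
                  database_length: int) -> float:
    """Upper bound on the fraction of slots that can be non-empty.

    A database of n symbols contains at most n - k + 1 patterns of length k,
    so no more than that many slots can ever be populated.
    """
    slots = single_array_slots(alphabet_size, pattern_length)
    windows = max(database_length - pattern_length + 1, 0)
    return min(1.0, windows / slots)


if __name__ == "__main__":
    # Each additional nucleotide multiplies the array size by four.
    for k in (8, 12, 16, 20):
        print(k, single_array_slots(4, k), single_array_bytes(4, k))
```

Under these assumptions, a pattern length of 16 over a four-letter alphabet requires 4^16 (about 4.3 billion) slots, or 32 GiB at 8 bytes per slot, and each further symbol multiplies that figure by four. The occupancy bound shows why such an array is sparse: a database of fixed length can populate at most a fixed number of slots while the slot count grows exponentially.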
There is a need for a process that finds patterns faster than existing processes and that places no limits on word sizes. The search capability should be efficient and compact, so as to use less memory than current search techniques require.
All references, including any patents or patent applications cited in this specification, are hereby incorporated by reference. No admission is made that any reference constitutes prior art. The discussion of the references states what their authors assert, and the applicants reserve the right to challenge the accuracy and pertinency of the cited documents. It will be clearly understood that, although a number of prior art publications are referred to herein, this reference does not constitute an admission that any of these documents form part of the common general knowledge in the art, in New Zealand or in any other country.
It is acknowledged that the term ‘comprise’ may, under varying jurisdictions, be attributed with either an exclusive or an inclusive meaning. For the purpose of this specification, and unless otherwise noted, the term ‘comprise’ shall have an inclusive meaning—i.e. that it will be taken to mean an inclusion of not only the listed components it directly references, but also other non-specified components or elements. This rationale will also be used when the term ‘comprised’ or ‘comprising’ is used in relation to one or more steps in a method or process.
It is an object of the present invention to address the foregoing problems or at least to provide the public with a useful choice.
Further aspects and advantages of the present invention will become apparent from the ensuing description which is given by way of example only.