1. Field of the Invention
The present invention relates to pattern match and, in particular, to an architecture that provides efficient pattern match operations.
2. Description of the Related Art
Pattern match applications are becoming increasingly important in life science, Internet commerce, and other fields. In general, a pattern match application includes four steps. In a first step, the database can be prepared. This step can further include creating secondary files (e.g. index files), where needed, and updating the database as needed. In a second step, a search can be performed that compares candidates within the database with an input query pattern. Typically, the pattern match application specifies one or more minimal requirements for matching. In a third step, each candidate is scored. Finally, in a fourth step, the most relevant matches can be output.
One familiar example of pattern match is an Internet search engine that finds web pages matching the input query words, wherein the resulting database consists of all stored web URLs. Key words may be used as indices within index tables to quickly find a number of web URLs that contain the word. In this case, the pattern to be matched is one-dimensional, consisting of a series of letters forming words and phrases. The pattern match operation consists of a “search” part followed by a “score” part. The match candidates may consist of all URLs containing at least one word in the input query, for example. Note that the vast majority of URLs, which do not contain any query words, are not scored. Each candidate is then scored according to a formula that may involve the number of input query words within the URL, how many other URLs reference the given candidate, and so on. These scored web pages, stored in an output file, can be displayed to the user in order of decreasing final score.
Pattern match is also used extensively in the life sciences. For example, within the field of genomics, the popular program BLAST from the National Center for Biotechnology Information (NCBI) accepts a DNA or protein sequence represented by a string of characters, wherein each character represents a nucleotide or amino acid. For each query, the BLAST program will output a set of the most closely matched sequences from the database based on a user-specified scoring criterion. Drug discovery researchers use BLAST to derive a hypothesis of the function of an unknown protein or DNA sequence by comparing it to a database of known protein and genome entries.
Pattern match can also be used in the field of proteomics for protein identification using mass spectrometry. For example, programs such as SEQUEST from the University of Washington, X!TANDEM from the Manitoba Proteomics Centre, and OMSSA from NCBI identify an unknown protein by matching its measured mass spectrum against theoretical mass spectra from a protein database. Specifically, the proteins in the resulting database that provide the highest scoring matches with the query are the most likely candidates for the unidentified chemical.
Note that while one-dimensional pattern match is appropriate for some applications, multi-dimensional pattern match may be desirable for other applications. For example, protein structure prediction using Support Vector Machines and other machine learning methods involves matching the input protein sequence to sets of categorized protein classes in multi-dimensional vector space. Two-dimensional image analysis is used to match and align 2-D gel images involving thousands or more patterns. Phylogenetic studies require matching the genomes from different species to determine their evolutionary chronology. These analyses involve significant computation that matches complex patterns. This type of pattern match can benefit from massively parallel integer (or simple floating point) operations on relatively large databases.
In contrast to calculation-based applications that emphasize high-precision floating-point performance, current pattern match applications require massively parallel variable-length integer operations on large databases. Unfortunately, standard computers are not well suited to pattern match because of their rigid (usually 32-bit or 64-bit) arithmetic logic units (ALUs), the relatively small number of such ALUs per processor, limited memory address space (only 4 gigabytes in popular 32-bit processors like the Intel Pentium 4), and bus-limited memory bandwidth between the processor and the main memory.
Additionally, on-chip resources used in standard computers, such as floating point units and cache memory, occupy significant chip space and increase static power dissipation, but are used rarely during run-time of the pattern match application. For example, for the bulk of the pattern match application run-time, where more than 95% is spent in the “search and score” functions, these on-chip resources do not add value and, in fact, may adversely affect the running of a pattern match application. For example, the 32-bit address space of a Pentium 4 PC limits the main memory to 4 gigabytes, which requires larger databases (e.g. the databases typically found in pattern match applications) to be partitioned and processed in pieces, thereby increasing runtime as well as the complexity of the pattern match software.
Sequential Searching Using a Standard Computer
FIG. 1 illustrates a simplified pattern match method 100 including a sequential search. In step 101, a database can be prepared. Note that in typical embodiments, step 101 can be performed separately from the pattern match application, and is merely shown in method 100 for completeness. In a sequential search, database preparation may be as simple as assembling the correct file with the correct file name. In one embodiment, the database may also be reorganized to strip out any annotation and other header information from the database core and replace that information with a small numerical value. This reorganization can reduce the file size of the database during the search, thereby streamlining the sequential comparisons. Further variations can include allowing multiple databases to be accessed, or restricting the analysis to specific sections of the database, e.g. accessing only the human entries in a protein database rather than across all species.
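By way of illustration only, the database reorganization described above might be sketched as follows. The record layout (a header string paired with a sequence string) and the function name strip_annotations are assumptions made for this sketch and are not taken from any particular database format.

```python
def strip_annotations(records):
    """Separate annotation headers from the database core.

    `records` is assumed (hypothetically) to be a list of
    (header, sequence) pairs. Each header is replaced in the core
    by a small integer id, and kept in a side table for later use.
    """
    core = []          # compact data that is actually searched
    annotations = {}   # id -> original header, used only in post-processing
    for entry_id, (header, sequence) in enumerate(records):
        annotations[entry_id] = header
        core.append((entry_id, sequence))
    return core, annotations


records = [("protein A, human, long description", "MDEFG"),
           ("protein B, mouse, long description", "MKLDEF")]
core, annotations = strip_annotations(records)
```

In this sketch the search loop would touch only core, while annotations would be consulted when the final match list is formatted.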
When a pattern match application is run on a query, the query is first pre-processed in step 102 to check for errors/missing information, make format adjustments, or eliminate noise. In one embodiment, the query may also be modified to specific data representations that improve pattern match, e.g. building a table of all occurrences of 3-letter instances to search for and their location in the query. Note that in some embodiments, multiple queries can be simultaneously used. For example, the OMSSA protein identification program can store several hundred queries before the database is searched. In this manner, as each database entry is analyzed, there are more chances for this entry to be a candidate match for one of the stored queries.
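As a minimal sketch of the query table mentioned above, the following builds a table of every 3-letter instance in a query together with its locations. The function name build_kmer_table and the parameter k are hypothetical and are used here only for illustration.

```python
from collections import defaultdict


def build_kmer_table(query, k=3):
    """Map each k-letter substring of the query to its start positions."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return dict(table)


# "DEF" occurs twice in this example query, at positions 0 and 4.
table = build_kmer_table("DEFGDEF")
```

A query shorter than k letters simply yields an empty table in this sketch.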
After the query (or queries) is pre-processed, the database can be searched sequentially. In general, a sequential search means that all the entries in the database are analyzed one at a time against the query, e.g. from the first entry to the last. To begin this search, step 103 can determine whether another database entry is present. If so, then the database entry can be compared to the query in step 104.
In one embodiment of step 105, each entry in the database (also called a candidate herein) can be pattern-matched with the query on a “rough scale” to find a hit. For example, in the case of BLAST protein searches, each protein database entry could be checked for the existence of a 3-letter combination determined from the query pre-processing (step 102).
If the candidate meets the predefined criterion, e.g. using the rough scale, then step 106 can calculate the score and the results. In one embodiment, only scores that surpass a threshold are stored for post-processing. Note that steps 104-106, which comprise the “search and score” operation of the pattern match application, take up the majority of the run-time.
If the candidate does not meet the criterion in step 105, then that database entry is discarded. After either step 105 or step 106, process 100 returns to step 103 to determine whether another database entry is present. After all database entries have been evaluated, as determined by step 103, the set of stored matches and their scores can be post-processed in step 107. This post-processing can include picking out the highest scoring matches, determining overall statistics, incorporating annotations where appropriate, and/or reformatting the match list for user readability.
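The loop of method 100 can be sketched as follows, under stated assumptions. The rough-scale criterion here is the presence of any 3-letter combination from the query, and the score is simply the number of distinct query combinations found in the entry. That score is a placeholder chosen for brevity and is not the scoring used by BLAST or any other named program; the function name and parameters are likewise hypothetical.

```python
def sequential_search(database, query, k=3, threshold=1):
    """Compare every database entry to the query, one at a time."""
    # Query pre-processing: collect the distinct k-letter combinations.
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    matches = []
    for entry in database:                       # is another entry present?
        found = [kmer for kmer in query_kmers if kmer in entry]
        if not found:                            # rough-scale criterion fails
            continue                             # entry is discarded
        score = len(found)                       # placeholder score
        if score >= threshold:                   # keep only scores above threshold
            matches.append((score, entry))
    # Post-processing: list the highest scoring matches first.
    matches.sort(key=lambda match: match[0], reverse=True)
    return matches


results = sequential_search(["XXDEFXX", "AAAA", "DEFGHI"], "DEFG")
```

Note that every entry is visited, including "AAAA", which is why the run-time of this approach grows with the size of the database.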
Sequential searching using a standard computer has the following benefits. First, sequential searching allows substantially any conceivable criterion and/or pattern scoring method to be implemented. Second, a standard computer is well suited to the pre-processing and post-processing of the pattern match application. For example, the query can be filtered in step 102 to eliminate input noise and extraneous information. These functions often require processing of user options and format changes that can be easily performed by a standard computer. Similarly, post-processing of the sorted pattern match outputs in step 107 may involve extensive statistical processing and output formatting that can be easily performed by a standard computer.
Unfortunately, sequential searching using a standard computer also has the following disadvantages. First, searching every database entry results in algorithmic inefficiency. Specifically, the search time grows proportionately with the database size. Because the size is growing exponentially for many important life science databases, the search time quickly becomes impractical.
Second, the run-time of the standard computer is limited by processor bus bandwidth, not memory bandwidth. As noted above, most of the system time is spent in the search-and-score loop involving simple integer operations, which should theoretically be limited only by the memory bandwidth. However, in most cases, the processor bus is the limiting factor because it is shared by many processor peripherals.
Third, a processor of a standard computer can handle only one comparison in a serial manner, which limits the overall throughput of the pattern match application.
Fourth, the memory space limitation (32 bits, or 4 gigabytes for modern PCs) is too small for many DNA and protein databases. For example, the human genome alone contains 3.2 billion base pairs, or about 800 megabytes using 2 bits per base pair. Therefore, databases of multiple species can exceed the 4-gigabyte limit of a standard computer. To provide additional storage, disk access can be used, thereby compensating for this inadequate memory space. However, accessing a disk is highly inefficient and therefore can considerably slow the throughput.
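The storage figures above can be checked with a short calculation. The only inputs are the numbers stated in the text (3.2 billion base pairs, 2 bits per base pair, a 32-bit address space); the assumption that other genomes are of similar size is made purely for illustration.

```python
BASE_PAIRS = 3_200_000_000      # human genome, as stated above
BITS_PER_BASE_PAIR = 2          # four nucleotides fit in 2 bits
ADDRESS_BITS = 32

genome_bytes = BASE_PAIRS * BITS_PER_BASE_PAIR // 8   # 800,000,000 bytes
address_limit = 2 ** ADDRESS_BITS                     # 4,294,967,296 bytes

# Number of genomes of this size that fit in the address space,
# ignoring the operating system, the program, and any index tables.
genomes_that_fit = address_limit // genome_bytes      # 5
```

Under that assumption, a database holding six such genomes would already exceed the address space before any program code or working data is counted.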
Index Searching Using a Standard Computer
FIG. 2 illustrates a simplified pattern match method 200 including an indexed search. Compared to sequential searching, an indexed search allows a faster run-time by looking only at entries with potential to be a candidate, at the expense of additional complexity during the database preparation step. Specifically, in step 201, a database can be prepared and an index table from that database can be generated.
Generating the index table typically includes creating a list of indices, wherein each index has an associated list of all database instances associated with that index. For example, for an index search with BLAST, the protein sequence database may be used to generate an index table, indexed by 3-letter keys, whose list contains all the instances of that 3-letter key within the database (e.g. in the form of database entry and beginning position).
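A minimal sketch of such an index table is given below, assuming the database is held as a list of sequence strings. Each 3-letter key maps to a list of (database entry, beginning position) pairs, as described above. The function name build_index_table is hypothetical.

```python
from collections import defaultdict


def build_index_table(database, k=3):
    """Map each k-letter key to every (entry number, start position) where it occurs."""
    index = defaultdict(list)
    for entry_id, sequence in enumerate(database):
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((entry_id, pos))
    return dict(index)


index = build_index_table(["ADEF", "DEFG"])
```

Because every position of every entry contributes one item to the table, the table holds roughly one (entry, position) pair per letter of the database, which gives a sense of why index tables can be considerably larger than the database itself.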
The query can be pre-processed in step 202 to check for errors/missing information, make format adjustments, or eliminate noise (see also, step 102 of FIG. 1). After the query is pre-processed, the database can be searched by index. Specifically, step 203 can determine whether an index corresponding to the query is present.
For example, if the first 3 letters of the query are “DEF”, instead of looking for the instance of “DEF” in the first database entry, then the second, and so on (as would be performed for a sequential search), the pattern match application will read in the index file that lists all instances of “DEF” throughout the database. If an index corresponding to the query is present, then step 204 can determine whether a database entry in that index is present. If a database entry is present, then that database entry can be compared to the query in step 205.
If the candidate meets the predefined criterion, as determined by step 206, then step 207 can calculate the score and the results. If the candidate does not meet the criterion in step 206, then that database entry is discarded. After either step 206 or step 207, process 200 returns to step 204 to determine whether another database entry is present. After all database entries in an index have been evaluated, as determined by step 204, process 200 returns to step 203 to determine whether another index associated with the query is present. Thus, for example, if the 4th letter in the query is “G”, then the pattern match application can read in the list associated with “EFG” for all of its instances. This process is repeated for all the 3-letter combinations until the last one is encountered at the end of the query sequence. Finally, after all indexes corresponding to the query have been evaluated, as determined by step 203, the set of stored matches and their scores can be post-processed in step 208.
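The flow of method 200 can be sketched as follows. The sketch assumes an index table of 3-letter keys mapped to (entry number, position) pairs, built by a small helper included here so that the example is self-contained. The score is a simple count of index hits per entry, which is a placeholder and not the scoring of any named program; all function names are hypothetical.

```python
from collections import defaultdict


def build_index_table(database, k=3):
    """Hypothetical helper: k-letter key -> list of (entry number, position)."""
    index = defaultdict(list)
    for entry_id, sequence in enumerate(database):
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((entry_id, pos))
    return index


def indexed_search(database, index, query, k=3):
    """Visit only the entries listed under the query's k-letter keys."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):          # next index for the query
        key = query[pos:pos + k]                   # "DEF", then "EFG", ...
        for entry_id, _ in index.get(key, []):     # next entry in that index
            hits[entry_id] += 1                    # placeholder compare and score
    # Post-processing: highest score first.
    return sorted(((score, database[eid]) for eid, score in hits.items()),
                  reverse=True)


database = ["XXDEFXX", "AAAA", "DEFGHI"]
results = indexed_search(database, build_index_table(database), "DEFG")
```

In this sketch the entry "AAAA" is never read during the search, because it appears in neither the "DEF" list nor the "EFG" list.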
The “search and score” operation for index searching is typically faster than sequential searching because fewer database entries are compared to the query (note that the actual scoring time would be the same). However, in practice, index searching is typically not used within the life sciences because of the following disadvantages.
First, the search criteria change too frequently, thereby requiring frequent re-indexing operations. Specifically, every time the search criterion changes, it is necessary to re-run the indexing program. For example, if the search criterion for BLAST looks for 4-letter combinations rather than 3-letter ones, the entire index table must be re-generated. For large databases, the indexing time can take many hours, which results in reduced overall throughput.
Second, the index table generation takes too long for users. Standard computers are slow at generating files that exceed several gigabytes for at least two reasons: limited address space and inefficiency in sorting large files. Large index tables on large databases can easily exceed the main memory limit of PCs, thereby causing the computer to operate on small chunks of memory, e.g. 256 megabytes of memory, at a time. As mentioned previously, this can result in excessive disk swapping. Additionally, the sorting operation required within index table generation is a known, time-consuming software problem on large databases, especially if each sorting block must be within the limited main memory space.
Third, the index table is too large, thereby requiring frequent disk access that reduces run-time improvements. Specifically, the index table is typically an “inverted file”, wherein the entries are listed using the attribute as the key. In practical terms, this means that the index table can be from 3 to greater than 10 times larger than the database.
Thus, running pattern-matching applications on standard computers, whether using sequential searching or index searching, has significant disadvantages.
PC Clusters
To increase throughput using standard computer architecture, clusters of networked PCs can be used. A PC cluster includes a master “node” that controls the overall data flow with the other slave nodes doing the distributed computation. Typically, a large database is partitioned evenly across all the nodes. The same “search and score” application can be run on all the nodes, with the collective results reported back to the master node for consolidation.
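The master and slave arrangement described above can be sketched on a single machine, using worker threads as stand-ins for networked nodes. This is an illustration of the data flow only. The round-robin partitioning, the occurrence-count score, and the function names are assumptions made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, query):
    """Work done by one node: search and score its share of the database."""
    return [(entry.count(query), entry) for entry in partition if query in entry]


def cluster_search(database, query, num_nodes=4):
    """Master node: partition the database, dispatch, and consolidate results."""
    # Partition the database evenly across the nodes (round-robin).
    partitions = [database[i::num_nodes] for i in range(num_nodes)]
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(
            pool.map(search_partition, partitions, [query] * num_nodes))
    # Consolidate the per-node results into one list, best score first.
    merged = [match for part in partial_results for match in part]
    return sorted(merged, reverse=True)


results = cluster_search(["DEF", "ABC", "DEFDEF", "XYZ", "QDEF"], "DEF", num_nodes=2)
```

Note that the query is copied to every node in this sketch, which mirrors the replication of common data that the distributed memory arrangement requires.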
While the PC cluster increases the throughput compared to single computers, a PC cluster also has disadvantages. For example, a PC cluster has the same silicon inefficiency as a single computer. Specifically, the overall memory architecture is also distributed locally among the nodes rather than shared among the nodes. This distribution makes for inefficiency where common data must be replicated in the distributed memory blocks.
Moreover, a PC cluster still has a limit as to how many processors may be used due to space and power dissipation. Specifically, a processor (e.g. a processor drawing on the order of 65 watts) and associated disk drives can generate considerable heat. Thus, the heat generated by a PC cluster, if not dissipated, can significantly increase internal operating temperatures to the point where system performance is compromised (i.e. a reduction of system reliability). Therefore, the use of a PC cluster must include providing considerable power to sufficiently cool the system. Additionally, the use of a PC cluster further includes increased maintenance time to service processors that are defective due to heat-induced failures. For example, the time involved in loading maintenance software on each node of a 100-processor PC cluster may be close to 100 times that of a single PC.
Reconfigurable Computers
Reconfigurable computers are special purpose data processing systems for accelerating algorithms in compute-intensive problems using a massively parallel approach. Reconfigurable computers, typically built using large FPGAs, offer very high computing resources with a high degree of interconnectivity and relatively small main memories.
For example, FIG. 3A illustrates a commercially available reconfigurable computing platform 300 (the “BenERA” platform with up to four DIME-II modules, all available from Nallatech in Glasgow, Scotland), with a maximum of 80 million system gates (using eight XC2V10000 Xilinx FPGAs, each with 10 million system gates) in the system core. Platform 300 provides high-speed 64-bit busses between each DIME-II module (e.g. a “BenBLUE-II” module 301, as illustrated in greater detail in FIG. 3B, each containing up to two XC2V10000 FPGAs). Notably, platform 300 provides only 32 megabytes of memory for system use.
Reconfigurable computers have been popular for specialized digital signal processing applications that require extensive computation on a relatively small memory space, such as performing 1024-point Fast Fourier Transform or image compression using a discrete cosine transform in target applications such as cellular base stations, radar, or medical imaging. Unfortunately, reconfigurable computers, particularly due to the high logic-to-memory resources and the scarcity of main memory, are not directly suited for pattern match applications. Specifically, as noted above, pattern match applications have small computation requirements (in the sense of small integer or fixed point arithmetic) and a limited need for interconnectivity, but are very memory-intensive.
Bioinformatics Accelerators
FIG. 4 illustrates a simplified bioinformatics accelerator architecture 400 including a host computer 401 and an accelerator 402. In a typical embodiment, accelerator 402 can include a logic block 403 (either fixed or reconfigurable) and a scratch memory 404. FIG. 5 illustrates the basic functions performed by host computer 401 and accelerator 402.
Bioinformatics accelerator 400 uses an approach similar to that discussed in reference to “Sequential Searching Using a Standard Computer”, wherein each database entry is individually compared. However, to achieve run time improvement through parallelism, multiple (e.g. 100 or more) queries are batched together and downloaded from host computer 401 to scratch memory 404 (step 501). After a predetermined number of queries are stored in scratch memory 404, the database entries (stored on host computer 401) are sent (step 502) to logic block 403 one by one for comparison to all of the stored queries in parallel (e.g. comparing 1 entry versus N queries) (step 504).
When a matching entry is found by any of the parallel comparison circuits, logic block 403 can generate a score (step 504). That score can be reported back to host computer 401 (step 505) as well as stored in scratch memory 404 (step 506). After the entire database is streamed through accelerator 402, host computer 401 can provide another batch of queries. When all processing is complete, host computer 401 can provide any post-process function as well as output the results (step 503).
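The batch approach can be sketched in software as follows. In hardware, the comparison of one entry against the N stored queries would occur in parallel comparison circuits; here an inner loop stands in for that parallelism. Substring containment as the match criterion, the occurrence-count score, and the function name are assumptions made for illustration.

```python
def batched_stream_search(database, queries):
    """Stream each database entry once past a batch of stored queries."""
    # The batch of queries is loaded first; results are kept per query.
    scratch = {query: [] for query in queries}
    for entry in database:                 # entries are sent one by one
        for query in queries:              # 1 entry versus N queries
            if query in entry:             # a comparison finds a match
                scratch[query].append((entry.count(query), entry))
    return scratch


results = batched_stream_search(["DEFG", "ABC", "ABCDEF"], ["DEF", "ABC"])
```

The sketch makes the trade-off visible: the database is read only once per batch, but no query in the batch has a complete result until the entire database has been streamed.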
Using a batch approach to queries can be significantly faster than sequential searching. However, the bioinformatics accelerator has the following disadvantages. First, the response time for each query is slow. Specifically, architecture 400 cannot provide fast single-query response times, such as the case where real-time pattern match is needed. For example, it is desirable for a pattern match application to be run in conjunction with a protein or genomic sequencing system, so that each individual output can be analyzed as it becomes available. The batch analysis approach of the bioinformatics accelerator cannot provide an analysis of individual outputs.
Second, the bus bandwidth of architecture 400 limits overall performance. Specifically, the minimum time to complete a query search is determined by the time it takes host computer 401 to read out and send the entire database to accelerator 402, plus the time accelerator 402 requires to compare and score. Thus, this minimum time can be limited by the bus bandwidth of architecture 400.
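The lower bound described above can be expressed as a simple model: transfer time (database size divided by bus bandwidth) plus compare-and-score time. The numerical values below (a 4-gigabyte database, a bus sustaining 1 gigabyte per second, and 0.5 seconds of compare-and-score time) are hypothetical and serve only to show the arithmetic.

```python
def min_query_time(database_bytes, bus_bytes_per_sec, compare_score_sec):
    """Lower bound on the time to stream the whole database and score it."""
    transfer_sec = database_bytes / bus_bytes_per_sec
    return transfer_sec + compare_score_sec


# Hypothetical figures: 4 GB database, 1 GB/s bus, 0.5 s of compare and score.
t = min_query_time(4e9, 1e9, 0.5)   # 4.5 seconds
```

With these figures the transfer term dominates, so speeding up logic block 403 alone would change the total very little.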
Third, architecture 400 is incompatible with index search algorithms that can be employed in software-only solutions. As noted previously, the speed benefit from index searching comes from getting more information with each memory access, at the expense of accessing a much larger file (i.e. the index table, which is typically from 3 to more than 10 times the size of the database for many common search conditions). For standard computers, the larger file sizes do not add memory access time significantly. However, for bioinformatics accelerators that stream the database through the accelerator, larger file sizes cause longer runtime. Therefore, bioinformatics accelerators are typically not used for index searching, thereby limiting their use to cases where required re-indexing is impractical or when specifically requested by a user.
Therefore, a need arises for a data processing architecture optimized for pattern match applications, especially those currently found in life science, Internet commerce, and other fields. This data processing architecture would have the following attributes: high-throughput, silicon efficiency, low power dissipation, smaller space, high reliability, optimized memory bandwidth, and a scalable architecture.