1. Field of the Invention
The present invention relates generally to computer-based matching of patterns in quantitative and symbolic data, and in particular to searching large data sets for matching patterns using a dynamically constructed index organized into contiguous, non-overlapping ranges.
2. Description of the Related Art
A common problem in information retrieval is to find all documents in a large collection or entries in a large database that match a given pattern. Many of the specific pattern-matching problems and solutions thereto that are discussed in the computer science literature arise in text processing, because text data is voluminous, important in commerce and science, and straightforward to represent, if not to analyze. However, pattern matching in large data sets is also an important need in applications of biometric analysis, image analysis, data mining, Internet searching and bioinformatics.
For example, one large-scale analysis is to find all documents with certain keywords in a large corpus, say a database of all U.S. patent descriptions. One method of performing this analysis could include string searching, which is a well-characterized paradigm with several well-known algorithms addressing it. In a string-search, all documents are scanned to look for the matching terms. In some cases, the processing time for the scanning is directly correlated to the length of the database. However, with large datasets, this processing time is not fast enough to respond interactively for a new query.
Therefore, ways of pre-processing the database to support faster response times have been sought. An exemplary pre-processing tool is an “inverted file” index. This inverted file index, which can be maintained in some quickly accessible, ordered data structure, contains all the potential keywords and their locations in the original database. Using such an index, search terms can quickly be looked up and relevant documents located, albeit at the expense of a lengthy pre-processing step and of substantial extra storage (often exceeding the length of the original file).
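By way of illustration only, an inverted file index of the kind described above can be sketched in Python as a mapping from each keyword to the documents and word positions in which it occurs. The function names `build_inverted_index` and `lookup`, the whitespace tokenization, and the sample documents are assumptions made for this sketch and are not drawn from any particular system.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased word to a list of (doc_id, word_position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index


def lookup(index, keyword):
    """Return the sorted ids of the documents that contain the keyword."""
    return sorted({doc_id for doc_id, _ in index.get(keyword.lower(), [])})


# Hypothetical three-document corpus.
docs = [
    "mass spectrometry of peptides",
    "string searching in text",
    "searching peptides by mass",
]
index = build_inverted_index(docs)
print(lookup(index, "peptides"))
print(lookup(index, "searching"))
```

The pre-processing cost is paid once in `build_inverted_index`, after which each query is a dictionary lookup rather than a scan of every document. The index holds one entry per word occurrence, which is consistent with the storage overhead noted above.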
Despite these disadvantages, most mainstream applications in text searching tend to use an inverted file index to deliver practical performance. Exemplary applications include commercial information systems (e.g. Dialog and Lexis-Nexis), Internet search engines, and toolkits for relational database management system platforms.
Unfortunately, because of the overhead associated with full indexing, an inverted file index may not be completely up to date when it is used. Moreover, the space requirements for the inverted file index lead to pressure to limit its contents. Consequently, frequently occurring words, e.g. articles and prepositions, are often left out. This exclusion may not significantly impact the specificity of a search term in itself, but can reduce the informativeness of a context containing those words.
The field of biomolecular sequence analysis adds new challenges to database pre-processing. Specifically, the growth of DNA and protein sequence databases has fuelled algorithmic developments for mining this data. For example, GenBank is a DNA database containing approximately 36 terabytes of sequence information, 90% of which has been listed in the last 5 years.
The superficial appearance of these sequence databases is as a collection of structured text documents. Although some early methods for sequence analysis were essentially recast string matching algorithms, a class of powerful new methods emerged that met the particular needs of sequence analysis. Particularly important among these were methods for similarity searching, i.e. finding sequences that had patterns in common with a test sequence, wherein the test sequence can be used for inferring homology. However, to model homology well and find evolutionarily divergent related sequences in the database, an algorithm must tolerate a degree of divergence significantly greater than any need that arises in text searching.
A good example of such an algorithm is the Smith-Waterman dynamic programming algorithm. Unfortunately, even with optimization, the running time of the Smith-Waterman dynamic programming algorithm is proportional to the product of the query length and the size of the database (as opposed to the near-linear performance of the string-matching algorithms discussed above). Therefore, the use of the Smith-Waterman dynamic programming algorithm for comprehensive sequence databases remains commercially impractical.
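For illustration, the core recurrence of Smith-Waterman local alignment can be sketched as follows. The scoring values (match +2, mismatch -1, linear gap penalty -1) and the function name `smith_waterman_score` are assumptions chosen for this sketch. Practical implementations typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score under a linear gap penalty."""
    previous = [0] * (len(target) + 1)  # scores for the previous query row
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))
```

Every cell of the query-by-target table is filled in, so the work grows with both sequence lengths. This is what makes an exhaustive scan of a comprehensive database expensive.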
Consequently, there has been an increased interest in the inverted file index approach, even if it lacks the full sensitivity of, for example, the Smith-Waterman dynamic programming algorithm. For example, BLAST (Basic Local Alignment Search Tool), which is a common indexing tool, is based on using an index of all n-mer subsequences of the source database (wherein an n-mer is any string of n characters). The choice of n (for example, n is typically 3 for protein and 11 to 13 for DNA) is a balance between search speed and sensitivity. As n is reduced, the algorithm approaches the sensitivity of the Smith-Waterman dynamic programming algorithm, but its speed advantage is seriously eroded because there are many more references to follow for each n-mer index entry. Another disadvantage of the BLAST indexed approach compared to Smith-Waterman is the time-consuming process involved in building the index in the first place. Specifically, because building the index is itself essentially a sequential scan of the database, the index can significantly accelerate the second and subsequent searches, but the first search may take an unacceptably long time.
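The trade-off in the choice of n can be illustrated with a minimal n-mer index. This is a sketch of the general idea only and is not BLAST's actual implementation. The names `build_nmer_index` and `seed_hits` and the two short sequences are hypothetical.

```python
from collections import defaultdict


def build_nmer_index(sequences, n):
    """Map every n-character substring to its (sequence_id, offset) pairs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(len(seq) - n + 1):
            index[seq[offset:offset + n]].append((seq_id, offset))
    return index


def seed_hits(index, query, n):
    """Return (query_offset, sequence_id, offset) for each shared n-mer."""
    hits = []
    for q_offset in range(len(query) - n + 1):
        for seq_id, offset in index.get(query[q_offset:q_offset + n], []):
            hits.append((q_offset, seq_id, offset))
    return hits


# Hypothetical two-sequence protein database.
database = ["MKTAYIAKQR", "GGTAYLLKQR"]
query = "TAYIAK"
for n in (3, 2):
    hits = seed_hits(build_nmer_index(database, n), query, n)
    print(n, len(hits))
```

In this toy example the shorter n-mers produce more seed hits for the same query, which mirrors the point made above: a smaller n finds more potential matches but leaves more references to follow up.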
In another type of search performed in biomolecular analysis, similar protein sequences are compared on the basis of observed and predicted mass spectra derived from those sequences. Mass spectrometry (MS) has recently become a common and powerful tool for protein analysis, particularly for characterizing novel proteins. In MS, the mass/charge values of ions derived from such proteins are measured, and information about the amino acid sequence is inferred from those values.
However, rather than analyzing proteins directly, MS is often applied to peptides derived from the protein(s) of interest. Peptides are generally easier to handle and more suited to the resolving power of mass spectrometers. Cleavage enzymes can cut the protein at specific amino acids and thereby can yield some information relating to the protein. Further, the shorter the sequence, the more limited the number of amino acid combinations that can yield the observed mass/charge values, which facilitates sequence interpretation.
However, the MS of peptides obtained from proteins (e.g. using tryptic digestion) is not enough by itself to unambiguously determine the sequence of those peptides. Therefore, further information is required. In one approach, called Peptide Mass Fingerprinting (PMF), no further experimental analysis is required. Instead, PMF depends on combining data for all the peptides derived from a protein, and comparing that data to known proteins. Another approach, based on tandem MS, seeks to elicit more information about each individual peptide so that its sequence can be estimated. These approaches will now be further described.
In PMF, masses for the entire set of peptides derived from a protein are compared to a set of masses predicted from peptide sequences. These predicted peptide masses (called a spectrum) are generated computationally from known protein sequences. If most predicted peptide masses match the observed masses (within some tolerance for experimental error), then the parent proteins are putatively considered to match. Typically, in the absence of any further contextual knowledge, the new protein is compared in turn to each of a comprehensive collection of known (or predicted from DNA coding sequences) protein sequences such as SWISS-PROT, TREMBL or NR (NCBI's non-redundant database for protein sequence searching).
The main steps for PMF are as follows. Preprocess the peak list (threshold, filter, normalize, etc.) and determine potential peptide ion mass/charge values. For every database protein sequence, (1) perform in-silico digestion of the database sequence and calculate the mass of each peptide fragment, (2) compare the observed masses of all peptides to all of the peptide masses calculated from the database sequence (within a certain tolerance), (3) score matches by certain algorithm-dependent criteria, and (4) retain candidate matches. Finally, a list of candidate matches can be sorted, ranked, and presented according to score.
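The PMF steps above can be sketched as follows, under several simplifying assumptions: peak preprocessing is omitted, digestion models cleavage after K or R with no missed cleavages, observed values are taken to be neutral monoisotopic peptide masses, and the score is simply the number of observed masses that match. The simple count stands in for the algorithm-dependent scoring criteria. The function names and the two-protein database are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """In-silico digestion: cleave after every K or R."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, predicted, tolerance):
    """Number of observed masses within tolerance of any predicted mass."""
    return sum(any(abs(o - p) <= tolerance for p in predicted) for o in observed)


def rank_candidates(observed, database, tolerance=0.01):
    """Score every database protein and sort by descending score."""
    scored = []
    for name, sequence in database.items():
        predicted = [peptide_mass(p) for p in tryptic_peptides(sequence)]
        scored.append((count_matches(observed, predicted, tolerance), name))
    return sorted(scored, reverse=True)


# Hypothetical database; the "observed" masses are simulated from protA.
database = {"protA": "MKTAYIAKQRGGS", "protB": "GASPVKTCLLR"}
observed = [peptide_mass(p) for p in tryptic_peptides(database["protA"])]
ranked = rank_candidates(observed, database)
print(ranked)
```

Note that the loop in `rank_candidates` digests and compares against every database protein for every query, which is the exhaustive comparison whose cost is discussed in the paragraphs that follow.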
In tandem MS, more sequence information is derived for each individual peptide, thereby allowing reconstruction of the protein sequence. Specifically, peptide ions separated by first-stage MS are individually further fragmented. The second-stage MS of those fragments is a convenient, automated method of analyzing each of those peptides.
The steps for tandem MS analysis of peptide fragment ions are as follows. Determine mass and charge of precursor ion from 1st stage MS. Preprocess tandem MS peak list (threshold, filter, normalize, etc.) and determine potential fragment ion mass/charge values. For every database protein sequence, (1) perform in-silico digestion of database sequence and calculate mass of peptide fragments, (2) compare observed mass of precursor peptide to each of the peptide masses calculated from database sequence for a match (within a certain tolerance), and (3) for every peptide in the database sequence that matches, (a) predict the fragmentation pattern and thereby the expected tandem mass spectrum for that peptide, (b) compare preprocessed experimental spectrum to computationally predicted spectrum for database peptide sequence, (c) score peptide matches by certain algorithm-dependent criteria, and (d) retain candidate peptide matches. Finally, a list of candidate matches can be sorted, ranked, and presented according to score.
Because fragmentation tends to occur in predictable sites between residues along the peptide backbone, and assuming there is good representation of pairwise fragmentation of the peptide between all adjacent pairs of amino acids, a “ladder” can theoretically be obtained from the mass spectrum. In this mass spectrum, each ion peak is separated from an adjacent peak by the mass corresponding to a single amino acid. Knowing the expected mass associated with each type of amino acid means that the sequence can be inferred.
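The ladder reading described above can be sketched for an idealized case. The sketch assumes a single, singly charged ion series with no noise, and it omits isoleucine from the mass table because it has the same mass as leucine and cannot be distinguished by mass alone. The function names, the proton offset used to build the synthetic ladder, and the example peptide are assumptions for illustration.

```python
# Standard monoisotopic residue masses in daltons (I omitted: same mass as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # assumed offset for a singly charged ion series


def ideal_ladder(peptide, offset=PROTON):
    """Synthetic ladder: one peak per prefix of the peptide."""
    peaks, running = [offset], offset
    for residue in peptide:
        running += RESIDUE_MASS[residue]
        peaks.append(running)
    return peaks


def read_ladder(peaks, tolerance=0.01):
    """Infer residues from differences between successive ladder peaks."""
    ordered = sorted(peaks)
    residues = []
    for lower, upper in zip(ordered, ordered[1:]):
        delta = upper - lower
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tolerance]
        residues.append(matches[0] if matches else "?")  # "?" marks a gap
    return "".join(residues)


ladder = ideal_ladder("TAYVAK")
print(read_ladder(ladder))

# Dropping one peak leaves a difference that matches no single residue.
incomplete = ladder[:2] + ladder[3:]
print(read_ladder(incomplete))
```

The second call shows how a single missing peak already breaks the direct reading, which is one reason the approach is described above as theoretical.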
However, due to noise, contaminants, imperfect representation, unpredictable cleavages, protein modifications, experimental error, multiple charges, and other variables, there are potential problems with reliably ascertaining the sequence of the peptides. Consequently, even with the extra information provided by tandem MS, researchers may still choose to compare the peptide spectra to protein sequence databases. In this case, each spectrum is derived from a single peptide sequence and its likely fragmentation patterns.
Note that in both PMF and tandem MS analysis, at least a comparison by mass of every measured peptide ion to every peptide predicted from every database sequence must be performed. This step presents an opportunity for speed-ups through efficient implementation, because it is in this step that the number of candidate peptides is greatly reduced. Moreover, although the subsequent detailed observed-to-predicted spectrum comparisons involve more computations individually than peptide mass comparisons, the spectrum comparisons only need to be done for a small fraction of the database.
However, the comparisons of observed spectra to predicted spectra are necessarily approximate, because of biological variation and sample preparation artifacts, experimental error and variability in the MS measurements, database incompleteness and errors, and imperfect models for predicting the spectra from peptide sequences. Therefore, the algorithms that are used for making the comparisons must be sensitive, comprehensive, and error-tolerant. These characteristics, together with the very large number of peptide mass comparisons discussed above, make the algorithms computationally intensive and slow on large data sets.
Tandem MS analysis, which can generate thousands of spectra for one original sample, can be especially slow. For example, a typical laboratory-class PC available to researchers for MS data analysis may be equipped with a 2.5 GHz Pentium processor and 256 Mbytes of RAM. Using such a computer, a comparison of a single experimental spectrum may take several seconds against all sequences in SWISS-PROT, which is a highly curated and therefore condensed set of known protein sequences. More comprehensive databases such as the NCBI non-redundant protein sequence database would lengthen the search commensurately. By comparison, modern tandem MS instruments are capable of generating several spectra in every second of operation.
Unfortunately, this slow analysis time is compounded by the exponential growth rate of many sequence databases. Even SWISS-PROT, the rate of growth of which is constrained by its policy of rigorous curation, has nevertheless quadrupled since the first programs for tandem MS-based protein database searching appeared (i.e. in about 10 years). Indeed, because of recent increases in genome sequencing and analysis projects, some of the more lightly or automatically annotated sequence databases have been growing much faster, and are now orders of magnitude larger than SWISS-PROT.
The combination of more rapid MS data generation and larger databases threatens the practicality of running protein identification programs routinely on laboratory-class PC-type computers. A partial solution has been to recognize that with multiple experiments that need to be analyzed in the same way, comparisons against the same database entries occur repeatedly, and so a speed-up can be obtained by preprocessing the sequence data. In particular, the peptide sequences that are derived from each protein sequence under a given proteolytic digestion scheme can be generated and the associated peptide masses can be pre-computed prior to any comparison with experimental spectra. Furthermore, those masses and references to the associated peptide sequences can be sorted and indexed by mass. In this manner, the comparison to any measured peptide involves only those index entries within a narrow mass range, i.e. the range centered on the observed peptide mass and extending to the index entries within the expected error. Therefore, the sorting and indexing by mass can significantly reduce the analysis time as now discussed.
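A mass-sorted index with a range lookup can be sketched using binary search from Python's standard `bisect` module. The entry layout (mass, peptide reference), the function names, and the sample entries are assumptions for illustration. The sample masses were computed from standard monoisotopic residue masses plus one water.

```python
import bisect


def build_mass_index(entries):
    """Sort (mass, reference) entries by mass and keep a parallel mass list."""
    index = sorted(entries)
    masses = [mass for mass, _ in index]
    return masses, index


def candidates(masses, index, observed_mass, tolerance):
    """Return the entries with mass in [observed - tol, observed + tol]."""
    lo = bisect.bisect_left(masses, observed_mass - tolerance)
    hi = bisect.bisect_right(masses, observed_mass + tolerance)
    return index[lo:hi]


# Hypothetical pre-computed peptide entries.
entries = [
    (665.37481, "TAYIAK"),
    (277.14601, "MK"),
    (604.33666, "TCLLR"),
    (302.17025, "QR"),
    (557.31729, "GASPVK"),
    (219.08551, "GGS"),
]
masses, index = build_mass_index(entries)
print(candidates(masses, index, 302.17, 0.01))
print(candidates(masses, index, 290.0, 15.0))
```

Each lookup costs two binary searches plus the size of the returned slice, instead of a pass over every peptide. The cost of sorting and of holding the whole index in memory is paid up front.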
In one exemplary operation, a peptide index is generated by modeling a tryptic digest of protein sequences from a database (e.g. SWISS-PROT) in which protein sequences are cleaved in silico (i.e. modeled computationally) after the amino acids lysine (K) and arginine (R). Because missed cleavages are experimentally common, those missed cleavages are modeled up to some limit. For example, peptides containing one or two internal Ks/Rs are also generated.
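The digestion model described above can be sketched as follows. The function name `digest`, the `max_missed` parameter, and the example sequence are hypothetical. Enzyme-specific exceptions that the text does not discuss, such as suppressed cleavage before proline, are deliberately left out of this sketch.

```python
def digest(protein, max_missed=2):
    """Cleave after K or R; also emit peptides spanning missed cleavages.

    Returns (peptide, missed_cleavages) pairs.
    """
    # Fully cleaved fragments.
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])

    # Join runs of adjacent fragments to model 0..max_missed missed cleavages.
    peptides = []
    for first in range(len(fragments)):
        for missed in range(max_missed + 1):
            if first + missed < len(fragments):
                joined = "".join(fragments[first:first + missed + 1])
                peptides.append((joined, missed))
    return peptides


for peptide, missed in digest("MKTAYIAKQRGGS"):
    print(missed, peptide)
```

For this 13-residue example, four fully cleaved peptides become nine once one and two missed cleavages are allowed, which hints at how quickly the index grows when the cleavage constraints are relaxed.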
For a given parent protein sequence database, the size of the peptide index can be measured or estimated. For example, SWISS-PROT Release 43.5 currently has about 56 million amino acids of sequence, in protein sequences with an average length of 370 amino acids, of which about 11% (every 9th amino acid on average) are K or R. In this case, ignoring end-effects due to the sequence actually being divided into 153 thousand discrete protein sequences, and assuming no missed cleavages, the number of generated peptides can be roughly estimated as the total number of amino acids divided by the interval between successive cleavage sites, i.e. 6.2 million in this example. Additionally, if one or two missed cleavage cases are considered, then there will be nearly as many entries again for each case. Therefore, the total number of entries with up to 2 missed cleavages is about 18 million. If it is assumed that an index entry must consist minimally of a 4-byte fixed-point mass value together with a 3-byte pointer to the parent protein sequence and a 1-byte peptide length field, then such a peptide index will need about 150 Mbytes of system RAM to represent it for rapid access in the database searching programs.
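The estimate above can be reproduced with a few lines of arithmetic. The variable names are illustrative. Treating each missed-cleavage case as exactly the same size as the fully cleaved case is an assumption that slightly overestimates the "nearly as many" entries described above.

```python
total_residues = 56_000_000   # amino acids in the database, from the text
cleavage_interval = 9         # one K or R about every 9th residue
max_missed = 2                # model 0, 1 and 2 missed cleavages
bytes_per_entry = 4 + 3 + 1   # mass + protein pointer + peptide length

fully_cleaved = total_residues / cleavage_interval
entries = fully_cleaved * (max_missed + 1)   # upper-bound approximation
index_bytes = entries * bytes_per_entry

print(f"{fully_cleaved / 1e6:.1f} million fully cleaved peptides")
print(f"{entries / 1e6:.1f} million index entries (upper bound)")
print(f"{index_bytes / 1e6:.0f} million bytes for the index")
```

The result is in line with the figures of about 18 million entries and about 150 Mbytes given above.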
Several programs have used such indexes for MS analysis of peptides. These programs include MOWSE (Molecular Weight Search software developed by Darryl Pappin and Alan Bleasby that can be used for PMF analysis) and Turbo-SEQUEST (protein identification software from Thermo Electron Corp. that can be used for tandem MS analysis).
As described above, a tryptic digest of SWISS-PROT approaches the upper limit of the memory available (allowing for operating system and application software overhead) of a computer with 256 Mbytes of RAM. Doubling or quadrupling this amount of memory will allow for some growth in the databases and/or for more enzymes that cut more frequently.
However, orders-of-magnitude expansion of the storage requirements may still occur. First, storage requirements may significantly increase when source databases are much larger than the 153,000 entries in the SWISS-PROT database. Such source databases include, for example, the NCBI non-redundant database currently including 1.8 million protein entries or the human portion of the dbEST database currently including 5.6 million mRNA-derived entries. Second, storage requirements may also significantly increase when the constraints on peptide generation are relaxed. Specifically, because proteolytic digestion is in practice imperfect, it is desirable to generate predicted peptides with at least one end unconstrained to specific cleavage sites. This constraint reduction explodes the number of generated peptide sequences retained in an index.
These two cases are beyond the capacity of standard laboratory computers. Additionally, these cases significantly increase the time required to do the in-silico generation of peptides and subsequent indexing. Because of the expense, complexity, and maintenance requirements of supercomputing installations, it is preferable to avoid the route of scaling the hardware to match the larger database sizes.
Therefore, a need arises for a database comparison method suitable even for standard computers. Preferably, this database comparison method can work with smaller portions of the index depending on the experimental data, yet continue to be well matched to the spectrum acquisition rate of the MS instrumentation.
Similar considerations apply to the indexed searches for text-based information retrieval and to BLAST-style indexed sequence similarity searching. In both cases, there are constraints on the index that are due to the size of the source databases. Those constraints may lead to compromises on the contents of the index (e.g. references to frequently occurring words, or the n-mer size for BLAST indexes) and hence to reduced sensitivity or recall in searches using those indexes.
Therefore, a further need arises for a means of searching large datasets efficiently as in an indexed approach, but without the burdensome storage space, index-building run-time, and latency to first search requirements of a full index search.