Tandem mass spectrometry (MS/MS) has become the leading high-throughput technology for protein identification. A tandem mass spectrometer is capable of ionizing a mixture of peptides with different sequences and measuring their respective parent mass/charge ratios, selectively fragmenting each peptide into pieces and measuring the mass/charge ratios of the fragment ions. Thus, a tandem mass spectrum can be viewed as a collection of fragment masses from a single peptide. This set of mass values is a “fingerprint” that identifies the peptide. The peptide sequencing problem is then to derive the sequence of the peptides given their MS/MS spectra. For an ideal fragmentation process and an ideal mass spectrometer, the sequence of a peptide could be easily determined by converting the mass differences of the consecutive ions in a spectrum to the corresponding amino acids. This ideal situation would occur if the fragmentation process could be controlled so that each peptide was cleaved between every two consecutive amino acids and a single charge was retained on only the N-terminal piece. In practice, however, the fragmentation processes in mass spectrometers are far from ideal.
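The ideal case described above can be illustrated with a minimal sketch. It assumes a complete ladder of N-terminal fragment masses (including 0 and the full peptide mass) and uses nominal integer residue masses for simplicity. The table and the function name `decode_ideal_ladder` are illustrative assumptions, and isobaric residues (I/L, K/Q) are reported as a single letter.

```python
# Nominal (integer) residue masses; an illustrative simplification.
# I is folded into L and Q into K because they share a nominal mass.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
MASS_TO_RESIDUE = {mass: aa for aa, mass in NOMINAL_MASS.items()}


def decode_ideal_ladder(prefix_masses):
    """Read a peptide off an ideal, complete ladder of N-terminal fragment masses."""
    ladder = sorted(prefix_masses)
    residues = []
    for lower, upper in zip(ladder, ladder[1:]):
        diff = upper - lower
        if diff not in MASS_TO_RESIDUE:
            # A real spectrum has missing and spurious peaks, so this fails in practice.
            raise ValueError(f"mass difference {diff} matches no amino acid")
        residues.append(MASS_TO_RESIDUE[diff])
    return "".join(residues)


print(decode_ideal_ladder([0, 57, 128, 215, 312]))  # GASP
```

With real spectra, missing peaks, noise peaks, and multiple ion types break this simple reading, which is why the sketch raises an error as soon as a mass difference cannot be explained.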
The problem for tandem mass spectrometry peptide sequencing is: given a spectrum S, the ion types Δ, and the mass m, find a peptide of mass m with the maximal match to spectrum S. Peptide fragmentation in a tandem mass spectrometer can be characterized by a set of numbers Δ={δ1, . . . , δk} representing ion types. A δ-ion of a partial peptide P′⊂P is a modification of P′ that has mass m(P′)-δ. For tandem mass spectrometry, the theoretical spectrum of peptide P can be calculated by subtracting all possible ion types {δ1, . . . , δk} from the masses of all partial peptides of P (i.e., every partial peptide generates k masses in the theoretical spectrum). An (experimental) spectrum S={s1, . . . , sm} is a set of masses of fragment ions. A match between spectrum S and peptide P is the number of masses that experimental and theoretical spectra have in common.
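The definitions above can be sketched in a few lines. This is a hedged illustration, not a prescribed implementation: it assumes that the partial peptides are the proper prefixes and suffixes of P, uses nominal integer residue masses so that masses can be compared exactly, and treats the ion-type offsets passed in as hypothetical values. The function names are illustrative.

```python
# Nominal (integer) residue masses; an illustrative simplification.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def partial_peptide_masses(peptide):
    """Masses of all proper prefixes and suffixes (assumed partial peptides)."""
    residue_masses = [NOMINAL_MASS[aa] for aa in peptide]
    total = sum(residue_masses)
    prefixes = []
    running = 0
    for mass in residue_masses[:-1]:
        running += mass
        prefixes.append(running)
    suffixes = [total - p for p in prefixes]
    return prefixes + suffixes


def theoretical_spectrum(peptide, ion_types):
    """Subtract every ion-type offset from every partial peptide mass."""
    return {m - delta for m in partial_peptide_masses(peptide) for delta in ion_types}


def match_count(spectrum, peptide, ion_types):
    """Number of masses shared by the experimental and theoretical spectra."""
    return len(set(spectrum) & theoretical_spectrum(peptide, ion_types))


# Hypothetical ion-type offsets 0 and 17, and a made-up four-peak spectrum.
print(match_count([57, 111, 215, 300], "GASP", [0, 17]))  # 3
```

In practice masses are real-valued and are compared within a tolerance rather than for exact equality; the integer masses here only keep the example short.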
Recent progress in mass spectrometry instrumentation has produced LTQ-FT mass spectrometers that can generate on the order of 100,000 spectra per day per machine. Software is a significant and limiting factor in mass spectrometry proteomics analysis: typical large datasets may require days or weeks of computational time on expensive computers or grids.
Most peptide identification algorithms use database search methods that match the spectra against a protein database. Existing database search methods in mass spectrometry, such as SEQUEST (U.S. Pat. No. 6,017,693, which is incorporated herein by reference) and MASCOT, match spectra against a sequence database to identify the peptides. FIG. 1 illustrates an exemplary process for a spectrum matching technique for peptide identification. Specifically, a sample 12 is provided to a tandem mass spectrometer 14. A two-step process is illustrated; however, single-step processes are also known. In the first mass spectrometer, a peptide ion is selected, so that a targeted component of a specific mass is separated from the rest of the sample 14a. The targeted component is then activated or decomposed. In the case of a peptide, the result will be a mixture of the ionized parent peptide (“precursor ion”) and component peptides of lower mass which are ionized to various states. A number of activation methods can be used, including collisions with neutral gases (also referred to as collision-induced dissociation). The parent peptide and its fragments are then provided to the second mass spectrometer 14c, which outputs an intensity and m/z for each of the plurality of fragments in the fragment mixture. This information can be output as a fragment mass spectrum 16. In the spectrum 16, each fragment ion is represented as a bar whose abscissa value indicates the mass-to-charge ratio (m/z) and whose ordinate value represents intensity. In the process, sub-sequences contained in the protein sequence library 20 are used as a basis for predicting a plurality of mass spectra 22. The predicted mass spectra 22 of the sub-sequences are compared 24 to the experimentally-derived fragment spectrum 16 to identify one or more of the predicted mass spectra which most closely match the experimentally-derived fragment spectrum 16.
A report containing one or more of the matching “potential” sub-sequences is output to a monitor, printer, or other viewing means 28 and/or the data is stored in a storage medium 26 for subsequent retrieval for further processing or viewing.
While these spectrum matching tools are invaluable, they are too slow for matching large MS/MS datasets against large protein databases. Since SEQUEST compares every spectrum against every database peptide, it would take a cluster of about 60 processors to analyze in real time the spectra produced by one of the newer mass spectrometers (if searching through the Swiss-Prot database). If one were to attempt a time-consuming search for post-translational modifications, the running time could increase further by orders of magnitude. One of the major problems in tandem mass spectrometry is the lack of a concrete theoretical probability model. As a result, spectra must be searched against a random decoy (negative control) database to empirically estimate the error rates (often represented by Poisson, Gaussian, hypergeometric, or other approximations of tails of score distributions), as opposed to the analytically derived and database-independent error rates available in genomics tools such as BLAST. In fact, the Proteomics Publication Guidelines recommend searching in decoy databases to determine the statistical significance of peptide identifications. The rationale behind using a decoy database is to estimate the number of spectra that match the database by chance. If a spectrum S has probability p(S) of matching a random database, then a decoy database is simply a time-consuming way to evaluate Σp(S) over all spectra in the dataset. This sum represents the expected number of hits in the decoy database but is not a good way to estimate individual probabilities p(S).
From one perspective, use of decoy databases can be seen as an acknowledgment of an inability to solve the following problem: Given a spectrum S and a score threshold T for a spectrum-peptide scoring function, find the probability that a random peptide matches the spectrum S with score equal to or larger than T.
One proposed solution to this Spectrum Matching Problem takes a heuristic approach based on approximating the tail of the score distribution. Solving the Spectrum Matching Problem is equivalent to computing the False Positive Rates (FPR) of spectral matches. FPR is a property of an individual spectrum as opposed to the False Discovery Rate (FDR), which is the property of multiple spectra (proportion of incorrect identifications among all identifications judged correct).
Search in a decoy database appears to be an attractive approach for approximating the solution of the Spectrum Matching Problem as m/n, where m is the number of matches between the spectrum and the decoy database of size n (with scores equal to or larger than the threshold T). However, for an individual spectrum, the number of matches for typical n is usually zero, thus making this approach problematic (decoy and target databases usually have the same size). To obtain reliable FPR for an individual spectrum, one can increase n (e.g., making giant decoy databases 1000 times larger than target databases). Since this is impractical, some existing approaches bundle all spectra with the same score to evaluate the FDR of all spectra in the bundle and to use FDR as a surrogate for FPR. However, this approach can be a dangerous oversimplification because spectra with the same score may have vastly different FPRs, thus suggesting that careful analysis of all peaks in the spectrum (rather than the scores alone) may be necessary to compute the database matching statistics for individual spectra.
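The m/n estimate and its weakness for individual spectra can be made concrete with a short sketch. The function names, scores, and probability values below are hypothetical and chosen only for illustration.

```python
def decoy_fpr_estimate(decoy_scores, threshold):
    """Empirical estimate m/n: fraction of decoy peptides scoring at or above T."""
    n = len(decoy_scores)
    if n == 0:
        raise ValueError("decoy database is empty")
    m = sum(1 for score in decoy_scores if score >= threshold)
    return m / n


def expected_decoy_hits(true_probability, decoy_size):
    """Expected number of decoy peptides reaching the threshold for one spectrum."""
    return true_probability * decoy_size


# Hypothetical scores of one spectrum against a four-peptide decoy database.
print(decoy_fpr_estimate([1, 2, 3, 4], threshold=3))   # 0.5
print(decoy_fpr_estimate([1, 2, 3, 4], threshold=10))  # 0.0

# Hypothetical numbers: if the true probability is 1e-9 and the decoy database
# holds 1e6 peptides, about 0.001 hits are expected, so m is almost always 0.
print(expected_decoy_hits(1e-9, 1_000_000))
```

An estimate of exactly zero carries no information about how small the true probability is, which is why the decoy database would have to be enlarged by orders of magnitude to resolve it.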
Although the target-decoy search strategy is currently viewed as the best way to distinguish between the correct and false identifications, this approach has a number of shortcomings, not the least of which is the effective doubling of the search time.
In addition to the complications resulting from the use of decoy databases, another problem with current methods is that a protein database is not always available, for example, when the samples are derived from an organism with an unknown proteome. In these cases, de novo peptide sequencing algorithms are required.
De novo peptide sequencing represents a fast alternative to MS/MS database search. While the best de novo algorithms are orders of magnitude faster than the fastest database search tools (even on moderately sized databases), they are less accurate. However, the superior accuracy of the database search tools becomes less pronounced with the increase in the database size. Thus, searches in very large databases represent an important niche where de novo based approaches are more accurate and orders of magnitude faster than the traditional database search approaches.
A number of de novo methods have been developed, including Lutefisk, SHERENGA, PepNovo, PEAKS, EigenMS, NovoHMM and PILOT. A commonly used technique in de novo methods is the spectrum graph approach, where a spectrum is represented as a graph with peaks as vertices that are connected by edges if their mass difference corresponds to the mass of an amino acid. The vertices of the spectrum graph are further scored based on peak intensities and neutral losses, and a peptide sequence is obtained by finding a longest path in the graph. This has been achieved using diverse optimization methods including branch and bound search (Lutefisk), dynamic programming (SHERENGA, PEAKS, PepNovo, NovoHMM), spring models (EigenMS), and integer programming (PILOT).
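A minimal sketch of the spectrum graph idea follows. It is not the algorithm of any of the tools named above. It assumes nominal integer residue masses, vertices given as a hypothetical mapping from mass to score, and a dynamic program that finds the highest-scoring path from mass 0 to the parent mass.

```python
# Nominal (integer) residue masses; an illustrative simplification.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
MASS_TO_RESIDUE = {mass: aa for aa, mass in NOMINAL_MASS.items()}


def best_path_peptide(vertex_scores, parent_mass):
    """Return (peptide, score) for the best-scoring path from 0 to parent_mass, or None."""
    scores = dict(vertex_scores)
    scores.setdefault(0, 0.0)            # source vertex
    scores.setdefault(parent_mass, 0.0)  # sink vertex
    vertices = sorted(v for v in scores if 0 <= v <= parent_mass)

    best = {0: scores[0]}  # best path score ending at each reachable vertex
    back = {}              # vertex -> (previous vertex, residue on the edge)
    for v in vertices:     # increasing mass is a topological order of the graph
        if v not in best:
            continue       # not reachable from the source
        for mass, aa in MASS_TO_RESIDUE.items():
            w = v + mass   # an edge exists if w is also a vertex
            if w in scores and w <= parent_mass:
                candidate = best[v] + scores[w]
                if w not in best or candidate > best[w]:
                    best[w] = candidate
                    back[w] = (v, aa)

    if parent_mass not in best:
        return None
    residues = []
    v = parent_mass
    while v != 0:
        v, aa = back[v]
        residues.append(aa)
    return "".join(reversed(residues)), best[parent_mass]


# Hypothetical vertex scores; 71 is a lower-scoring competing vertex.
print(best_path_peptide({57: 1.0, 71: 0.5, 128: 2.0, 215: 1.5}, 312))  # ('GASP', 4.5)
```

Because every edge goes from a smaller mass to a larger one, the graph is acyclic and the best path can be found in a single pass over the vertices in increasing mass order.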
De novo peptide sequencing can be viewed as a search in the database of all possible peptides. Even if this time-consuming search were feasible, it would remain unclear which peptide in the database of all peptides represents the actual peptide that generated the spectrum. It is estimated that in about half of the cases, the existing database search tools will fail to identify the correct peptide since its score will be lower than the score of an incorrect peptide. For a typical spectrum identified in a database search, there may be hundreds, and even thousands, of very different peptides that “explain” the spectrum better. As a result, any de novo peptide sequencing algorithm should output multiple peptide reconstructions rather than a single reconstruction. Matching these peptides against a database results in a hybrid de novo based database search that bypasses the time-consuming matching of spectra against the database.
Similar to generating the covering set of tags (which in most applications are limited to tags of length 3), one can attempt to generate covering sets of full-length peptide reconstructions that contain the correct peptide with high probability, i.e., a “spectral dictionary”. Spectral dictionaries take the peptide sequence tag approach one step further by generating peptide reconstructions and ensuring that one of them is correct. They also have the potential to improve the filtration efficiency of tag-based tools; for example, the filtration efficiency of 1000 de novo reconstructions of length 10 is orders of magnitude higher than that of even a single tag of length 3. However, while spectral dictionaries have important advantages over spectral tags, generating them remains an open problem.
Spectral dictionaries may have an edge over the traditional MS/MS approaches in searching very large databases, e.g., six-frame translations of entire genomes. Various proteogenomic studies have demonstrated that MS/MS search against a six-frame translation of the genome allows one to use MS/MS data for finding new genes, predicting programmed frame shifts, correcting DNA sequencing errors, etc. However, existing MS/MS database search tools are impractical for searches against the six-frame translation of large genomes such as the human genome (~3 billion amino acids after removing repeats). Indeed, most early proteogenomic studies were limited to searches against the six-frame translations of bacterial genomes. The largest proteogenomic analysis conducted so far was the search against the six-frame translation of Arabidopsis thaliana, which resulted in the discovery of nearly 400 new genes using InsPecT. However, InsPecT cannot be scaled to search the 20-times larger six-frame translation of the human genome.
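For readers unfamiliar with the term, a six-frame translation can be sketched as follows. The sketch assumes the standard genetic code, translates stop codons as "*" and codons containing unrecognized characters as "X", and uses illustrative function names.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


print(six_frame_translation("ATGGCCTAA"))
# ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

Every nucleotide contributes to six candidate protein sequences, so the translated search space is roughly twice the genome length in amino acids, which is what makes such databases so large.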
Spectral dictionaries make the size of the database almost irrelevant since the spectral dictionary can be matched against the six-frame translation as efficiently as against a much smaller database of known proteins. Since many genes remain unidentified even in the well studied organisms, the searches in six-frame translation represent a valuable tool for proteogenomic annotations. Spectral dictionaries are also helpful in searches for fusion peptides that are common in tumor proteomes but not explicitly present in protein databases.
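One way to see why database size matters less here is that the dictionary consists of plain peptide strings, so it can be matched by exact text lookup rather than by spectrum scoring. The following sketch assumes a small in-memory database given as a mapping from sequence name to sequence. The function name and data are hypothetical, and a production tool would more likely use an indexed or automaton-based matcher.

```python
from collections import defaultdict


def find_dictionary_hits(dictionary, database):
    """Return sorted (peptide, sequence name, start) for exact dictionary matches."""
    # Group reconstructions by length so each window needs one set lookup.
    by_length = defaultdict(set)
    for peptide in dictionary:
        if peptide:  # ignore empty reconstructions
            by_length[len(peptide)].add(peptide)

    hits = []
    for name, sequence in database.items():
        for length, peptides in by_length.items():
            for start in range(len(sequence) - length + 1):
                window = sequence[start:start + length]
                if window in peptides:
                    hits.append((window, name, start))
    return sorted(hits)


# Hypothetical reconstructions for one spectrum and a toy two-protein database.
dictionary = ["GASP", "AGSP", "KSP"]
database = {"prot1": "MKGASPQ", "prot2": "TTKSPA"}
print(find_dictionary_hits(dictionary, database))
# [('GASP', 'prot1', 2), ('KSP', 'prot2', 2)]
```

In this sketch, no spectrum is scored during the scan: each database position costs one set lookup per peptide length, however many reconstructions the dictionary holds.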
Spectral dictionaries allow every MS/MS database search tool to be turned into de novo peptide sequencing software (by simply running the tool on all peptides from the spectral dictionary and selecting the top-scoring peptide). After such “conversion”, it may be possible to estimate how well both database search tools and de novo tools would perform on very large databases. This experiment, however, yields a disappointing performance of both de novo and database search tools: only 35% to 42% of peptides of length 10 were correctly reconstructed in such experiments (35%, 38%, and 42% for X!Tandem, PepNovo, and InsPecT, respectively).
The key unsolved problem is how many reconstructions must be generated to avoid losing the correct peptide. Generating too few peptides will lead to false negative errors while generating too many peptides will lead to false positive errors. Some de novo algorithms output a single or a fixed number (decided before the search) of peptides. For some spectra, generating only one reconstruction may be enough to guarantee finding the correct peptide while in other cases (even with the same parent mass), a thousand reconstructions may be insufficient.
The problem of generating varying numbers of reconstructions for each spectrum becomes particularly important for long peptides because of the increasing complexity of the search space. For example, the accuracy of PepNovo (i.e., the percentage of correctly reconstructed amino acids) falls sharply as peptide length increases, from 89% for peptides of length 7 to 50% for peptides of length 20. As a result, PepNovo correctly reconstructs 59% of peptides of length 7 and only 8% of peptides of length 20.
A recently disclosed method addressed de novo peptide sequencing for data acquired from FT-ICR instruments when both the parent mass and the peak positions are accurate. However, acquiring such spectra can be expensive and time-consuming. An intermediate approach is to acquire mass spectra with high precision at MS1 stage and lower precision at MS/MS stage, giving accurate parent mass but inaccurate peak positions. However, the existing de novo search methods are aimed toward low accuracy ion trap mass spectrometers that have parent mass errors on the order of 1 Dalton. Since vertices in the spectrum graph are constructed based on low accuracy peaks, it is not clear in these algorithms how to exploit the accurate parent mass information that is available from new high accuracy instruments. In other words, it is not clear how to incorporate “high accuracy parent mass/low accuracy MS/MS spectrum” data into the existing de novo approaches.
In view of the numerous shortcomings of existing methods, the need remains for faster and more accurate methods for interpreting tandem mass spectra for peptide identification. The present invention is directed to such methods.