The present invention relates to identification of peptides based on their mass spectrometry (MS) characteristics.
High-throughput proteomic technologies seek to characterize the state of the proteome in a cell population in much the same manner that DNA microarrays seek to characterize the state of gene expression in a cell population. Characterization of the proteins can be done using several different methods, one of which is to first digest the proteins, typically using trypsin, into peptides which are then analyzed using tandem mass spectrometry (MS/MS). A typical procedure may involve extracting cellular proteins, followed by tryptic digestion, and then separating the peptides with liquid chromatography. The separated peptides are then identified by MS/MS. Ideally, peptides will subsequently be quantitated, post-translational modifications will be determined, and the information regarding the peptides will be assembled into a picture of the proteomic state of a cell population.
Just as with DNA microarrays, quality assurance of the high-throughput process is of paramount importance in order for proteomics to be of value to biologists. If peptides are initially identified poorly, then this information, together with the information on post-translational state and quantitation of protein expression, is not of much value. For this reason, there has been much work recently on developing peptide identification methods for MS/MS spectra. This area of research has proceeded on two fronts, the first of which seeks to take advantage of the wide availability of genome sequences. The database search methods try to identify the peptide that resulted in the observed MS/MS spectrum by picking the best candidate from a list of peptides generated from the genome sequence (e.g. Eng, K.; McCormack, A. L.; Yates, J. R. I. J Am Soc of Mass Spec 1994, 5, 976-989). De novo methods, on the other hand, seek to sequence and hence identify a peptide simply from the observed MS/MS spectrum (e.g. Dančík, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. J. Comput. Biol. 1999, 6, 327-342; “Dančík et al.” herein). Regardless of which approach is used, it is essential to have a method for scoring each peptide so that accurate and reliable identifications can be made.
SEQUEST, for example, scores peptides by calculating the overlap integral between a model spectrum for a peptide and the experimental spectrum. Both the model spectrum and the experimental spectrum are transformed into continuous functions in order to calculate the overlap integral. This approach has been successful as measured by the number of labs that use it. However, interpretation of the scores is not straightforward, and statistical confidence in the identification of the highest-scoring peptide remains in question. Criteria based on experience and on a more rigorous statistical analysis have been proposed to construct scoring thresholds above which an identification should be accepted.
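As an illustration of the general idea of an overlap integral between two spectra, the following minimal Python sketch may be helpful. It is not the SEQUEST algorithm. It assumes, purely for illustration, that each peak is broadened into a Gaussian of fixed width to obtain a continuous function, that the integral is approximated numerically on a uniform m/z grid, and that the result is normalized to lie between 0 and 1. The function names, the peak width, the grid step, and the m/z values in the usage note are all hypothetical choices and do not come from the text above.

```python
import math


def to_continuous(peaks, grid, width):
    """Sample a sum of Gaussians, one per (mz, intensity) peak, on the grid.

    Assumption: Gaussian broadening is an illustrative choice only.
    """
    return [
        sum(inten * math.exp(-0.5 * ((x - mz) / width) ** 2) for mz, inten in peaks)
        for x in grid
    ]


def overlap_score(model_peaks, observed_peaks,
                  mz_min=0.0, mz_max=2000.0, step=0.1, width=0.5):
    """Normalized overlap integral of two peak lists (hypothetical helper).

    Each argument is a list of (mz, intensity) pairs. Returns a value in
    [0, 1]; returns 0.0 if either spectrum has no signal on the grid.
    """
    n_points = int(round((mz_max - mz_min) / step)) + 1
    grid = [mz_min + i * step for i in range(n_points)]

    f = to_continuous(model_peaks, grid, width)
    g = to_continuous(observed_peaks, grid, width)

    # Riemann-sum approximations of the integrals of f*g, f*f and g*g.
    fg = sum(a * b for a, b in zip(f, g)) * step
    ff = sum(a * a for a in f) * step
    gg = sum(b * b for b in g) * step

    norm = math.sqrt(ff * gg)
    return fg / norm if norm > 0.0 else 0.0
```

For example, calling `overlap_score` with a model peak list such as `[(175.1, 1.0), (304.2, 1.0), (417.2, 1.0)]` and an observed peak list that shares only some of those peaks gives a score strictly between 0 and 1, while identical peak lists score approximately 1. A raw score of this kind illustrates the interpretation problem noted above: the number by itself does not say how likely the identification is to be correct.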
Dančík et al. developed a more rigorous scoring scheme for use with de novo sequencing of peptides. De novo sequencing methods have not been as widely used as methods that identify the best peptides from a candidate list, for several reasons. First, MS/MS spectra often do not contain enough information to allow for unambiguous determination of the entire peptide sequence. It has been estimated that 50% of spectra are missing so many peaks that only partial interpretation is possible. Second, de novo approaches can be computationally intensive, and computational cost is an important criterion for high-throughput proteomics. Still, there is a significant need for de novo sequencing methods because often the most biologically interesting peptides, such as those containing mutations and frame-shifts, may not be in the sequence database to begin with. This will be especially true in clinical or field settings where the genome of the organism being studied differs from the genome of the organism that was sequenced.
An ideal MS/MS spectral analysis would have several desirable features. The scoring method would ideally report, as the score, the probability that a spectrum is due to a particular peptide. Short of that, the scoring would include a rigorous test of the significance of the results. The scoring method should also be well characterized with respect to its rates of both false positive and false negative identifications. In addition, it is desirable to have a combined analysis in which partial peptide sequences determined de novo can be scored alongside peptides obtained from a sequence-specific peptide database in a statistically meaningful manner. Such an ideal computational analysis would have the speed seen with database peptide identification programs, the unbiased nature of a de novo method, and statistically rigorous scoring.