Mass Spectrometry (MS) combined with database searching has become the preferred method for identifying proteins in the context of proteomics projects (See, e.g., Fenyo Beavis, Proteomics, A Trends Guide, July 2000, 22-26 Elsevier). In a typical proteome project, the proteins of interest are separated by one- or two-dimensional gel electrophoresis, or they can be provided as mixtures of a small number of proteins fractionated by column chromatography. The proteins are then digested into peptides by an enzyme, e.g., trypsin. The measurement of the masses of the peptides thus obtained provides a peptide mass fingerprint (PMF). Such a PMF can be used to search a database or can be compared to another experimental PMF (See, e.g., Zhang, W. and Chait, B. T. 2000: ProFound: an expert system for protein identification using mass spectrometric peptide mapping information, Anal. Chem., 72:2482-2489, and James, P. ed. 2000: Proteome Research: Mass Spectrometry, Springer, Berlin). In certain circumstances, PMFs are not specific enough to the original protein to permit its unambiguous identification. In such cases, a second procedure may be applied, such as fragmentation (also referred to as dissociation) of the peptides (See, e.g., Papayannopoulos, I. A. 1995: The interpretation of collision-induced dissociation mass spectra of peptides, Mass Spectrometry Reviews, 14:49-73), which breaks the peptides into smaller molecules whose masses are measured. This procedure is called tandem mass spectrometry, tandem-MS, MS2 or MS/MS. The masses of the fragments constitute a very specific data set that is used to identify the original peptide. By extension, the MS/MS data for several peptides of a protein constitute a very specific data set that is used to identify the original protein (See, e.g., Henzel, W. J. et al. 1993: Identifying protein from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases, Proc. Natl. Acad. Sci. USA, 90:5011-5015, McCormack, A. L. et al. 1997: Direct analysis and identification of proteins in mixture by LC/MS/MS and database searching at the low-femtomole level, Anal. Chem., 69:767-776, James, P. ed. 2000: Proteome Research: Mass Spectrometry, Springer, Berlin).
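By way of illustration only, the digestion and mass measurement steps described above can be mimicked in silico. The following minimal Python sketch assumes the commonly used simplified trypsin rule (cleavage after K or R unless the next residue is P), no missed cleavages, no modifications, and neutral monoisotopic masses. The function names and the example sequence are hypothetical and are not taken from any cited system.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residues of a free peptide


def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P.

    Assumption: complete digestion, i.e. no missed cleavages.
    """
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def peptide_mass_fingerprint(sequence):
    """Sorted list of peptide masses: a theoretical PMF for the sequence."""
    return sorted(peptide_mass(p) for p in tryptic_digest(sequence))


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to exercise the functions.
    for pep in tryptic_digest("MKRPAGKSR"):
        print(pep, round(peptide_mass(pep), 4))
```

A theoretical PMF computed in this way for each database entry is what an experimental PMF would be compared against; real digests additionally involve missed cleavages and modifications, which this sketch ignores.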
Embodiments of the present invention improve the identification of the peptides based on MS/MS data, which comprise the measurement of the parent peptide mass and the measurement of the masses of its fragments.
A very common procedure when searching a database of biological sequences with mass spectrometry data (See, e.g., Snyder, A. P. 2000: Interpreting Protein Mass Spectra, Oxford University Press, Washington D.C.) is to compare the experimental spectra with theoretical spectra generated from the biological sequences stored in the database (See, e.g., James, P. ed. 2000: Proteome Research: Mass Spectrometry, Springer, Berlin). A scoring system is used to rate the match between theoretical and experimental data. Typically, the database entry with the highest score is taken as the correct representation of the experimental data. Ideally, the score is supplemented by a p-value estimating the probability of obtaining an equal or higher score by random chance alone. The p-value gives a measure of confidence in a match found in the database.
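The general procedure just described can be sketched as follows. This Python example is purely illustrative and does not reproduce any of the scoring systems discussed in this document: it assumes singly charged b and y fragments only, uses a simple shared peak count as the score, and estimates the p-value empirically from scores obtained against random (decoy) peptides. All function names, the tolerance value, and the toy inputs are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_masses(peptide):
    """Theoretical m/z values of singly charged b and y ions (assumption:
    no other ion types, no neutral losses, no modifications)."""
    residues = [RESIDUE_MASS[aa] for aa in peptide]
    fragments = []
    for i in range(1, len(residues)):
        fragments.append(sum(residues[:i]) + PROTON)          # b ion, first i residues
        fragments.append(sum(residues[i:]) + WATER + PROTON)  # y ion, remaining residues
    return fragments


def shared_peak_count(peaks, peptide, tol=0.5):
    """Score: number of theoretical fragments with an experimental peak within tol."""
    return sum(
        any(abs(peak - frag) <= tol for peak in peaks)
        for frag in fragment_masses(peptide)
    )


def best_match(peaks, candidates, tol=0.5):
    """Return the candidate peptide with the highest score."""
    return max(candidates, key=lambda pep: shared_peak_count(peaks, pep, tol))


def empirical_p_value(score, random_scores):
    """Estimated probability of an equal or higher score by chance alone,
    with the usual +1 correction so the estimate is never exactly zero."""
    hits = sum(1 for s in random_scores if s >= score)
    return (hits + 1) / (len(random_scores) + 1)


if __name__ == "__main__":
    peaks = [58.03, 147.11]  # hypothetical experimental peak list
    print(best_match(peaks, ["AK", "GK"]))
    print(empirical_p_value(5, [1, 2, 5, 0]))
```

A shared peak count is chosen here only because it is the simplest possible score; it ignores peak intensities, charge states, and fragment series, which is precisely the kind of information a more selective scoring system would exploit.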
To date, the common practice for evaluating or scoring peptide matches has been manual analysis of spectra by trained technicians. While such methods are suitable for some mass spectrometry applications, manual analysis is a bottleneck in high-throughput environments. Moreover, data quality cannot be steadily maintained in high-throughput settings, which causes automatic systems for scoring matches to suffer from low accuracy. High-throughput systems for processing mass spectrometry data thus call for high-quality scoring systems.
Scoring systems have several goals to meet. For example, one may be interested in searching large databases, such as an entire genome, as well as in detecting low-abundance proteins. Large databases require a very low rate of false positives, since erroneous peptide matches would otherwise be too numerous. This stresses the need for a very selective scoring system. For low-abundance proteins, the MS data obtained are generally of lower quality than those obtained for high-abundance proteins. This in turn stresses the need for a very sensitive scoring system.
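The dependence of the selectivity requirement on database size can be made concrete with simple arithmetic. The short Python sketch below assumes, for illustration only, that each candidate comparison yields a false match independently with the same probability; the function names and the numbers used are hypothetical.

```python
def expected_false_positives(n_candidates, false_positive_rate):
    """Expected number of erroneous matches when every candidate comparison
    has the same per-comparison false-positive rate (assumption)."""
    return n_candidates * false_positive_rate


def required_rate(n_candidates, tolerated_false_positives=1.0):
    """Per-comparison false-positive rate needed to keep the expected number
    of erroneous matches at or below the tolerated value."""
    return tolerated_false_positives / n_candidates


if __name__ == "__main__":
    rate = 1e-4  # hypothetical per-comparison false-positive rate
    for n in (1_000, 1_000_000):
        print(n, expected_false_positives(n, rate), required_rate(n))
```

Under these assumptions, a rate that produces a fraction of one expected false match against a thousand candidates produces about a hundred against a million candidates, which is why larger databases call for a more selective score.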
Currently available scoring systems lack selectivity because they can only take into consideration a small portion of the information available from mass spectra. For example, Bafna and Edwards (See, e.g., Bafna, V. and Edwards, N. 2001: SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database, Bioinformatics, 17:S13-S21) consider only fragment masses, do not rely on parent peptide charge, and do not calculate the likelihood ratio of observing a correct match versus observing a random match. Bafna and Edwards also do not attempt to detect global patterns corresponding to structural constraints resulting from physical principles, such as series of consecutive fragment matches. The same can be said for the scoring systems presented in Dancik et al. (See, e.g., Dancik, V., Addona, T. A., Clauser, K. R., Vath, J. E. and Pevzner, P. A. 1999: De novo peptide sequencing via tandem mass spectrometry: a graph-theoretical approach, J. Comp. Biol., 6:327-342) and Havilio et al. (See, e.g., Havilio, M., Haddad, Y. and Smilansky, Z. 2003: Intensity-based statistical scorer for tandem mass spectrometry, Anal. Chem., 75:435-444), and for other systems such as those disclosed in European Patent Application No. EP 1 047 107 (assigned to Micromass Limited) and Zhang et al. (See, e.g., Zhang, N., Aebersold, R. and Schwikowski, B. 2002: ProbId: A probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data, Proteomics, 2:1406-1412). In addition, Bafna and Edwards do not use optimal statistics in their scoring process.
Other available scoring systems include Mascot (See, e.g., Pappin, D. J. C., Hojrup, P. and Bleasby, A. J. 1993: Rapid identification of proteins by peptide-mass fingerprinting, Curr. Biol., 3:327-332), Sequest (See, e.g., Eng, J. K., McCormack, A. L. and Yates, J. R. III 1994: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database, J. Am. Soc. Mass Spectrom., 5:976-989, and U.S. Pat. No. 6,017,693), and SONAR MS/MS (available from ProteoMetrics Canada). The latter systems rely on an ad hoc empirical definition of the correlation between an experimental spectrum and a theoretical peptide sequence.
Many authors, such as Anderson et al. (See, e.g., Anderson, D. C., Li, W., Payan, D. G. and Noble, W. S. 2003: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores, J. Proteome Res., 2:137-146), Keller et al. (See, e.g., Keller, A., Nesvizhskii, A. I., Kolker, E. and Aebersold, R. 2002: Empirical statistical model to estimate the accuracy of peptide identification made by MS/MS and database search, Anal. Chem., 74:5385-5392), Moore et al. (See, e.g., Moore, R. E., Young, M. K. and Lee, T. D. 2002: Qscore: An algorithm for evaluating sequest database search results, J. Am. Soc. Mass Spectrom., 13:378-386), and Sadygov et al. (See, e.g., Sadygov, R. G., Eng, J., Durr, E., Saraf, A., McDonald, H., MacCoss, M. J. and Yates, J. 2002: Code development to improve the efficiency of automated MS/MS spectra interpretation, J. Proteome Res., 1:211-215), have recently developed systems to validate Sequest results automatically. The system of Keller et al. (supra) also applies to Mascot. These systems constitute a hybrid category of model-based systems (mainly multivariate statistics) developed on top of heuristic systems. Their performance is generally superior to that of the original heuristic system but far from optimal. Compare Keller et al. (supra) and FIG. 10.