The growing number of essentially complete genome sequences now available allows global identification of proteins that respond to specific physiological conditions, enabling an understanding of cellular pathways and networks. This research is reviewed in the publications that follow (these and all other papers, references, patents, or other published materials cited or referenced herein are hereby incorporated herein in their entirety by this reference):
Wilkins, M. R., Williams, K. L., Appel, R. D., Hochstrasser, D. F. Eds., “Proteome Research: New Frontiers in Functional Genomics,” Springer, Berlin, Germany, 1997.
Devine, K. M., Wolfe, K. Trends Genet 1995, 11, 429-431.
Uddhav, K., Ketan, S., Mol. Bio. Rep. 1998, 25, 27-43.
Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology, Science, 1998, 282, 2012-2018.
Adams, M. D., Bioassays 1996, 18, 261-262.
Anderson, L., Seilhammer, J., Electrophoresis 1997, 18, 533-537.
Resources on the internet which provide access to these sequences include http://www.ebi.ac.uk/research/cgg/genomes.html and http://www.ncbi.nlm.nih.gov/Entrez/Genome/main_genomes.html
Proteome analyses using either two dimensions (as shown in Washburn, M. P.; Wolters, D.; Yates, J. R. Nat. Biotechnol. 2001, 19, 242-247) or one dimension (as shown in Shen, Y.; Zhang, R.; Moore, R. J.; Kim, J. K.; Metz, T. O.; Hixson, K. K.; Zhao, R.; Livesay, E. A.; Udseth, H. R.; Smith, R. D. Anal. Chem. 2005, 77, 3090-3100) of liquid chromatography (LC) with tandem mass spectrometry (MS/MS) have become an important tool for protein identification because they can rapidly identify the proteins in complex mixtures with high sensitivity and limited bias. The dominant “bottom-up” approach described in these references identifies enzymatically produced (e.g., tryptic) peptides and infers the parent protein, relying on the sequence-related nature of peptide ion fragmentation and on automated database searches (i.e., comparison of MS/MS spectra with theoretical spectra predicted from peptide sequence information, as described in Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989).
The challenges associated with accurate identification in “bottom-up” proteomics are significant due to the complexity of the peptide mixtures. For example, a genome coding for ~5,000 proteins can potentially produce >250,000 tryptic peptides.
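The scale of this problem can be illustrated with a minimal in silico digestion sketch. The cleavage rule used here (cleave after K or R unless the next residue is P) is a commonly used simplification and is an assumption, as are the function names and the example sequence; none of them is taken from the cited references.

```python
def cleavage_fragments(protein):
    """Split a protein after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing residues after the last cleavage site.
        fragments.append(protein[start:])
    return fragments


def tryptic_digest(protein, missed_cleavages=0):
    """Return all peptides containing up to `missed_cleavages` uncleaved sites."""
    fragments = cleavage_fragments(protein)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Hypothetical toy sequence, for illustration only.
toy_protein = "MKAARPLKGGR"
fully_cleaved = tryptic_digest(toy_protein)
with_one_missed = tryptic_digest(toy_protein, missed_cleavages=1)
```

Under this rule, >250,000 peptides from ~5,000 proteins corresponds to an average of more than 50 peptides per protein, and allowing missed cleavages increases the count further.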
At the same time, advances in instrumentation allow very large numbers of MS/MS spectra to be generated rapidly. While large amounts of data can easily be produced, the data suffer from variations in spectrum quality and protein abundance, sequence-specific differences in dissociation pathways, contributions from modified peptides and contaminants, and limitations on mass measurement accuracy. The reasons for these variations are described in the following publications:
Huang, Y.; Triscari, J. M.; Tseng, G. C.; Pasa-Tolic, L.; Lipton, M. S.; Smith, R. D.; Wysocki, V. H. Anal. Chem. 2005, 77, 5800-5813.
Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50.
Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.
Strittmatter, E. F.; Kangas, L. J.; Petritis, K.; Mottaz, H. M.; Anderson, G. A.; Shen, Y.; Jacobs, J. M.; Camp, D. G.; Smith, R. D. J. Proteome Res. 2004, 3, 760-769.
The foregoing problems can result in significant levels of false positive identifications. In Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50 and Shen, Y.; Kim, J.; Strittmatter, E. F.; Jacobs, J. M.; Camp, D. G.; Fang, R.; Tolic, N.; Moore, R. J.; Smith, R. D. PROTEOMICS, 2005, 5, 4034-404, reverse-database and cross-database methods are proposed for evaluating peptide false identification rates. However, these methods generally fail to define clear boundaries between true/false positive/negative identifications, and the rates of false positive identifications increase approximately linearly with the number of possible peptides in the theoretical database for the system being studied (i.e., approximately with the size of the proteome). Furthermore, these methods treat the peptides as a population and do not provide any information about individual peptides. As a result, raising the database matching score criteria decreases the rate of false positive identifications, but simultaneously increases the probability of missed peptide identifications (i.e., a higher false negative rate).
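The trade-off described above can be sketched with a reverse-database estimate. The example below assumes a search against a concatenated forward/reverse database and uses one commonly used estimator, 2 × (reverse hits) / (total hits above threshold); the function names, the `REV_` prefix, and the scores are hypothetical and serve only as illustration.

```python
def reverse_database(proteins):
    """Build decoy entries by reversing each forward protein sequence."""
    return {"REV_" + name: seq[::-1] for name, seq in proteins.items()}


def fdr_at_threshold(matches, threshold):
    """Estimate the false positive rate among matches scoring >= threshold.

    `matches` is a list of (score, is_reverse_hit) pairs from a search
    against a concatenated forward + reverse database.
    Returns (estimated_rate, number_of_forward_hits_accepted).
    """
    forward = sum(1 for score, rev in matches if score >= threshold and not rev)
    reverse = sum(1 for score, rev in matches if score >= threshold and rev)
    total = forward + reverse
    rate = 2.0 * reverse / total if total else 0.0
    return rate, forward


# Made-up scores for illustration; True marks a hit to a reversed sequence.
matches = [(4.5, False), (4.1, False), (3.8, False), (3.2, True),
           (3.0, False), (2.7, False), (2.5, True), (2.1, False),
           (1.9, True), (1.5, True)]

loose = fdr_at_threshold(matches, 2.0)   # more identifications, higher rate
strict = fdr_at_threshold(matches, 3.5)  # lower rate, fewer identifications
```

Note that the estimate describes the accepted population as a whole: it does not say which individual forward hits are the false ones.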
Recently, in Craig, R.; Cortens, J. P.; Beavis, R. C., “The use of proteotypic peptide libraries for protein identification” Rapid Communications in Mass Spectrometry 2005, 19, (13), 1844-1850 and Kuster, B.; Schirle, M.; Mallick, P.; Aebersold, R., “Scoring proteomes with proteotypic peptide probes” Nature Reviews Molecular Cell Biology 2005, 6, (7), 577-583 the term proteotypic was coined for those peptides in a protein sequence that are most likely to be confidently observed by a specific MS-based proteomic method. Knowledge of the proteotypic peptides is very important in proteomics because it can increase throughput and allow quantitative results to be obtained. It has even been suggested that accurate knowledge of the proteotypic peptides could lead to a paradigm shift in how proteomics is performed. Indeed, Beavis and co-workers showed that knowledge of the proteotypic peptides can decrease the identification calculations by as much as 20-fold. Furthermore, Aebersold and co-workers indicated that once the proteotypic peptides of an organism are known, it is possible to synthesize heavy-isotope-labeled versions of these peptides, spike them into the sample of interest, and obtain absolute quantitative data. In Le Bihan, T.; Robinson, M. D.; Stewart, I. I.; Figeys, D., “Definition and characterization of a “trypsinosome” from specific peptide characteristics by Nano-HPLC-MS/MS and in silico analysis of complex protein mixtures” J. Proteome Res. 2004, 3, 1138-1148, Le Bihan et al. showed that knowledge of the proteotypic peptides can help identify low-abundance proteins through inclusion lists that target the parent ions of these peptides.
The assumption that every peptide has an equal likelihood of detection and identification by LC-MS is not supported by the experience of scientists engaged in proteomic research. Indeed, even in cases where only one pure standard protein is digested by trypsin and analyzed by LC-MS, protein coverage of 100% is rarely observed. Differences in which peptides are observed, and the failure to observe certain peptides in any particular experiment, are generally the result of the specifics of the proteomic platform used for a particular experiment.
As used herein, a “proteomic platform” refers to the combination of the steps commonly performed to identify peptides in proteomic research: sample preparation, sample simplification, mass spectrometry, and the application of bioinformatic tools to the resulting data. As will be recognized by those having skill in the art, significant variations in the specifics of each of these steps exist, and these variations will have a significant impact on which peptides are observed.
Differences in sample preparation, such as the protein extraction methodology and/or the denaturing agents used for digestion, will lead to different peptides being observed. For example, while trypsin is perhaps the most commonly used enzyme in sample preparation, it is just one of the enzymes that may be used for the digestion of the proteins. Different chemical or enzymatic denaturing agents will affect which peptides present in the sample are ultimately observed. The nature of the solid phase extraction used for cleanup purposes will also affect which peptides are observed.
Sample simplification refers to the pre-fractionation and/or separation techniques commonly used for protein/peptide simplification before analysis by MS. For example, it is common to separate peptides by reversed phase liquid chromatography (RPLC) before analysis by mass spectrometry. During this step, very hydrophilic peptides might not be retained on the column and will elute in the void volume, while highly hydrophobic peptides can be bound irreversibly to the stationary phase. In both cases, these peptides will generally not be detected. Furthermore, it is also very common, especially in cases of very complex proteomes, to perform a peptide pre-fractionation using strong cation exchange (SCX) before RPLC-MS. While this approach generally reveals more peptides overall, there are several classes of peptides that might bind irreversibly to the SCX and never reach the RPLC-MS. These peptides could otherwise be detected by a simple RPLC-MS analysis if the pre-fractionation had not been used.
With respect to mass spectrometry, the reasons peptides are not identified include the inability of certain peptides to ionize into the gas phase in sufficient quantities to give a detectable signal and/or to give interpretable MS/MS spectra (in the case of MS/MS experiments). For example, it has been widely observed in proteomics that peptides from the same protein produce a range of detected intensities, with a fraction of the peptides from each protein falling below the detection limit. It should be understood that, as the term is used herein, “mass spectrometry” includes all of the different ionization techniques used, including but not limited to ESI, MALDI, and APCI, either alone or in combination, and the different fragmentation techniques used, including but not limited to CID, ETD, and ECD, either alone or in combination. Differences in both the ionization techniques and the fragmentation techniques will lead to differences in the observed peptides.
Finally, the same LC-MS/MS data analyzed with different parameters and by different informatics tools result in somewhat different peptide identifications, even when normalized to the same rate of false positives. This is because of the differences in the scoring schemes used by different tools to interpret mass spectra. For example, MS/MS peptide identification software such as SEQUEST, Spectrum Mill, Mascot, and X-Tandem give a peptide identification overlap of only about 70% for the same LC-MS/MS analyses and the same false positive rate. This means that for otherwise identical proteomic platforms (i.e., same biological sample, same sample preparation, same sample simplification technique, same ionization and same mass spectrometer analyzer) that differ only in the data analysis software (e.g., SEQUEST versus Mascot), only about 700 identical peptides will be identified by both software tools for every 1000 total peptides identified.
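The overlap figure can be made concrete with a short sketch. It assumes that "total peptides identified" means the union of the peptides reported by the two tools, which is one possible reading of the statement above; the identifier sets are synthetic and do not come from any real search result.

```python
def identification_overlap(ids_a, ids_b):
    """Return (shared, total, fraction) for two sets of peptide identifications.

    `total` is the number of distinct peptides reported by either tool.
    """
    a, b = set(ids_a), set(ids_b)
    shared = len(a & b)
    total = len(a | b)
    fraction = shared / total if total else 0.0
    return shared, total, fraction


# Synthetic identifiers standing in for peptide sequences.
tool_a = {f"pep{i}" for i in range(0, 850)}     # peptides reported by tool A
tool_b = {f"pep{i}" for i in range(150, 1000)}  # peptides reported by tool B

shared, total, fraction = identification_overlap(tool_a, tool_b)
```

In this synthetic case each tool reports 850 peptides, 700 of which are common to both, out of 1000 distinct peptides overall.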
These and other difficulties associated with identifying peptides in proteomic platforms have resulted in the development of various techniques that attempt to predict whether a peptide will be accurately identified. For example, in Le Bihan, T.; Robinson, M. D.; Stewart, I. I.; Figeys, D., “Definition and characterization of a “trypsinosome” from specific peptide characteristics by Nano-HPLC-MS/MS and in silico analysis of complex protein mixtures” J. Proteome Res. 2004, 3, 1138-1148 the authors describe a model that can predict whether a peptide will be identified in a proteomic platform. This study used three peptide physicochemical properties to generate the prediction: hydrophobicity, isoelectric point (pI), and peptide length. In Ethier, M.; Figeys, D., “Strategy to design improved proteomic experiments based on statistical analyses of the chemical properties of identified peptides” Journal of Proteome Research 2005, 4, (6), 2201-2206 the authors extended Le Bihan's algorithm by using different weights for the hydrophobicity, pI and length of the peptides. They further trained the weights with data from 13 different proteomic platforms and used clustering analysis to group the platforms into different groups. Finally, Kuster et al. (Kuster, B.; Schirle, M.; Mallick, P.; Aebersold, R., “Scoring proteomes with proteotypic peptide probes” Nature Reviews Molecular Cell Biology 2005, 6, (7), 577-583) mentioned a computational approach for the identification of proteotypic peptides based on 500 different peptide physicochemical properties.
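As a rough illustration of the three properties named above, the following sketch computes peptide length, an average hydrophobicity, and an approximate pI. It is not the model of Le Bihan et al. or of Ethier and Figeys: the Kyte-Doolittle hydropathy scale, the approximate side-chain and terminal pKa values, and all function names are assumptions chosen for illustration, and no prediction weights are supplied because the source gives none.

```python
# Kyte-Doolittle hydropathy values (assumed scale; others exist).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

# Approximate pKa values (assumed; published sets differ slightly).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6


def hydrophobicity(peptide):
    """Mean Kyte-Doolittle hydropathy over all residues."""
    return sum(KD[aa] for aa in peptide) / len(peptide)


def net_charge(peptide, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch relation."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for aa in peptide:
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(peptide):
    """Find the pH of zero net charge by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if net_charge(peptide, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


def peptide_features(peptide):
    """Collect the three properties into one feature record."""
    return {"length": len(peptide),
            "hydrophobicity": hydrophobicity(peptide),
            "pI": isoelectric_point(peptide)}


features = peptide_features("AARPLK")  # hypothetical example peptide
```

A predictor of the kind described would then combine such features with platform-specific weights learned from observed identifications.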
Drawbacks and limitations associated with these and other methods create a need for improved methods and techniques for predicting whether peptides will be detected in mass spectrometric analysis that simultaneously decrease the rate of false positive identifications and decrease the rate of false negative identifications.