Protein identification technology has applications in many fields. In the field of proteomics, for example, the ability to identify proteins in a cell or tissue sample is essential to characterizing the expression and post-translational modification of various proteins, and the presence of and changes in various protein-protein complexes under different physiological conditions. Proteins do most of the work in cells as pumps, motors, enzymes, channels, signal receptors, amplifiers, and gene regulators. A single gene in a eukaryotic organism may give rise to several different proteins through alternative splicing, and each protein may be subject to a myriad of post-translational modifications that control its activity, cellular localization, and protein-protein interactions.
Early in the development of proteomics technology, scientists used 1-dimensional (1-D) gel electrophoresis to study the components of protein complexes, or 2-dimensional (2-D) gel electrophoresis to separate proteins, combined with subsequent mass spectrometry (MS) to identify proteins via the peptides released from them by specific digestion methods. More recently, experimental approaches have also combined liquid chromatography with mass spectrometry (LC-MS). Both the gel approach and LC-MS have allowed the generation of large volumes of MS data that contain the information needed to identify proteins, post-translational modifications thereof, and the members of protein-protein complexes.
Currently, there are several ways to use MS data to identify peptides. One widely used approach is to match the experimental peptide spectra produced by collision-induced dissociation (CID) against calculated theoretical spectra of every peptide in a database, as is done by Sequest (Eng, J. K. et al., J. Am. Soc. Mass Spectrom. 5:976-989, 1994). Other methods, such as the Mascot program (Perkins, D. N. et al., Electrophoresis 20:3551-3567, 1999), the Profound program (Zhang, W. and B. T. Chait, Anal. Chem. 72:2482-2489, 2000), or ProteinProspector (Clauser, K. R. et al., Anal. Chem. 71:2871-2882, 1999), take somewhat different approaches, which can be grouped into three modes. The first, peptide mass fingerprinting, attempts to match the peptide masses measured from a query protein to those deduced from each protein in an amino acid sequence database. The second mode uses the same approach but supplements the masses of the peptides generated from the protein with additional information such as partial sequence, composition, or observed ions. The third mode is an MS/MS spectral matching mode, similar to that used by Sequest. The major drawback of the approaches used by Sequest, Mascot, Profound, ProteinProspector, and similar programs is that they are geared to matching the data from a single protein pair-wise against every protein in a database. This is computationally time-consuming and expensive. A modified approach, called Turbo Sequest (Thermo Electron, San Jose, Calif.), speeds up the process by creating a mass index to limit the range of peptides searched. However, this latter method has limitations in studies of proteins with post-translational modifications.
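The peptide mass fingerprinting mode described above can be illustrated with a minimal sketch. It is not the implementation of Mascot, Profound, or ProteinProspector; it assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, standard monoisotopic residue masses, and a fixed mass tolerance. The function names and the toy protein sequences are hypothetical and serve only to show the pair-wise comparison of a query against every database entry.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, sequence, tol=0.01):
    """Number of observed masses within tol Da of some theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


def rank_proteins(observed, database, tol=0.01):
    """Score the query against every database entry, best match first."""
    scores = {name: count_matches(observed, seq, tol)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical two-entry database; the query masses are simulated from toy_A.
database = {"toy_A": "AAKGGRPLLKMM", "toy_B": "WWKFFRYYK"}
observed = [peptide_mass(p) for p in tryptic_digest(database["toy_A"])]
print(rank_proteins(observed, database))
```

Every query is scored against every entry, so the cost grows with the size of the database, which is the drawback noted above. A precomputed mass index of the kind attributed to Turbo Sequest would narrow the set of candidate peptides before any comparison is made.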
Mass spectral matching methods require a match or near match between an experimental spectrum and one of a wide range of theoretical spectra generated from a database. Spectral matching is computationally demanding because it requires a large number of pair-wise comparisons, each of which involves a large number of calculations. Many of these approaches also depend on the absolute masses of the ions in the MS/MS spectra. As a result, adding any modification to a peptide changes the masses of many of the ion peaks, and the effect of each possible modification on the theoretical spectra must be considered in order to match the shifted experimental peaks to the theoretical spectrum of the modified peptide. Since the number of potential modification sites is large, there is a combinatorial explosion of possibilities, and it is not practical to generate theoretical spectra from amino acid sequence databases with all possible modifications. An approach to improve spectral matching of modified proteins has been taken by Pevzner and co-workers (Pevzner, P. A. et al., Genome Research 11:290-299, 2001). By measuring the deviations between the experimental and theoretical spectral peaks, it is possible to adjust the algorithm for singly modified peptides. However, if the peptide has more than one modification, this method also becomes impractical.
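The dependence on absolute ion masses can be made concrete with a short sketch. It assumes singly charged b and y fragment ions, standard monoisotopic masses, and phosphorylation (+79.96633 Da) as the only modification; the function names and the example peptide are hypothetical and are not taken from any of the cited programs. It shows how a single modification shifts a large share of the theoretical peaks, and how the number of modified forms to enumerate doubles with each additional candidate site.

```python
from itertools import accumulate

# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728
PHOSPHO = 79.96633  # assumed example modification (HPO3 on S, T, or Y)


def fragment_ions(peptide, mods=None):
    """Singly charged b and y ion m/z values.

    mods maps a 0-based residue index to a mass shift in daltons.
    b[i] holds the first i+1 residues; y[i] holds the complementary suffix.
    """
    mods = mods or {}
    masses = [RESIDUE_MASS[aa] + mods.get(i, 0.0)
              for i, aa in enumerate(peptide)]
    prefix = list(accumulate(masses))
    total = prefix[-1]
    b = [p + PROTON for p in prefix[:-1]]
    y = [total - p + WATER + PROTON for p in prefix[:-1]]
    return b, y


def count_shifted_peaks(peptide, mods):
    """How many theoretical b/y peaks move when mods are applied."""
    b0, y0 = fragment_ions(peptide)
    b1, y1 = fragment_ions(peptide, mods)
    return sum(abs(u - m) > 1e-6 for u, m in zip(b0 + y0, b1 + y1))


def count_modified_forms(peptide, sites="STY"):
    """Forms to enumerate if each candidate site is either modified or not."""
    n_sites = sum(aa in sites for aa in peptide)
    return 2 ** n_sites


# One phosphoserine in a hypothetical five-residue peptide.
print(count_shifted_peaks("GASPK", {2: PHOSPHO}))
# Six candidate sites already require many distinct theoretical spectra.
print(count_modified_forms("STYSTY"))
```

With several modification types per site the base of the exponent grows accordingly, which is why precomputing theoretical spectra for every modified form of every database peptide is not practical.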
More effective methods for using mass spectral data to rapidly identify peptides and assign peptides to proteins are needed in the art.