Protein identification is a necessary step in many aspects of biological and medical research. The development of large protein databases has made it possible to identify many otherwise unidentified proteins by comparing information from their analysis, such as their sequences or mass spectra, with information in or from the database. Developments in high-throughput peptide analysis techniques, such as robotic gel band excision and digestion, and matrix-assisted laser desorption/ionization (MALDI) mass spectrometry, have made it possible to collect large volumes of data that characterize large numbers of experimental proteins. Such information can be compared with information in databases of known proteins in order to identify such experimental proteins.
A particularly powerful tool for characterizing and identifying proteins is mass spectrometry (MS), especially when used in conjunction with liquid chromatography (LC). In LC/MS, the peptides of a proteolytically digested protein are first separated by LC. A mass spectrometer then sorts the peptides according to their mass-to-charge ratio (m/z), producing a characteristic spectrum of peaks for the protein. With the use of tandem mass spectrometry (MS/MS), a single peptide of a protein can be selected and subjected to collision-induced dissociation (CID). CID produces fragment ions that are sorted according to their mass-to-charge ratios, producing a characteristic spectrum for the selected peptide. The repeated application of liquid chromatography tandem mass spectrometry (LC-MS/MS) can produce a number of spectra, each characterizing a different peptide.
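As an illustration of the mass-to-charge ratio by which peptides are sorted, the following minimal sketch computes m/z for a peptide sequence. It assumes monoisotopic masses, unmodified residues, and protonation as the only source of charge; the function name `peptide_mz` is hypothetical and chosen only for this example.

```python
# Monoisotopic residue masses in daltons (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per peptide for the free N- and C-termini
PROTON = 1.007276   # mass of each charge-carrying proton


def peptide_mz(sequence: str, charge: int = 1) -> float:
    """Return the m/z of a peptide carrying `charge` protons (assumed model)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral_mass + charge * PROTON) / charge


if __name__ == "__main__":
    for z in (1, 2):
        print(f"PEPTIDE, charge {z}+: m/z = {peptide_mz('PEPTIDE', z):.4f}")
```

Under these assumptions the same peptide appears at a different m/z for each charge state, which is why a spectrum is indexed by m/z rather than by mass alone.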
A protein that has been characterized by methods such as LC-MS/MS can be identified by comparing its experimental data, such as the mass spectra of its peptides, with characteristic data, such as theoretical mass spectra, for peptides of previously identified (“known”) proteins. By comparing the experimental data of an unknown peptide to theoretically derived properties of known peptide sequences, both the unknown peptide and the unknown protein to which it belongs can be identified. Searchable protein databases are available, e.g., at the National Center for Biotechnology Information (NCBI) website ncbi.nlm.nih.gov. They include databases of nucleotide sequence information and amino acid sequence information for proteins.
To evaluate MS/MS data for peptides using a nucleotide or protein sequence database, sequences in the database that represent proteins can be divided into sequences representing the peptides that would result from an actual proteolytic digestion of the proteins. A theoretical spectrum can then be generated for each peptide of a protein represented in the database, based on the sequence of the peptide. The theoretical spectrum includes the mass-to-charge peaks that would be expected if the protein in the database were subjected to MS/MS and the peptide of interest were selected for characterization. Each theoretical peptide spectrum for proteins represented in the database can be compared to observed peptide spectra for an unknown protein. The similarity of the theoretical peptide spectra to the observed spectra of the unknown protein's peptides can then be used to determine the identity of the unknown protein. The SEQUEST and MASCOT search engines implement routines of this kind for protein identification. For additional details on such approaches, see Eng J K, McCormack A L, and Yates J R 3rd, An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J. Am. Soc. Mass Spectrom. 1994, 5: 976-989, which is hereby incorporated by reference in its entirety.
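The steps described above (in-silico digestion, generation of a theoretical spectrum, and comparison with an observed spectrum) can be sketched as follows. This is a simplified illustration, not the SEQUEST or MASCOT algorithm: it assumes trypsin as the protease (cleavage after K or R, except before P) with no missed cleavages, singly charged b and y fragment ions only, and a plain shared-peak count in place of the cross-correlation or probability-based scores those engines use. All function names and the 0.5 m/z tolerance are hypothetical choices for this example.

```python
# Monoisotopic residue masses in daltons (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def theoretical_spectrum(peptide: str) -> list[float]:
    """Return sorted m/z values of singly charged b and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        peaks.append(sum(masses[:i]) + PROTON)          # b ion (N-terminal part)
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y ion (C-terminal part)
    return sorted(peaks)


def shared_peak_count(observed, theoretical, tol=0.5):
    """Count theoretical peaks that have an observed peak within `tol` m/z."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def best_candidate(observed, protein):
    """Return (peptide, score) for the best-scoring peptide of the digest."""
    scored = [(pep, shared_peak_count(observed, theoretical_spectrum(pep)))
              for pep in tryptic_digest(protein)]
    return max(scored, key=lambda item: item[1])


if __name__ == "__main__":
    protein = "AAKGGRPLLKMM"  # made-up sequence for illustration only
    observed = theoretical_spectrum("GGRPLLK")  # stand-in for measured peaks
    print(tryptic_digest(protein))
    print(best_candidate(observed, protein))
```

In this toy case the non-matching peptide AAK still shares one peak with the observed spectrum (the y1 ion of its C-terminal lysine), which shows on a small scale how partial matches can arise by chance.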
Matching the MS/MS fragmentation spectra of proteins against data for peptides derived from databases does not necessarily identify the proteins unambiguously or with 100% confidence. Some spectra may match very closely while others match less closely. A close match may or may not indicate the identity of the unknown peptide. The likelihood of observing a close match by chance can be influenced by a variety of aspects of the comparison and search, including the amount of experimental data, the size of the database, and redundancy in the database. Ideally, the effects of these factors would be evaluated probabilistically and jointly, but an exact analytical expression can be very difficult to derive.
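The influence of database size on chance matches can be illustrated with a deliberately simple model. The sketch below assumes that every candidate peptide in the database independently produces a close match by chance with the same probability; real databases violate this assumption (for example, through redundancy), which is part of why an exact expression is hard to obtain. The function name and the example values are hypothetical.

```python
def chance_match_probability(p_single: float, n_candidates: int) -> float:
    """Probability that at least one of n independent candidates matches by chance.

    Assumes each candidate matches with the same probability p_single and
    that candidates are independent (a simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_candidates < 0:
        raise ValueError("n_candidates must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_candidates


if __name__ == "__main__":
    # Same per-candidate chance-match probability, growing database size.
    for n in (10, 1_000, 100_000):
        print(n, round(chance_match_probability(0.001, n), 4))
```

Under this model a per-candidate chance-match probability that looks negligible for a handful of candidates becomes close to certainty for a large database, so the same match score cannot be interpreted the same way across searches of different size.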
Simple methods for identifying proteins using peptide match data do not account for most of these factors and so are often unreliable or require ad hoc interpretation. For example, a single peptide match could be used to identify the protein from which the peptide was derived, but this approach may not be reliable. Ranking of matches can be used, but this approach may require ad hoc interpretation. For example, a second-best match in one analysis may be a true match indicating identity, whereas the best match in another analysis may be a false match obtained by chance. A multiplicity of peptide matches can be used to assess the identity of a protein, but this approach can share many of the same biases and shortcomings as other simple methods. Ideally, information indicative of matching is evaluated using methods that are objective, robust, and suitable for automation.