1. Field of the Invention
The present invention relates generally to proteomics. More specifically, the present invention relates to using liquid chromatography in combination with mass spectrometry to identify and quantify proteins and peptides in a complex mixture, as well as to identify and quantify molecules in a complex mixture that produce precursor and fragment ions in a mass spectrometer. Further, the present invention relates to using liquid chromatography in combination with mass spectrometry to track peptides in a complex mixture by retention time. More importantly, the present invention provides a method of peptide identification that does not require the presence of a precursor ion mass, thereby enabling the method to identify both chemically and post-translationally modified versions of peptides, allelic differences, and peptides containing point mutations, as well as any other modifications of sequences deposited in the database being queried.
2. Background of the Invention
Proteomics generally refers to studies involving complex mixtures of proteins. The field of proteomics includes studying and cataloging proteins in biological systems. Proteomic studies typically focus on identification of proteins, determination of changes in relative abundance among different conditions, or both. Identification and quantification of proteins in complex biological samples is a fundamental problem in proteomics.
Liquid chromatography coupled with mass spectrometry (LC/MS) has become a fundamental tool in proteomic studies. Separation of intact proteins or of their proteolyzed peptide products by liquid chromatography (LC) and subsequent analysis by mass spectrometry (MS) forms the basis of many common proteomic methodologies. Methods that measure changes in the expression level of proteins are of great interest as they can form the basis of biomarker discovery and clinical diagnostics.
In conventional proteomic studies, proteins of interest typically are first digested to produce a specific set of proteolytic peptides rather than studying the intact proteins directly. The resulting peptides are then characterized during the proteomic analysis. A common enzyme used for such digestion is trypsin. In tryptic digestion, the proteins present in the complex mixture are cleaved to produce peptides as determined by the cleavage specificity of the proteolytic enzyme. From the identity and concentration of the observed peptides, algorithms known in the art can infer the identity and the concentration of the parent proteins.
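The cleavage step described above can be illustrated with a minimal in-silico digest. The sketch below assumes the commonly used simplified trypsin rule (cleave on the C-terminal side of lysine or arginine unless the next residue is proline) and ignores missed cleavages. The function name `tryptic_digest` and the example sequence are hypothetical and are not taken from the source.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence into fully cleaved tryptic peptides.

    Assumed rule (a simplification): cleave after K or R,
    except when the following residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep any C-terminal remainder that does not end in K or R.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


# Toy sequence, for illustration only.
print(tryptic_digest("AAKRPGGRTTK"))  # ['AAK', 'RPGGR', 'TTK']
```

Real digests also produce peptides with missed cleavages, which a practical search would typically enumerate as well.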
In LC/MS analysis, the peptide digest is separated and analyzed by on-line, liquid chromatographic (LC) separation followed by on-line mass spectrometric (MS) analysis. Ideally, the mass of a single peptide, measured with sufficient accuracy, is sufficient to uniquely identify the peptide. In practice, however, achieved mass accuracies typically are on the order of 10 ppm or larger. In general, such mass accuracy is not sufficient to uniquely identify a peptide based upon mass measurement alone. For example, in the case of a mass accuracy of 10 ppm, on the order of 10 peptide sequences are identified in typical database searches. This number of sequences would increase significantly if the search constraints on mass accuracy were relaxed to allow for chemical or post-translational modifications, losses of H2O or NH3, point mutations, etc. Sequence repositories typically contain translated DNA sequences that have been annotated by homology to a known substrate. Thus, if a peptide's sequence is modified by either a deletion or a substitution, then a tentative identification of that peptide by precursor mass alone must be false.
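To make the 10 ppm figure concrete, the following sketch converts a relative mass accuracy in parts per million into an absolute mass window. The helper name `ppm_window` and the 1500 Da example mass are assumptions chosen for illustration only.

```python
def ppm_window(mass: float, ppm: float) -> tuple[float, float]:
    """Return the (low, high) mass bounds for a tolerance given in ppm."""
    delta = mass * ppm / 1e6  # absolute half-width in the same units as mass
    return mass - delta, mass + delta


# A hypothetical 1500 Da peptide measured to 10 ppm has a half-width
# of 1500 * 10 / 1e6 = 0.015 Da on either side of the measured mass.
low, high = ppm_window(1500.0, 10.0)
print(f"{low:.3f} to {high:.3f} Da")
```

Any database peptide whose theoretical mass falls inside this window is indistinguishable from the measured precursor by mass alone.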
Furthermore, two peptides can have the same amino acid composition but have different sequences. Mass accuracy alone is not sufficient to distinguish between peptides that differ in sequence but not in composition. Fragmentation techniques are known that cause peptides to break into fragment ions. These fragments can correspond to a subsequence of the original peptide, but other types of fragment ions may be observed. Fragment masses seen in the data can be used to confirm or deduce the precursor's sequence.
In the case of peptide precursors, subsequences can arise by fragmentation at a single peptide bond in the precursor. Such fragmentation results in two subsequences. The fragment containing the peptide's C-terminus, if ionized, is termed a Y-ion, and the fragment containing the peptide's N-terminus, if ionized, is termed a B-ion.
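The B-ion and Y-ion series described above can be computed from standard monoisotopic residue masses. The sketch below assumes singly protonated fragments and unmodified residues. The function names `peptide_mass` and `b_y_ions` are hypothetical, and the example sequences are arbitrary.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(seq: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in seq) + WATER


def b_y_ions(seq: str) -> tuple[list[float], list[float]]:
    """Singly charged b- and y-ion m/z values, one pair per peptide bond.

    b[k] holds the N-terminal fragment and y[k] the complementary
    C-terminal fragment produced by cleaving after residue k + 1.
    """
    b, y = [], []
    for i in range(1, len(seq)):
        b.append(sum(MONO[aa] for aa in seq[:i]) + PROTON)
        y.append(sum(MONO[aa] for aa in seq[i:]) + WATER + PROTON)
    return b, y


# Same composition, different sequence: equal precursor mass,
# but the fragment series differ.
print(abs(peptide_mass("GASK") - peptide_mass("AGSK")) < 1e-9)
print(b_y_ions("GASK")[0][0] != b_y_ions("AGSK")[0][0])
```

Each complementary pair satisfies b + y = M + 2 × (proton mass), where M is the neutral peptide mass, which offers a quick consistency check on the two series.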
Known protein identification techniques search databases using accurate mass retention time (AMRT) data of precursors and fragments obtained from LC/MS experiments. For example, one way of obtaining such data is described in U.S. Pat. No. 6,717,130 to Bateman (“Bateman”), which is hereby incorporated by reference in its entirety. In Bateman, such data can be obtained using a high- and low-energy switching protocol applied as part of an LC/MS analysis of a single injection of a peptide mixture. In such data the low-energy spectra contain ions primarily from unfragmented precursors, while the high-energy spectra contain ions primarily from fragmented precursors.
To identify the presence of a protein in such data, an AMRT (empirically describing those ions from a peptide or from a fragment) is selected from the low-energy data. If trypsin is used in the digest, this AMRT is presumed to be a tryptic precursor. Using the AMRT data, known methods search a database of peptide masses for tryptic peptides whose masses lie within a mass search window or threshold.
If a theoretical peptide mass from a database lies within a mass search window of the mass of a precursor measured in the data, it is deemed a hit. That is, the precursor in the data is hit by the peptide in the database; or alternatively the peptide in the database is hit by the precursor in the data.
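The hit test described above amounts to a range query over theoretical peptide masses. The following is a minimal sketch, assuming the database has been reduced to a list of `(mass, sequence)` pairs sorted by mass. The function name `find_hits`, the placeholder peptide labels, and all masses shown are invented for illustration and do not come from the source.

```python
from bisect import bisect_left, bisect_right


def find_hits(measured_mass: float,
              sorted_db: list[tuple[float, str]],
              ppm: float) -> list[str]:
    """Return database peptides whose mass lies within the ppm window
    around the measured precursor mass. sorted_db must be sorted by mass."""
    delta = measured_mass * ppm / 1e6
    masses = [mass for mass, _ in sorted_db]
    lo = bisect_left(masses, measured_mass - delta)
    hi = bisect_right(masses, measured_mass + delta)
    return [seq for _, seq in sorted_db[lo:hi]]


# Toy database with placeholder labels and made-up masses.
toy_db = [(1000.000, "pep_A"), (1000.005, "pep_B"), (1000.100, "pep_C")]
print(find_hits(1000.002, toy_db, ppm=10.0))  # ['pep_A', 'pep_B']
```

Sorting once and using binary search keeps each precursor lookup logarithmic in the database size; a production search would build the mass list once rather than on every call.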
The search results in a hit-list of possible matching peptides from the database. These possible matching database peptides may or may not be weighted by statistical factors. The possible outcomes of such a search are that no possible matching database peptide is identified, that one possible matching database peptide is identified, or that more than one possible matching database peptide is identified. The higher the resolution of the MS, assuming proper instrument calibration, the smaller the ppm threshold, and consequently, the fewer the false identifications.
If there are one or more hits to the theoretical peptides in the database, conventional searches then use data from high-energy AMRTs to validate a possible matching database peptide. High-energy AMRTs are first searched to isolate those high-energy AMRTs that occur at the same retention time as the low-energy AMRT being validated. Typically, the high-energy AMRTs that are isolated are those whose retention times are substantially the same as the retention time of the low-energy AMRT being validated.
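The retention-time isolation step can be sketched as a simple filter. The source states only that the retention times are "substantially the same"; the explicit tolerance `rt_tol`, the `AMRT` record layout, and the numeric values below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AMRT:
    """Hypothetical accurate-mass retention-time record."""
    mass: float  # daltons
    rt: float    # retention time, minutes


def coeluting(precursor: AMRT, high_energy: list[AMRT],
              rt_tol: float) -> list[AMRT]:
    """Keep high-energy AMRTs whose retention time lies within rt_tol
    of the low-energy precursor's retention time."""
    return [h for h in high_energy if abs(h.rt - precursor.rt) <= rt_tol]


# Made-up example: only the first two fragments co-elute with the precursor.
precursor = AMRT(mass=1500.0, rt=32.10)
fragments = [AMRT(400.2, 32.11), AMRT(650.3, 32.08), AMRT(512.3, 35.40)]
print([f.mass for f in coeluting(precursor, fragments, rt_tol=0.05)])
# [400.2, 650.3]
```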
For each database peptide on the hit list, the algorithm determines the masses of all possible Y-ions and B-ions that can be obtained through collision-induced dissociation of the precursor. The isolated high-energy AMRT data is then searched for each of these Y- and B-ions. The peptide sequence having the greatest number of hits, or satisfying other criteria, is returned as the correct hit, i.e., the identity of the target precursor. This result can be stored and displayed.
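The validation step above can be sketched as a hit count per candidate. This minimal example assumes that the theoretical Y- and B-ion masses of each candidate have already been computed and that "greatest number of hits" is the only selection criterion; the source notes that other criteria may be used. The function names, candidate labels, and masses are hypothetical.

```python
def count_fragment_hits(theoretical: list[float],
                        observed: list[float],
                        ppm: float) -> int:
    """Count theoretical fragment masses matched by at least one
    observed high-energy mass within the ppm tolerance."""
    hits = 0
    for t in theoretical:
        tol = t * ppm / 1e6
        if any(abs(o - t) <= tol for o in observed):
            hits += 1
    return hits


def best_candidate(candidates: dict[str, list[float]],
                   observed: list[float],
                   ppm: float) -> tuple[str, dict[str, int]]:
    """Return the candidate with the most fragment hits, plus all scores."""
    scores = {seq: count_fragment_hits(frags, observed, ppm)
              for seq, frags in candidates.items()}
    return max(scores, key=scores.get), scores


# Made-up fragment masses for two candidates on a hit list.
candidates = {
    "cand_1": [200.0, 300.0, 400.0],
    "cand_2": [200.0, 350.0, 450.0],
}
observed = [200.001, 300.002, 400.0]
print(best_candidate(candidates, observed, ppm=20.0))
# ('cand_1', {'cand_1': 3, 'cand_2': 1})
```

Ties are resolved here by dictionary order, which is arbitrary; a practical implementation would need an explicit tie-breaking or confidence rule.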
This process can be repeated for each low-energy AMRT in the digestion mixture. Further analysis can be performed on the results, including storing the results, displaying the results, quantitation, and combining the results with those of other injections.
During the search, multiple charge states and multiple isotopes can be searched. In addition, the ions, or the charge-reduced AMRTs, could be searched. Further, empirically produced confidence rules can be applied to help identify valid hits, and better confidence is obtained with a higher number of high-energy hits.
In summary, given a set of data acquired by an LC/MS system, known protein identification techniques search a database of theoretical protein sequences to identify the proteins in the data. That is, known protein identification techniques start with the data and search a database. The invention described below, in contrast, starts with the database and searches the data.