Traditionally, protein sequences were determined by stepwise chemical degradation of purified proteins or fragments thereof. With the advent of sequence databases that contain complete genomic sequences or large numbers of complete or partial expressed gene sequences (expressed sequence tags, ESTs) (Goffeau et al. (1996) Science 274:546-549; Fraser et al. (1997) Nature 390:580-586; Neubauer et al. (1998) Nature Genetics 20:46-50), the sequences of most proteins can be determined by correlating experimental data extracted from the protein with sequence databases (Henzel et al. (1993) Proc. Natl. Acad. Sci. USA 90:5011-5015; Eng et al. (1994) J. Am. Soc. Mass. Spectrom. 5:976-989). The many implemented sequence database searching strategies have in common the use of a combination of specific constraints to narrow a candidate list of matching proteins in a database down to a single protein (Patterson et al. (1995) Electrophoresis 16:1791-1814). Currently, the most restrictive constraints are generated by mass spectrometric (MS) or tandem mass spectrometric (MS/MS) analysis of peptide mixtures after proteolysis of a purified protein or protein mixture with a specific protease.
The constraints provided by collision-induced dissociation (CID) of selected peptides are highly discriminating because CID spectra reflect the amino acid sequence of the peptide analyzed. MS/MS is generally practiced with peptides separated by capillary HPLC or capillary electrophoresis (CE) connected on-line to an electrospray ionization (ESI) MS/MS instrument. Peptides eluting from the separation system are detected by the first-stage mass analyzer, which also automatically selects peptide ions for CID; the resulting fragments are then analyzed in a second mass analyzer. Observed spectra are used to identify the protein from which the peptide originated, either by automated correlation of uninterpreted CID spectra with a sequence database or by searching sequence databases with complete or partial peptide sequences obtained by manual or computer-assisted interpretation of CID spectra (Eng et al. (1994) J. Am. Soc. Mass. Spectrom. 5:976-989; Mann et al. (1994) Anal. Chem. 66:4390-4399, each incorporated herein by reference). The method has the significant advantage that a CID spectrum from a single peptide is sufficient to conclusively identify a protein (Susin et al. (1999) Nature 397:441-446, incorporated herein by reference in its entirety). Consequently, proteins can be identified by correlating CID spectra with databases containing incomplete gene sequences, such as EST databases. Components of protein mixtures can be identified without the need for purification, and proteins can be identified across species, provided that the peptide segment analyzed is conserved between species. The method has the disadvantage that peptide ions need to be sequentially selected for CID out of a mixture of analytes (Ducret et al. (1998) Protein Science 7:706-719). The number of peptides present in a mixture may therefore significantly exceed the number of CID spectra that can be generated in the time available for analysis.
For automated MS/MS operation, the mass spectrometer is generally programmed to give highest priority for CID selection to ions with the highest ion current (Ducret et al. supra). Therefore, if complex peptide mixtures are analyzed, lower-intensity peptide ions will not be selected for CID. This results in an apparent compression of the dynamic range that can be somewhat alleviated, but not eliminated, by extending the peptide analysis time (Goodlett et al. (1993) J. Microcolumn Separations 5:57-62; Davis et al. (1996) J. Am. Soc. Mass. Spectrom. 9:194-201, each incorporated herein by reference in their entirety).
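The intensity-based selection described above can be illustrated with a minimal sketch. The function name select_precursors, the fixed number of selections per cycle, and the survey-scan values are all assumptions made for illustration; they do not describe any particular instrument's acquisition software.

```python
def select_precursors(ions, n):
    """Return the n ions with the highest ion current.

    ions: list of (m/z, ion current) tuples from one survey scan.
    n:    number of CID selections that fit in one cycle (assumed fixed).
    """
    return sorted(ions, key=lambda ion: ion[1], reverse=True)[:n]


# Hypothetical survey scan; m/z and ion-current values are invented.
survey = [
    (445.12, 9.0e6),
    (512.30, 4.0e4),
    (623.84, 2.5e6),
    (701.40, 8.0e3),
    (830.45, 1.1e6),
]

selected = select_precursors(survey, 3)
missed = [ion for ion in survey if ion not in selected]

print("selected for CID:", [mz for mz, _ in selected])
print("never fragmented:", [mz for mz, _ in missed])
```

In this toy scan the two weakest ions are never fragmented, however informative their peptides might be, which is the dynamic range compression referred to above.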
The accurately measured masses of peptides in a protein digest represent a different type of constraint for database searching. Such peptide mass profiles or fingerprints are determined in a single stage of mass spectrometry without CID. The list of observed peptide masses, together with auxiliary constraints including the estimated molecular weight of the unfragmented parent protein and the cleavage specificity of the protease used, is then searched against sequence databases using any one of a number of available algorithms (Henzel et al. (1993) Proc. Natl. Acad. Sci. USA 90:5011-5015; Patterson et al. (1995) Electrophoresis 16:1791-1814, each incorporated herein). Peptide mass mapping identifies proteins without sequence-specific information because the subset of peptide masses created by digestion of a protein with a specific protease defines the N- or C-terminal boundary of each fragment and thus provides a set of constraints unique to a given protein. The more accurately peptide masses are measured and the more peptide masses are detected from the same protein, the more conclusively the protein identity can be determined (Fenyö et al. (1998) Electrophoresis 19:998-1005, incorporated herein by reference). The peptide mass mapping approach has the advantage over the MS/MS strategy that the mass spectrometer operates in full scan mode (i.e., in a single stage) for the duration of the experiment, and should therefore generally provide greater sensitivity. However, the method generally fails to identify the components of protein mixtures because it cannot be determined from which parent protein a specific peptide or set of peptides originated. Peptide mass fingerprinting is also incompatible with searching EST databases because it is unlikely that a sufficient number of peptide masses will match a single EST to provide an unambiguous correlation.
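The peptide mass mapping search described above can be sketched in a few lines, under stated assumptions. The sketch assumes trypsin as the protease (cleavage after K or R, but not before P), no missed cleavages and no modifications. The two database entries, the observed masses and the 5 ppm tolerance are invented for illustration, and the scoring (a simple count of matched masses) is a simplification, not one of the published algorithms cited above.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(seq):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(observed, seq, tol_ppm):
    """Count observed masses that match any theoretical tryptic peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(seq)]
    return sum(
        any(abs(m - t) / t * 1e6 <= tol_ppm for t in theoretical)
        for m in observed
    )


# Hypothetical two-entry database and invented observed masses.
database = {"protA": "MKTAYIAKQRCDEFGHK", "protB": "MSGLNDIFEAQKIEWHER"}
observed = [665.3750, 834.3335]

scores = {name: count_matches(observed, seq, 5.0) for name, seq in database.items()}
best = max(scores, key=scores.get)
print(scores, "best match:", best)
```

The sketch also shows why the approach struggles with mixtures: the observed masses carry no information about which parent protein each peptide came from, so the score is only meaningful when all masses derive from a single protein.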
The present invention describes a class of reagents designated Isotope Distribution Encoded Tags (IDEnTs) and a method using the IDEnT concept for protein identification by accurate mass measurement of a single peptide, combining the strengths of the CID and peptide mass mapping approaches. Recent calculations for proteins expressed by the genomes of E. coli and S. cerevisiae indicate that at 0.1 ppm mass accuracy 96% of the proteins will generate tryptic peptides with a unique mass, suggesting the feasibility of protein identification based on the mass of a single peptide. Inclusion of additional constraints, such as the estimated molecular weight of the parent protein, the cleavage specificity of the protease used to digest the parent protein, and the presence of an uncommon amino acid such as cysteine, methionine or tryptophan in the peptide sequence, further enhances the stringency of the database search. Among these constraints, the presence of cysteine in a peptide sequence is particularly attractive because the sulfhydryl side chain of cysteine residues is chemically distinct among amino acid residues, and its presence significantly constrains the database search while still covering 92% of the open reading frames in yeast (Sechi and Chait (1998) Anal. Chem. 70:5150-5158). To employ this cysteine constraint for protein identification, it is essential that the cysteine-containing peptides be recognized in a peptide mixture. To this end, a cysteine-specific alkylating reagent was synthesized which allows mass spectrometric identification of cysteine-containing peptides by the covalent addition of an isotope-distribution encoded tag or IDEnT (Lundell and Schreitmuller (1999) Anal. Biochem. 266:31-47).
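To give a sense of scale for the mass accuracy and cysteine constraints discussed above, the following minimal sketch converts a ppm tolerance into an absolute mass window and filters a candidate list. The function names, the candidate records and all mass values are invented for illustration; the records are abstract placeholders, not computed from real sequences.

```python
def ppm_window(mass, ppm):
    """Return the (low, high) mass window in Da for a tolerance in ppm."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta


def filter_candidates(candidates, measured, ppm, require_cys):
    """Keep candidates inside the ppm window, optionally only those with cysteine.

    candidates: list of (label, contains_cysteine, theoretical_mass) tuples.
    """
    low, high = ppm_window(measured, ppm)
    hits = [c for c in candidates if low <= c[2] <= high]
    if require_cys:
        hits = [c for c in hits if c[1]]
    return hits


# Hypothetical candidate peptides near 2000 Da (all values invented).
candidates = [
    ("pep1", True, 2000.00010),
    ("pep2", False, 2000.00015),
    ("pep3", True, 2000.00150),
    ("pep4", False, 1999.99000),
]
measured = 2000.00012

low, high = ppm_window(2000.0, 0.1)
print("0.1 ppm at 2000 Da is +/- %.4f Da" % ((high - low) / 2))

for ppm in (1.0, 0.1):
    by_mass = filter_candidates(candidates, measured, ppm, require_cys=False)
    with_cys = filter_candidates(candidates, measured, ppm, require_cys=True)
    print(ppm, "ppm:", [c[0] for c in by_mass], "with cysteine:", [c[0] for c in with_cys])
```

At 0.1 ppm a 2000 Da peptide must be measured to within about 0.0002 Da. In this toy list, mass alone still leaves two candidates at that tolerance, and the cysteine constraint reduces them to one.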
The present invention describes an analytical strategy and the basic chemical concepts necessary to identify proteins in a sequence database from the accurately measured mass/charge of a single peptide using high-resolution mass spectrometry and a sequence constraint. This is achieved by covalently modifying peptides with a reagent that is specific for cysteine-containing peptides and that incorporates a non-native chemical element into the peptide, such that the normal or expected isotope pattern of the peptide is changed. The process encodes the peptide with an isotope-distribution encoded tag (IDEnT) that can be decoded by high-resolution mass spectrometry. Once the IDEnT-labeled peptide is decoded by visual inspection or by computer algorithm analysis, the parent protein identity can be determined by searching sequence databases with the accurately measured peptide mass and the cysteine constraint.
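How a non-native element can change an expected isotope pattern may be illustrated with a rough numerical sketch. The choice of chlorine, the number of chlorine atoms, the 50-carbon peptide and the carbon-only approximation of the peptide's isotope pattern are all assumptions made for illustration; the text above does not specify the element used in the tag. Abundances are approximate natural values, and peaks are binned at nominal mass spacing.

```python
def convolve(a, b):
    """Discrete convolution of two isotope patterns (lists of probabilities)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def element_pattern(abundances, count):
    """Isotope pattern for `count` atoms of one element.

    abundances[k] is the natural abundance of the isotope with k extra neutrons.
    """
    pattern = [1.0]
    for _ in range(count):
        pattern = convolve(pattern, abundances)
    return pattern


CARBON = [0.9893, 0.0107]          # 12C, 13C (approximate)
CHLORINE = [0.7576, 0.0, 0.2424]   # 35Cl, (no 36Cl), 37Cl (approximate)

# Carbon-only approximation of an untagged peptide with 50 carbon atoms.
untagged = element_pattern(CARBON, 50)
# Same peptide carrying a hypothetical tag with two chlorine atoms.
tagged = convolve(untagged, element_pattern(CHLORINE, 2))

for name, pattern in (("untagged", untagged), ("tagged", tagged)):
    ratios = [pattern[k] / pattern[0] for k in range(5)]
    print(name, ["%.3f" % r for r in ratios])
```

Under these assumptions the M+2 peak of the tagged species is a large fraction of the monoisotopic peak, whereas for the untagged species it is comparatively small. A decoding step, whether visual or algorithmic, would look for a shift of this kind to recognize tagged peptides in a mixture.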
The IDEnT concept can be used as described above in the field of proteomics for the rapid identification of proteins. However, in another embodiment of the present invention, IDEnT labels can be incorporated into chemical reagents with target specificities for functional groups present in any class of chemical compound including, but not limited to, deoxyribonucleic acid (DNA), ribonucleic acid (RNA), carbohydrates, lipids, proteins, surfactants, detergents and common polymers such as high-density polyethylene. The non-native isotopic tag allows the IDEnT-labeled analyte to be selectively detected in a mixture without prior knowledge of its m/z ratio or molecular weight. The approach uses known chemical reactivities or enzymatic activities to direct the IDEnT label with high specificity to functional groups of interest known to be present in a given class of analyte. In addition, chemical compounds can be designed so that the IDEnT is incorporated during synthesis of a compound rather than after synthesis through selective chemical reaction with an IDEnT reagent. The selective incorporation of IDEnTs into biomolecules is expected to find wide application in the analysis of mixtures where detection and isolation of analytes with specific structural features can be accomplished using high-resolution mass spectrometry and/or tandem mass spectrometry.