Proteins are the fundamental biological units of cell structure and are formed from linear sequences of amino acids linked together by peptide bonds. This primary amino acid sequence determines the three-dimensional characteristics and the function of the protein. There are twenty common amino acids, each with an amino group, a carbon atom with a unique side chain, and a carboxyl group. During mRNA translation on ribosomes, the peptide bond backbone of a protein is sequentially formed by bonds linking the terminal carboxyl group of one amino acid to the N-terminal amino group of the subsequent amino acid. The resulting linear chain of various amino acids has a first amino acid, the N-terminal amino acid with an amino group, and a final amino acid, the C-terminal amino acid, with a carboxyl group. Although proteins vary in length from a few amino acids for peptide hormones to over 1500 amino acids, most proteins are about 100 to 300 amino acids long.
Because the structure of proteins is directly related to ultimate physiological function, determining the amino acid sequence of proteins has long been a basic endeavor in biomedical research and medicine. Traditionally, amino acid analysis involved determining the relative percentage of each amino acid present in a digestion of a purified protein and determining the identity of individual peptide residues using laboratory chemistry. Protein sequencing was a laborious effort involving enzymatic digestions of a large amount of a purified protein into peptide fragments, followed by Edman degradations and alignment of overlapping sequences. Currently, reflecting the growing need for more accurate methods of protein sequencing, tremendous advances have been made in protein sequencing using mass spectrometry (MS). DNA genome sequencing, computer informatics, and sensitive protein analysis methodologies using MS are interfacing with classical protein chemistry to greatly advance the emerging field of scientific research known as proteomics.
Proteomics is the field of protein research that studies the large scale or global analysis of the protein complement of an organism (Aebersold and Mann, 2003, Nature 422:198). Proteomics is uniquely important in research, diagnostic, and clinical applications because it relates information from various technical disciplines, including chemistry, genetics, cell imaging, and chip- or microarray-based protein or DNA analyses, to cell function and physiology. In practice, proteomics requires detailed analyses of complex data for a large number of proteins in a short time period. Parameters of protein analysis include not only primary amino acid sequence, but also deletions, splice rearrangements, polymorphisms, mutations, substitutions, and post-translational modifications (PTMs), such as phosphorylation, acetylation, nitration, sulfonation, oxidation, methylation, glycosylation, and cross-linking. High throughput analysis of proteins and their related forms is critical for research in biology, physiology, and medicine and can be used in clinical diagnostic applications.
Mass spectrometry (MS) is a potentially valuable tool in proteomics because highly sensitive measurements of mass can identify some proteins by their amino acid sequence (Aebersold and Goodlett, Chem. Rev. 101: 269-295, 2001; reviewed in Mann, et al., 2001, Ann. Rev. Biochemistry 70:437; Kinter and Sherman, Protein Sequencing and Identification Using Tandem Mass Spectrometry, Wiley, N.Y., 2000). Because each amino acid or chain of amino acid residues can theoretically be detected by an accurate measurement of its mass, a sufficiently accurate measurement of mass allows the identification of the individual amino acids. When the sample processing and MS techniques are highly accurate, the actual sequence of amino acids that form a polypeptide molecule can be determined. Further, if a highly accurate and reliable method detects a deviation from the known mass for an amino acid, this can indicate that the amino acid has been modified, thus allowing detection of the modifications to protein structure described above that are often highly important in proteomics research, such as deletions, splice rearrangements, polymorphisms, mutations, substitutions, and post-translational modifications.
Mass spectrometry (MS) involves the analysis of ionized analytes in a gas phase using an ion source that ionizes the analyte, a mass analyzer that measures the mass-to-charge (m/z) ratio of the ionized analytes, and a detector that registers the number of ions at each m/z value. The MS apparatus may also be coupled to separation techniques to improve the ability to analyze complex mixtures. Further, MS instrument combinations can be made to enhance sensitivity and selectivity. A wide range of MS instruments is available for use in protein sequencing. Regarding the ion source, electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI) are two commonly used techniques to ionize the proteins or peptides for analysis. ESI ionizes the analytes from a solution, whereas MALDI desorbs and ionizes the sample using a “matrix” that encourages desorption and ionization when exposed to light energy. MALDI produces predominantly singly charged ions from peptides. As described in more detail below, tandem MS/MS is a technique that uses at least two MS components and is a commonly used methodology for MS analysis of polypeptides.
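As an illustration of the mass-to-charge ratio measured by the mass analyzer, the following minimal sketch computes the m/z value expected for a positively charged ion formed by protonation, as is typical for peptides in ESI and MALDI. The function name `mz` and the example mass are assumptions chosen for illustration only; the proton mass is the standard value.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons (standard value)


def mz(neutral_mass: float, charge: int) -> float:
    """Return m/z for a molecule of the given neutral monoisotopic mass
    carrying `charge` extra protons (positive-ion mode)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


if __name__ == "__main__":
    # A hypothetical 1000.0 Da peptide observed in charge states 1 to 3.
    for z in (1, 2, 3):
        print(z, round(mz(1000.0, z), 4))
```

A singly charged ion, as predominantly produced by MALDI, appears roughly one dalton above the neutral mass, whereas a multiply charged ion of the same peptide appears at a proportionally lower m/z value.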
There are several types of mass analyzers, including ion trap, time-of-flight (TOF), quadrupole, magnetic sector, and Fourier transform ion cyclotron resonance (FT-MS) analyzers, each varying in analysis characteristics. These analyzers may be run separately or assembled in tandem to maximize the sensitivity and strengths of MS analysis. For example, a MALDI ion source is usually coupled to a TOF analyzer, but may also be coupled to quadrupole ion-trap and to combined TOF instruments or FT-MS. In TOF-TOF, two TOF sections are separated by a collision cell. In the hybrid quadrupole TOF apparatus, the collision cell is placed between a quadrupole mass filter and a TOF analyzer. These examples illustrate how “tandem” mass spectrometry apparatus may be assembled from intact MS apparatus or selected components of the instruments. The fundamental characteristic of tandem MS is the structural information obtained from the fragmentation pattern of the ion. The design of the tandem MS/MS instrument allows versatility and increased sensitivity depending on the goal of the analysis and the chemical composition of the analyte. Of the MS equipment available, MALDI-MS/MS is a preferred method for peptide analysis, although others may be used (Aebersold and Goodlett, 2001; Cramer and Corless, Rapid Comm. in Mass Spectrom. 15: 2058-2066, 2001; see Aebersold and Mann, 2003 for other MS instrument combinations).
Polypeptide analysis by mass spectrometry is facilitated by the ability to obtain an accurate mass measurement of a group of peptides derived from a protein by fragmentation that occurs at specific amino acid sequences after using specific cleavage enzymes for proteolysis. The principle behind protein identification assumes that proteins of different amino acid sequence will, after proteolysis with a defined protease, produce a collection of peptides the masses of which constitute protein mass fingerprints unique to a specific protein. If a sequence database containing the specific protein sequence is searched using selected masses based on the experimentally and accurately observed peptide mass fingerprint, combined with the fragmentation rules of the protease, then the protein is expected to be correctly identified within the database. As described in more detail below, there are several circumstances where the experimentally observed mass spectra do not translate into a correct prediction of the actual protein composition or sequence.
Protein identification by this method involves a few basic steps: (i) Peptides are generated by digestion of the sample protein using amino acid sequence-specific cleavage reagents that allow the residues at the carboxyl- or amino-terminus to be known with a reasonable degree of certainty. For example, the enzyme trypsin leaves arginine (R) or lysine (K) at the carboxyl-terminus of digestion fragments. Accordingly, the N-termini of tryptic peptides (except for the N-terminal one) may be identified as the amino acid following a K or R residue in the protein sequence. (ii) Following digestion, the masses of peptides or polypeptides are measured as accurately as possible in a mass spectrometer. (iii) The cleavage rules of the proteolytic method used in the experiment are applied by computer to each sequence in a sequence database to generate a list of theoretical peptide masses. (iv) An algorithm is used to compare the set of measured peptide masses against those sets of masses predicted for each protein in the database and to assign a score to each match that ranks the quality of the matches. This approach is frequently called “in silico” digestion, and correct protein identification by mass analysis depends on the correlation of the measured masses with corresponding data contained in a database. However, several difficulties exist with this approach. Obviously, for a protein to be identified, its sequence has to exist in the sequence database being used for comparison. Also, digests of protein mixtures present a problem for mass analysis because it is not readily apparent which peptides in the complex peptide mixture originate from a specific protein.
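The steps above can be sketched in a few lines of code. The following is a simplified illustration only, not the scoring method of any particular search engine: it assumes complete tryptic cleavage after every K or R (ignoring missed cleavages and other exceptions), uses a plain count of matching masses as the score, and searches a two-entry database whose names and sequences (`protA`, `protB`) are hypothetical. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values, 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(sequence):
    """Cut after every K or R (simplified rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[residue] for residue in peptide) + WATER


def count_matches(measured, theoretical, tol_da=0.01):
    """Number of measured masses within tol_da of any theoretical mass."""
    return sum(
        any(abs(m - t) <= tol_da for t in theoretical) for m in measured
    )


def rank_proteins(measured, database, tol_da=0.01):
    """Score every database entry and return (name, score), best first."""
    scores = {
        name: count_matches(
            measured, [peptide_mass(p) for p in tryptic_digest(seq)], tol_da
        )
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    database = {"protA": "GAKSRPV", "protB": "MKTTLR"}  # hypothetical entries
    measured = [peptide_mass(p) for p in tryptic_digest(database["protA"])]
    print(rank_proteins(measured, database))
```

In this toy case the measured masses were generated from the first entry, so that entry receives the highest score. With real data, some predicted peptides are not observed and some observed masses match no prediction, which is why practical scoring schemes are considerably more elaborate than a simple count.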
An increase in accuracy of measurement will decrease the potential error for matching an experimental mass to a corresponding mass in a sequence database, and therefore will increase the stringency of the database search.
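The effect of measurement accuracy on search stringency can be illustrated with a small sketch. Mass accuracy is commonly expressed in parts per million (ppm); the function names and the three candidate masses below are hypothetical values chosen only to show how a tighter tolerance removes false candidates.

```python
def ppm_error(measured, theoretical):
    """Signed mass error of a measurement in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def candidates_within(measured, theoretical_masses, tol_ppm):
    """Theoretical masses that agree with the measurement within tol_ppm."""
    return [
        t for t in theoretical_masses
        if abs(ppm_error(measured, t)) <= tol_ppm
    ]


if __name__ == "__main__":
    # Hypothetical database masses that lie close together near 1000 Da.
    database_masses = [1000.000, 1000.020, 1000.080]
    measured = 1000.005
    for tolerance in (100, 10):
        hits = candidates_within(measured, database_masses, tolerance)
        print(tolerance, "ppm:", hits)
```

At a tolerance of 100 ppm all three hypothetical masses remain candidates, while at 10 ppm only one does, so the more accurate measurement yields the more stringent search.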
If a pure protein is digested, and the resulting peptide masses are compared with the list of peptide masses predicted for that protein, two observations are typically made. First, not all of the predicted peptides are detected. Second, some of the measured peptide masses are not present in the list of masses predicted from the protein. The first problem, the missing masses, is usually due to a number of problems that can occur both before and during mass spectrometric analysis, such as poor solubility, selective adsorption, ion suppression, selective ionization, very short or very long peptide length, missed or inappropriate proteolytic cleavage, or other artifacts that cause sample loss or make specific peptides poorly detected or undetectable by MS. This is a critical drawback because missing peptide masses may contain meaningful biological information. Unfortunately, it is not possible to distinguish between trivial and meaningful missing masses without further experimentation. Therefore, unassigned peptide masses are a significant problem for protein identification by mass analysis and probably the single biggest source of misidentifications or missed identifications.
Fragment ion spectra are generated by a process called collision-induced dissociation (CID) in which the amide bonds of a peptide are broken, followed by recording of the fragment ion spectrum. Cleavage of amide bonds results in b-ions (containing the N-terminus) and y-ions (containing the C-terminus). High quality MS/MS spectra of tryptic peptides typically show prominent b and y-ion series. If only these two ions were produced for every amide bond in a 10 residue peptide, which has nine amide bonds, the fragment ion spectrum would contain 18 peaks. Ideally, long stable ion series of predominantly either the b or y-type would be recovered. In reality, peptide fragmentation is variable and moiety dependent, which leads to gaps and difficulties in analysis. Determining the identity and sequence of a peptide from its MS/MS spectrum is complicated by both the variety and the variability of the fragment ions produced. Factors that complicate interpretation of MS/MS spectra are missing ion subsets, internal rearrangements, subsequent fragmentations, and multiple charge states. Also to be considered are the relationship of fragment ion peak intensity to ion series origin and fragment masses, the influence of amino acid residues and their derivatives on neighboring amide bond cleavages, and the link between amino acid composition and neutral loss fragmentation.
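The idealized b- and y-ion series described above can be computed directly from a peptide sequence. The sketch below assumes singly charged fragments, an unmodified peptide, and exactly one b-ion and one y-ion per amide bond, which real spectra rarely provide. The ten-residue peptide `SAMPLEVGAK` and the function name are hypothetical; the masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values, 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i-1] holds the first i residues (N-terminal fragment);
    y_ions[i-1] holds the last i residues (C-terminal fragment).
    """
    masses = [MONO[residue] for residue in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


if __name__ == "__main__":
    b_ions, y_ions = fragment_ions("SAMPLEVGAK")  # hypothetical peptide
    print("peaks:", len(b_ions) + len(y_ions))
    # Successive y-ions differ by one residue mass, which is what allows
    # a sequence to be read from a complete series.
    print([round(hi - lo, 5) for lo, hi in zip(y_ions, y_ions[1:])])
```

Because adjacent members of a series differ by exactly one residue mass, a gap in the series leaves the order, and sometimes the identity, of the residues spanning that gap undetermined.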
There are currently several approaches to MS protein de novo sequencing that vary with the size and purity of the protein to be analyzed. Although some data have been published, the MS sequencing analysis of partially purified undigested proteins (termed top-down sequencing), or expression analysis of proteins from whole cells, is still technically difficult partly because of the sample complexity (Zabrouskov et al., Mol. Cell. Proteomics 2:1253, 2003; Sze et al., PNAS 99: 1774-1779, 2002).
Tandem MS analysis of peptides followed by computerized database searching is also common in high-throughput proteomics research. Recent advancements in multidimensional separation technologies and automated data collection and analysis have further increased the throughput of this method for analyzing polypeptides in biological samples. However, a major drawback of this method remains a strict dependence on high quality experimental MS spectra because a theoretical peptide sequence is determined by matching the experimental spectra with the theoretical ones generated in silico. Although more and more genomes of different organisms are being sequenced, the databases still fall short of the entire collection of model organisms currently employed in biological research. In addition, genome-derived predicted polypeptide sequence information often fails to reliably predict actual polypeptide information due to database errors, imperfect knowledge of transcript splicing (often employed in eukaryotic cells), and post-translational modifications of polypeptides. The number of post-translational chemical and enzymatic modifications known to occur for proteins and peptides continues to increase. Currently, over 200 post-translational modifications of proteins are known. As the variety, breadth and frequency of such modifications are appreciated, the probability of perfect mass spectral matches to database-generated MS spectra must decrease. Thus, these biological processes may greatly hamper database searching and accurate sequence determination of proteins in biological samples.
Recent publications show that improved approaches of MS analysis can identify protein isoforms originating from alternative mRNA splicing, single-point mutations, and co- and post-translational modifications (reviewed by Mann and Jensen, Nat. Biotech. 21: 255-261, 2003). Chemical derivatizations can be combined with affinity chromatography to identify specific amino acid modifications. Esterification of negatively charged amino acid residues before immobilized metal affinity column chromatography followed by MS/MS analysis improved identification of phosphopeptides (Ficarro et al., Nat. Biotechnol. 20: 301-305, 2002). MacCoss et al. used capillary multidimensional liquid chromatography followed by MS/MS analysis to analyze proteins digested with three different proteolytic enzymes and obtained sequence results for overlapping peptides, which reduced ambiguity in mapping modifications, and detected phosphorylation sites (MacCoss et al., PNAS 99: 7900-7905, 2002). Claverol et al. used a strategy combining gel separated proteins and ESI-MS/MS to determine phosphorylation and saccharidic motifs of casein (Claverol, et al., Mol. Cell. Proteomics 2: 483-493, 2003). Chemically induced protein modifications from toxin exposure were identified using a combination of MALDI-TOF with targeted LC-MS/MS (Person, et al., Chem. Res. Toxicol. 16: 598-608, 2003).
Cagney and Emili noted that their experimental results were typical of peptide MS/MS experiments in that long but incomplete y-ion series were observed (Cagney and Emili, 2002). Most de novo peptide MS/MS spectra are either incomplete or too complicated to be accurately interpreted for sequencing peptides. This is mainly due to difficulties of directionality (distinction of N-terminal ions from C-terminal ions), low efficiency of fragmentation, internal fragmentation, the presence of different types of ions generated during fragmentation (i.e., types b, y, a, c, x and z), the presence of an incomplete set of ions of the b and y series, and their tendency to lose NH3 and H2O groups. These various fragmentation ions can be generated in greatly varying amounts, each with a characteristic ability to be detected in the mass spectrometer. Thus, MS/MS spectra of polypeptides can present as a highly complex series of apparent masses present at greatly varying intensities. Due to the inherent complexity of MS/MS spectral appearance, de novo peptide sequencing has not fully been enabled for polypeptide sequence determination. The presence of sequence errors and compounding factors such as polymorphism, differential splicing, or protein post-translational modifications generates a need for effective de novo sequencing strategies (Cagney and Emili, 2002). There would be great advantage to proteomics if the sequence of peptides could be sequentially determined directly by MS/MS spectral analysis.
Attempts at de novo sequencing have focused on addressing the technical difficulties of directionality and labile peptide bonds to simplify or enhance the spectral readout while maintaining accuracy of amino acid definition. Additionally, not all peptides can be resolved due to inherent chemical structure and varied propensity to fragmentation during MS analysis. Several amino acids pose specific difficulties, e.g., isoleucine and leucine have identical masses (isomeric); the masses of lysine and glutamine are similar (isobaric) and difficult to distinguish; the amide bonds linking acidic amino acids aspartic acid and glutamic acid to other amino acids are more labile than other amide bonds, imparting a fragility to the peptide at these sites; the amino acid located just subsequent to the N-terminal amino acid tends to be resistant to fragmentation; and histidine and proline are very difficult to analyze, especially proline adjacent to aspartic acid. Given these technical difficulties and the complex data analysis required, it is not unexpected that faulty or incomplete mass spectral analysis would introduce errors in protein sequences in de novo protein sequencing.
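The mass relationships noted above for leucine/isoleucine and lysine/glutamine can be made concrete with a short calculation. This sketch uses standard monoisotopic residue masses; the helper name `ppm_needed` and the 1000 Da fragment mass are assumptions for illustration.

```python
# Monoisotopic residue masses in daltons (standard values, 5 decimals).
RESIDUE_MASS = {
    "L": 113.08406,  # leucine
    "I": 113.08406,  # isoleucine (isomer of leucine, same composition)
    "K": 128.09496,  # lysine
    "Q": 128.05858,  # glutamine
}


def mass_difference(a, b):
    """Absolute difference between two residue masses in daltons."""
    return abs(RESIDUE_MASS[a] - RESIDUE_MASS[b])


def ppm_needed(a, b, fragment_mass):
    """Mass accuracy (ppm) at which the two residues differ by exactly
    one tolerance width on a fragment of the given mass."""
    return mass_difference(a, b) / fragment_mass * 1e6


if __name__ == "__main__":
    print("L vs I:", mass_difference("L", "I"), "Da")
    print("K vs Q:", round(mass_difference("K", "Q"), 5), "Da")
    print("K vs Q on a 1000 Da fragment:",
          round(ppm_needed("K", "Q", 1000.0), 1), "ppm")
```

Leucine and isoleucine differ by zero and therefore cannot be separated by mass alone, whereas lysine and glutamine differ by only a few hundredths of a dalton, so distinguishing them on a large fragment requires a correspondingly accurate measurement or a chemical modification that shifts one of the two masses.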
Recently, MS/MS based methods including isotopic labeling and chemical derivatization have improved MS spectral readout (reviewed in Cagney and Emili, 2002). The use of 16O/18O labeling improves identification of y-ions, but also reduces the signal intensity (Munchbach et al., Anal. Chem. 72: 4047-4057, 2000; Uttenweiler-Joseph et al., Proteomics 1: 668, 2001). An alternative approach involves methyl esterification of the carboxyl groups in a peptide (Hunt, et al., PNAS 83: 6233, 1986; Goodlett, et al., Rapid Commun. Mass Spectrom. 15: 1214, 2001). This reaction increases the mass for aspartic and glutamic acid carboxylic side chains, and also modifies the C-terminal carboxyl group. However, for both isotopic labeling and methylation, the modified spectra must still be compared with the original, underivatized peptide spectra. Accordingly, chemical labeling of peptides may require additional experimental and computational steps that may slow down high-throughput sequencing.
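The mass increase produced by the methyl esterification approach discussed above can be estimated with a short sketch. It assumes that every carboxyl group (each aspartic acid side chain, each glutamic acid side chain, and the C-terminus) is completely converted to its methyl ester, each conversion adding one CH2 unit (14.01565 Da monoisotopic); the function name and example peptides are hypothetical.

```python
METHYL_ESTER_SHIFT = 14.01565  # Da added per carboxyl group (net CH2)


def esterification_shift(peptide):
    """Total mass increase after complete methyl esterification.

    Counts one carboxyl per D, one per E, and one for the C-terminus.
    Assumes a free C-terminal carboxyl and a complete reaction.
    """
    carboxyl_groups = peptide.count("D") + peptide.count("E") + 1
    return carboxyl_groups * METHYL_ESTER_SHIFT


if __name__ == "__main__":
    for peptide in ("GAK", "DEAK"):  # hypothetical tryptic peptides
        print(peptide, round(esterification_shift(peptide), 5))
```

Because the shift depends on the number of acidic residues, comparing the derivatized and underivatized masses of a peptide indicates how many carboxyl groups it contains, which is the reason both spectra are needed.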
Chemical modification of the N-terminus of a peptide before MS analysis has been found to improve MS analysis. The incorporation of a quaternary ammonium group at the N-terminus using the reactive N-hydroxysuccinimidyl ester enhanced sensitivity in MALDI MS (Bartlet-Jones, et al., Rapid Comm. Mass Spectrom. 8: 737, 1994). Cardenas et al. reacted peptides with N-succinimidyl-2-(3-pyridyl)acetate, followed by liquid chromatography separation and analysis by ESI-MS/MS (Cardenas, et al., Rapid Comm. Mass Spectrom. 11:1271-1278, 1997). This reaction modified the N-terminal amino acids and the amino group of lysine. Keough et al. reported that the addition of a sulfonic acid group to the N-terminus of tryptic peptides increases fragmentation sensitivity and produces much higher fragment ion yields than native peptides (WO 02/08767; 2003/0032056; WO 02/095419; PNAS 96: 7131-7134, 1999; Rapid Commun. Mass Spectrom. 15: 2227-2239, 2001). Destabilization of amide bonds by protonation of amide nitrogen produced extensive fragmentation under MALDI and ESI ionizing conditions (AP MALDI in combination with ion trap MS). The MS/MS spectra of sulfonated peptides containing aspartic acid, glutamic acid, and oxidized methionine showed more uniform fragmentation along the peptide backbone. Additionally, Keough et al. observed preferential fragmentation on the N-terminal side of proline residues, enhancing recognition of proline.
Chemical modification of the C-terminal amino acid of the peptide before analysis has been found to form longer, more stable series of y-ions. Several methods of C-terminal chemical modification have been reported for lysine. As noted above, trypsin digestion is routinely used in polypeptide analysis by MS because the resulting peptide fragments will reliably end in arginine (R) or lysine (K), thus establishing the C-terminal moiety. Although arginine is known to produce an exceptionally strong MS signal, lysine can be difficult to detect. However, lysine can be chemically modified to improve its signal (see Peters, WO 03/056299). This modification distinguishes the mass of lysine from that of glutamine. Cagney and Emili used a similar approach by differential guanidination of C-terminal lysines followed by LC-ESI-MS/MS analysis (Cagney and Emili, Nat. Biotech. 20: 163-170, 2002). Gu et al. (J. Am. Soc. Mass Spectrom. 14: 1-7, 2003) utilized a method incorporating deuterium-labeled (heavy) lysine.
Peters et al. (Peters, et al., WO 03/056299) described a different chemical derivatization method for C-terminal lysine and demonstrated that when the polypeptide's C-terminal lysine was modified by a particular class of reagents, for example 2-methoxy-4,5-dihydro-1-H-imidazole (referred to as “imidazole”), the complexity of the resulting MS/MS spectra was greatly reduced. Peters et al. noted that the y-ion series identification was improved thereby permitting assignment of amino acid sequences more accurately.
Simplification of MS/MS spectra by chemical derivatization of peptides, and the subsequently improved ability to identify the amino acid sequence data, illustrates the potential for developing high quality fragmentation spectra and obtaining long, complete b-ion and especially y-ion series, and offers a practical approach to de novo sequencing. An improved resolution in de novo mass measurements increases the accuracy of sequence determination and decreases reliance on predictive in silico sequence analysis of proteins. However, while chemical modification can increase the reliability and utility of MS analysis and improve the capability for de novo sequencing, several uniquely problematic technical challenges have not been solved, and numerous biologically important characteristics of peptides cannot currently be elucidated by existing MS techniques. Moreover, the reliance on computer databases for peptide sequences and protein identification always involves predictions and approximations rather than experimental data, and thereby increases the possibility for error that cannot be detected from the data. Therefore, ideally, the mass analysis of polypeptides would yield an accurate and reliable polypeptide sequence through de novo identification of each amino acid in the peptide.
Given the inherent complexity of peptide fragmentation and the difficulties of MS spectral analysis, a combination of different methods for chemical derivatization of peptides has not been completely developed. For proteomics and analysis of complex mixtures of peptides, it is accepted that only very simple and extremely efficient chemical derivatization steps are compatible with proteomics. If any heterogeneity is introduced by the chemical reaction, the peptide samples become even more complex, thereby complicating the MS analysis and subsequent data processing. (Mann and Jensen, Nat. Biotech. 21:255-261, 2003). Therefore, although chemical derivatization is a known procedure for use in mass spectrometry, the use of multiple discrete derivatization techniques would be expected to introduce significant complexity and complication to a peptide mass analysis and the use of de novo sequencing for a complete determination of the linear amino acid sequence of a peptide is still difficult.