The large-scale study of the structure and function of proteins and peptides is often termed proteomics. Proteins are vital parts of living organisms, as they are the main components of most physiological pathways in cells. The term “proteomics” was coined by analogy with genomics, the study of genes. The word “proteome” is a portmanteau of “protein” and “genome”. The proteome of an organism is the set of proteins that are produced by the organism during its life, and its genome is its total set of genes.
Proteomics is often considered the next step in the study of biological systems, after genomics. The genome contains all the information required to construct an organism's protein complement. However, proteomics is much more complicated than genomics, mostly because an organism's genome is rather constant, whereas a proteome differs from cell to cell and constantly changes through biochemical interactions dictated by its immediate environment. One organism has the same genome in nearly every cell; nevertheless, it may have a radically different protein expression profile in different parts of its body, at different stages of its life cycle and under different environmental conditions. Another major difficulty is the complexity of proteins relative to genes. For example, the human genome consists of approximately 25 000 genes, but it is estimated that more than 500 000 proteins can be derived from these genes. This increased complexity derives from mechanisms such as alternative splicing, post-translational protein modification (such as glycosylation or phosphorylation) and protein degradation.
Since proteins play a central role in the life of an organism, proteomics is instrumental in the discovery of biomarkers, that is, components that can indicate a particular disease. Current research in proteomics requires that the primary sequence of proteins be resolved, sometimes on a massive scale.
Many techniques have been developed for protein sequencing, including deriving the amino acid sequence from a DNA or RNA sequence, or determining it directly from the protein itself, for example by Edman degradation or by mass spectrometry. Nowadays, mass spectrometry is generally the method of choice for direct protein sequencing, and a typical proteomics analysis may consist of the following five stages.
In stage 1, the proteins to be analysed are isolated from a biological source such as a cell lysate or tissue, for instance by biochemical fractionation or affinity selection. This stage often includes a final step of two-dimensional gel electrophoresis, which usually separates proteins first by isoelectric point and then by molecular weight. Protein spots in a gel can be visualized using a variety of chemical stains or fluorescent markers. Proteins can often be quantified by the intensity of their stain. Once proteins are separated and quantified, they may be identified. Individual spots are cut out of the gel so that they contain a purified single protein species. Mass spectrometric analysis of whole proteins is less sensitive than that of smaller peptides, and the mass and charge of the intact protein are by themselves insufficient to identify its primary amino acid sequence. Mass spectrometry can, in principle, sequence a protein of any size, but the problem becomes computationally more difficult as the size increases. Proteolytic peptides are also easier to prepare for mass spectrometry than whole proteins, because they are more soluble. Therefore, proteins are preferably degraded into smaller proteolytic peptides in stage 2, for instance through enzymatic digestion. It should be noted that in certain cases stage 1 may be omitted and the analyte of interest is subjected directly to stage 2.
In stage 2, degradation typically occurs enzymatically, for instance by trypsin digestion. Trypsin is a serine endoprotease found in the digestive system and catalyses the hydrolysis of peptide bonds, leading to proteolytic peptide fragments with C-terminally protonated amino acids. Trypsin predominantly cleaves proteins at the carboxyl side (or C-terminal side) of the amino acids lysine and arginine, except when either is followed by proline. In order to generate overlapping proteolytic peptide fragments, it is advantageous to use multiple enzymes with different specificities in this stage. Whereas trypsin is most commonly used, other enzymes employed for this purpose include pepsin, elastase, Lys-C, V8 (a Glu-C endoproteinase) and chymotrypsin.
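The cleavage rule described above can be illustrated with a short in-silico digestion. The following is a minimal sketch only: the function name `trypsin_digest` and the example sequence are hypothetical, and missed cleavages, which occur in practice, are not modelled.

```python
def trypsin_digest(sequence):
    """Cut after K or R, unless the next residue is P (no missed cleavages)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        # Residue following the candidate cleavage site ("" at the C-terminus).
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep the C-terminal remainder if the sequence does not end in K or R.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence; the R at position 4 is followed by P and is not cut.
    print(trypsin_digest("AGKRPLMRTTKGG"))
```

Applying the same function with a different cleavage rule (for example, cutting after E for a Glu-C type enzyme) would yield a second, overlapping set of peptides, which is the reason given above for using several enzymes.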
In stage 3, the proteolytic peptide fragments are separated and delivered to the mass spectrometer. Separation may be achieved by one or more steps of liquid chromatography (LC), such as high-pressure liquid chromatography (HPLC) using narrow-bore (often below 100 micron) columns. One method of delivering the peptides to the spectrometer is electrospray ionization (ESI). At the end of the HPLC column, the solution is sprayed into the mass spectrometer out of a narrow nozzle charged to a high positive potential. The charge on the droplets causes them to break up until only intact protonated proteolytic peptides remain, often termed peptide precursor ions. Matrix-assisted laser desorption/ionization (MALDI) is another technique commonly used to volatilize and ionize the proteolytic peptides for mass spectrometric analysis. ESI ionizes the analytes out of a solution and is therefore readily coupled to liquid-based (for example, chromatographic and electrophoretic) separation tools, whereas MALDI sublimates and ionizes the samples out of a dry, crystalline matrix via laser pulses. MALDI-MS is normally used to analyze relatively simple peptide mixtures, whereas integrated liquid-chromatography ESI-MS systems (LC-MS) are preferred for the analysis of complex samples.
In stage 4, a mass spectrum of the peptides eluting at a particular time point is taken (MS1 spectrum, or ‘normal mass spectrum’). Mass spectrometric measurements are carried out in the gas phase on ionized analytes (protonated proteolytic peptides, peptide precursor ions). By definition, a mass spectrometer consists of an ion source, a mass analyser that measures the mass-to-charge ratio (m/z) of the ionized analytes, and a detector that registers the number of ions at each m/z value.
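Because the analyser reports m/z rather than mass, the same peptide appears at different positions in the spectrum depending on how many protons it carries. The sketch below shows the conversion for protonated ions; the function names are illustrative, and it assumes that all charge is carried by added protons, as for the peptide precursor ions described above.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def mz_from_mass(neutral_mass, charge):
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def mass_from_mz(mz, charge):
    """Neutral mass recovered from an observed m/z and an assumed charge."""
    return charge * (mz - PROTON_MASS)


if __name__ == "__main__":
    # A hypothetical 1000 Da peptide observed in charge states 1+ to 3+.
    for z in (1, 2, 3):
        print(z, round(mz_from_mass(1000.0, z), 4))
```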
In stage 5, the computer generates a prioritized list of these peptide precursor ions for fragmentation and a series of tandem mass spectrometric or ‘MS/MS’ analyses ensues. The first stage of tandem MS/MS isolates individual peptide precursor ions, and the second breaks the peptide precursor ions into peptide fragment ions and uses the fragmentation pattern to determine their amino acid sequences. The MS and MS/MS spectra are typically acquired for about one second each and stored for matching against protein sequence databases.
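How the prioritized list is built is not specified above. A common choice, assumed here purely for illustration, is to rank the precursor ions of an MS1 spectrum by intensity and to fragment the most intense ones first. The function name, the `(m/z, intensity)` tuple layout and the value of `top_n` are all hypothetical.

```python
def prioritize_precursors(ms1_peaks, top_n=3):
    """Return the top_n (mz, intensity) peaks, most intense first.

    Assumption: priority is by intensity only. Real instruments may also
    consider charge state or exclude ions that were fragmented recently.
    """
    ranked = sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical MS1 peak list: (m/z, intensity).
    peaks = [(400.2, 1.0e4), (512.3, 5.0e5), (623.8, 2.0e5), (700.1, 3.0e3)]
    for mz, intensity in prioritize_precursors(peaks, top_n=2):
        print(mz, intensity)
```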
The outcome of the analysis is the amino acid sequence of the proteolytic peptide fragments and therefore the peptides and proteins making up the (purified) protein population.
In mass spectrometry, collision-induced dissociation (CID), referred to by some as collisionally activated dissociation (CAD), is currently the fragmentation method of choice. CID fragments peptide precursor ions in the gas phase. The peptide precursor ions are usually accelerated by an electrical potential to high kinetic energy in the vacuum of a mass spectrometer and then allowed to collide with neutral gas molecules (often helium, nitrogen or argon). In the collisions, some of the kinetic energy is converted into internal energy, which results in bond breakage and the fragmentation of the peptide precursor ion into smaller fragments. These fragment ions can then be analyzed by a mass spectrometer.
CID is frequently used as part of tandem mass spectrometry in proteomics analyses. While CID is currently the most popular method for standard tandem mass spectrometry, there are also other fragmentation methods, for example electron transfer dissociation (ETD) and electron capture dissociation (ECD).
These different fragmentation techniques lead to the appearance of different types of ion fragments. A nomenclature for various ion types was first suggested by P. Roepstorff and J. Fohlman (Proposal for A Common Nomenclature for Sequence Ions In Mass-Spectra of Peptides. Biomed. Mass Spectrom. 11, 601-601 (1984)) and subsequently modified as described by K. Biemann (Contributions of Mass-Spectrometry to Peptide and Protein-Structure. Biomed. and Env. Mass Spectrom. 16, 99-111 (1988)). Typically y, b (cleavage of the peptide bond) and a fragments (formally a loss of CO from a b ion) are observed in CID. This is schematically depicted in FIG. 1.
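The relationship between a peptide sequence and its a, b and y ions can be made concrete with a small calculation. The sketch below computes singly protonated, monoisotopic fragment m/z values for an unmodified peptide; the function name and the example peptide are illustrative, and modifications, higher charge states and neutral losses other than the CO loss that defines the a ion are not considered.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # H+
CO = 27.994915      # carbon monoxide, lost from a b ion to give an a ion


def fragment_ions(peptide):
    """Singly charged a, b and y ion m/z values, indexed from fragment 1."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {"a": [], "b": [], "y": []}
    for i in range(1, len(peptide)):
        b = sum(masses[:i]) + PROTON            # N-terminal fragment
        y = sum(masses[-i:]) + WATER + PROTON   # C-terminal fragment
        ions["b"].append(b)
        ions["a"].append(b - CO)
        ions["y"].append(y)
    return ions


if __name__ == "__main__":
    for ion_type, values in fragment_ions("GAK").items():
        print(ion_type, [round(v, 4) for v in values])
```

A useful check on such a table is that each complementary pair (b ion i and y ion n-i of an n-residue peptide) sums to the neutral peptide mass plus two protons.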
ECD is usually considered a more direct fragmentation technique than CID. In contrast to CID, ECD involves the introduction of low-energy electrons to trapped gas-phase ions. ECD produces significantly different types of fragment ions than CID. The unique (and complementary) fragments observed and the ability to fragment whole macromolecules effectively have been considered the most promising features of ECD. However, low efficiencies and other technical difficulties have prevented its widespread use. ECD is primarily used in Fourier transform ion cyclotron resonance mass spectrometry.
ETD does not use free electrons but employs radical anions, for example those of anthracene or azobenzene. When these anions react with positively charged peptide precursor ions, an electron is transferred, leading to the formation of c and z peptide fragment ions (FIG. 1). ETD cleaves the peptide backbone randomly along its length, while side chains and modifications such as phosphorylation are usually left intact. The technique works well for ions of higher charge state (z>2).
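The c and z ions are shifted by fixed amounts relative to the b and y ions that contain the same residues, which the following sketch makes explicit. It assumes singly charged, monoisotopic ions and reports the radical form of the z ion (often written z-dot or z+1); the function names are illustrative. Note that the fragment label z is unrelated to the charge state z mentioned above.

```python
NH3 = 17.026549     # Da, ammonia
H_ATOM = 1.007825   # Da, hydrogen atom


def c_from_b(b_mz):
    """c ion m/z from the b ion containing the same residues (1+ charge)."""
    return b_mz + NH3


def z_dot_from_y(y_mz):
    """Radical z ion m/z from the y ion containing the same residues (1+)."""
    return y_mz - NH3 + H_ATOM


if __name__ == "__main__":
    # b2 and y1 of the hypothetical peptide GAK, singly protonated.
    print(round(c_from_b(129.065846), 4))
    print(round(z_dot_from_y(147.112801), 4))
```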
Analysis of some post-translational modifications (PTMs), such as phosphorylation, sulfonation, and glycosylation, is difficult with CID since the modification is often labile and preferentially lost over peptide backbone fragmentation, resulting in little to no peptide sequence information. The presence of multiple basic residues also makes peptides exceptionally difficult to sequence by conventional CID mass spectrometry. In a recent review, the utility of ETD mass spectrometry for sequence analysis of post-translationally modified and/or highly basic peptides was investigated (Molina et al., Proceedings of the National Academy of Sciences of the United States of America 104, (2007) 2199-2204). Phosphorylated, sulfonated, glycosylated, nitrosylated, disulfide bonded, methylated, acetylated, and highly basic peptides were analyzed by CID and ETD mass spectrometry. It was concluded that ETD is an excellent method for localization of phosphorylation sites. This illustrates the utility of ETD as an advantageous tool in phosphoproteomics research.
Protein identifications using peptide CID spectra are more clear-cut than those achieved by mass mapping because, in addition to the peptide mass, the peak pattern in the CID spectrum also provides information about peptide sequence.
This information, however, is not readily convertible into a full, unambiguous peptide sequence. CID is therefore generally considered not very suitable for automated de novo sequencing. Instead, the CID spectra are scanned against comprehensive protein sequence databases using one of a number of different algorithms, each with its strengths and weaknesses.
The ‘peptide sequence tag’ approach extracts a short, unambiguous amino acid sequence from the peak pattern that, when combined with the mass information, is a specific probe to determine the origin of the peptide.
In the ‘cross-correlation’ method, peptide sequences in the database are used to construct theoretical mass spectra and the overlap or ‘cross-correlation’ of these predicted spectra with the measured mass spectra determines the best match.
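The idea of scoring predicted spectra against a measured spectrum can be sketched in a strongly simplified form. The example below is not the cross-correlation used by any particular search engine: it merely bins the m/z values and counts the bins shared by the observed and theoretical peak lists, ignoring intensities. All names, the bin width and the candidate peak lists are assumptions made for illustration.

```python
def binned(peaks, bin_width=1.0):
    """Map m/z values to integer bin indices."""
    return {int(mz / bin_width) for mz in peaks}


def overlap_score(observed, theoretical, bin_width=1.0):
    """Number of bins occupied in both spectra (intensities are ignored)."""
    return len(binned(observed, bin_width) & binned(theoretical, bin_width))


def best_match(observed, candidates, bin_width=1.0):
    """Name of the candidate whose theoretical peaks overlap most."""
    return max(
        candidates,
        key=lambda name: overlap_score(observed, candidates[name], bin_width),
    )


if __name__ == "__main__":
    observed = [129.07, 147.11, 218.15]
    # Hypothetical database entries with rounded b and y ion m/z values.
    candidates = {
        "GAK": [58.03, 129.07, 147.11, 218.15],
        "AGK": [72.04, 129.07, 147.11, 204.13],
    }
    print(best_match(observed, candidates))
```

Fixed-width binning is crude: two peaks that differ by less than the tolerance can still fall on either side of a bin boundary, which is one reason practical implementations are more elaborate.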
In the third main approach, ‘probability based matching’, the calculated fragments from peptide sequences in the database are compared with observed peaks. From this comparison a score is calculated which reflects the statistical significance of the match between the spectrum and the sequences contained in a database.
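One simple way to turn a count of matching fragments into a statistical score is a binomial model, sketched below. This is an illustrative assumption and not the scoring function of any specific search engine: each of the calculated fragments is taken to match an observed peak by chance with the same probability `p_random`, independently of the others, and the score is reported as -10 log10 of the probability of obtaining at least the observed number of matches by chance.

```python
from math import comb, log10


def chance_probability(n_fragments, n_matched, p_random):
    """P(at least n_matched of n_fragments match by chance), binomial model."""
    return sum(
        comb(n_fragments, k) * p_random**k * (1 - p_random) ** (n_fragments - k)
        for k in range(n_matched, n_fragments + 1)
    )


def match_score(n_fragments, n_matched, p_random):
    """-10 * log10 of the chance probability; higher means less likely random.

    Assumes 0 < p_random < 1 so that the probability is never zero.
    """
    return -10 * log10(chance_probability(n_fragments, n_matched, p_random))


if __name__ == "__main__":
    # Hypothetical case: 10 of 12 calculated fragments found, 5% chance each.
    print(round(match_score(12, 10, 0.05), 1))
```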
In each of these methods the identified peptides are compiled into a protein ‘hit list’, which is the output of a typical proteomic analysis. Because protein identifications rely on matches with sequence databases, high-throughput proteomics is currently restricted largely to those species for which comprehensive sequence databases are available.
For those species where no genomic sequence information is available, amino acid sequencing can only be done “de novo”, i.e. without database matching. The spectra usually obtained in the above-described prior art methods are hardly, if at all, suited for de novo sequencing, since they do not provide unambiguous and easy-to-read sequence information.
The present invention addresses this problem and provides a method for determining the amino acid sequence of a peptide wherein MS spectra are produced with very clear and unambiguous sequence information which is interpretable without the help of comprehensive databases. The method according to the invention provides MS/MS spectra containing predominantly c ions in sequential order. The method according to the invention is therefore particularly suited for de novo sequencing. Moreover, it provides an improved method for the analysis of post-translational modifications of proteins.
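Why a spectrum dominated by a single ion series in sequential order is easy to interpret can be illustrated as follows. The sketch is not the claimed method; it only shows that, given an uninterrupted ladder of singly charged c ions, the mass difference between neighbouring ions identifies each residue without a database. The function name, the tolerance and the example ladder are assumptions.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ion_mz, tol=0.01):
    """Residues implied by the gaps between consecutive ions of one series.

    ion_mz must be sorted in ascending order and singly charged. Isobaric
    residues are reported together (e.g. "L/I"); unexplained gaps as "?".
    """
    residues = []
    for lower, upper in zip(ion_mz, ion_mz[1:]):
        delta = upper - lower
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol]
        residues.append("/".join(matches) if matches else "?")
    return residues


if __name__ == "__main__":
    # Hypothetical c1 to c4 ions of a peptide starting G-A-L-K.
    ladder = [75.055285, 146.092395, 259.176455, 387.271415]
    print(read_ladder(ladder))
```

The first residue is not obtained from a difference; it follows from the first c ion itself after subtracting the mass of ammonia and a proton. Leucine and isoleucine have the same mass and cannot be distinguished in this way.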