1. Field of the Invention
This invention relates to methods of identifying a protein, polypeptide or peptide by means of mass spectrometry and especially by tandem mass spectrometry (MS/MS). Preferred methods relate to the use of mass spectral data to identify an unknown protein whose sequence is at least partially present in an existing database.
2. Discussion of the Prior Art
Although several well-established chemical methods for the sequencing of peptides, polypeptides and proteins are known (for example, the Edman degradation), mass spectrometric methods are becoming increasingly important in view of their speed and ease of use. Mass spectrometric methods have been developed to the point at which they are capable of sequencing peptides in a mixture without any prior chemical purification or separation, typically using electrospray ionization and tandem mass spectrometry (MS/MS). For example, see Yates III (J. Mass Spectrom, 1998 vol. 33 pp. 1-19), Papayannopoulos (Mass Spectrom. Rev. 1995, vol. 14 pp. 49-73), and Yates III, McCormack, and Eng (Anal. Chem. 1996 vol. 68 (17) pp. 534A-540A). Thus, in a typical MS/MS sequencing experiment, molecular ions of a particular peptide are selected by the first mass analyzer and fragmented by collisions with neutral gas molecules in a collision cell. The second mass analyzer is then used to record the fragment ion spectrum that generally contains enough information to allow at least a partial, and often the complete, sequence to be determined.
Unfortunately, however, the interpretation of the fragment spectra is not straightforward. Manual interpretation (see, for example, Hunt, Yates III, et al, Proc. Nat. Acad. Sci. USA, 1986, vol. 83 pp 6233-6237 and Papayannopoulos, ibid) requires considerable experience and is time consuming. Consequently, many workers have developed algorithms and computer programs to automate the process, at least in part. The nature of the problem, however, is such that none of those so far developed are able to provide in reasonable time complete sequence information without either requiring some prior knowledge of the chemical structure of the peptide or merely identifying likely candidate sequences in existing protein structure databases. The reason for this will be understood from the following discussion of the nature of the fragment spectra produced.
Typically, the fragment spectrum of a peptide comprises peaks belonging to about half a dozen different ion series, each of which corresponds to a different mode of fragmentation of the peptide parent ion. Each typically (but not invariably) comprises peaks representing the loss of successive amino acid residues from the original peptide ion. Because all but two of the 20 amino acids of which most naturally occurring proteins are composed have different masses, it is possible to establish the sequence of amino acids from the difference in mass of peaks in any given series which correspond to the successive loss of an amino acid residue from the original peptide. However, difficulties arise in identifying to which series an ion belongs and from a variety of ambiguities that can arise in assigning the peaks, particularly when certain peaks are either missing or unrecognized. Moreover, other peaks are typically present in a spectrum due to various more complicated fragmentation or rearrangement routes, so that direct assignment of ions is fraught with difficulty. Further, electrospray ionization tends to produce multiply charged ions that appear at correspondingly rescaled masses, which further complicates the interpretation of the spectra. Isotopic clusters also lead to proliferation of peaks in the observed spectra. Thus, the direct transformation of a mass spectrum to a sequence is only possible in trivially small peptides.
The reverse route, transforming trial sequences to predicted spectra for comparison with the observed spectrum, should be easier, but has not been fully developed. The number of possible sequences for any peptide (20^n, where n is the number of amino acids comprised in the peptide) is very large, so the difficulty of finding the correct sequence for, say, a peptide of a mere 10 amino acids (20^10, or approximately 10^13, possible sequences) will be appreciated. The number of potential sequences increases very rapidly both with the size of the peptide and with the number (at least 20) of the residues being considered.
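By way of illustration only, the growth in the number of candidate sequences may be tabulated with a short calculation. The function name and the choice of peptide lengths below are hypothetical and serve only to make the arithmetic concrete.

```python
def count_sequences(n_residues, alphabet_size=20):
    """Number of distinct linear sequences of n_residues drawn from alphabet_size residue types."""
    return alphabet_size ** n_residues


# Illustrative peptide lengths (chosen arbitrarily for this sketch).
for n in (5, 10, 15):
    print(f"{n:2d} residues: {count_sequences(n):.2e} possible sequences")
```

For a peptide of 10 residues the count is 20^10, which is slightly more than 10^13, in agreement with the figure given above.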
Details of the first computer programs for predicting probable amino acid sequences from mass spectral data appeared in 1984 (Sakurai, Matsuo, Matsuda, Katakuse, Biomed. Mass Spectrom, 1984, vol. 11 (8) pp 397-399). This program (PAAS3) searched through all the amino acid sequences whose molecular weights coincided with that of the peptide being examined and identified the most probable sequences with the experimentally observed spectra. Hamm, Wilson and Harvan (CABIOS, 1986 vol. 2 (2) pp 115-118) also developed a similar program.
However, as pointed out by Ishikawa and Niwa (Biomed. and Environ. Mass Spectrom. 1986, vol. 13 pp 373-380), this approach is limited to peptides not exceeding 800 daltons in view of the computer time required to carry out the search. Parekh et al in UK patent application 2,325,465 (published November 1998) have resurrected this idea and give an example of the sequencing of a peptide of 1000 daltons which required 2×10^6 possible sequences to be searched, but do not specify the computer time required. Nevertheless, despite the increase in the processing speed of computers between 1984 and 1999, a simple search of all possible sequences for a peptide of molecular weight greater than 1200 daltons is still impractical in a reasonable time using the personal computer typically supplied for data processing with most commercial mass spectrometers.
This problem has long been recognized and several approaches to rendering the problem more tractable have been described. One of the most successful has been to correlate the mass spectral data with the known amino acid sequences comprised in a protein database rather than with every possible sequence. In the prior method known as peptide mass mapping, a protein may be identified by merely determining the molecular weights of the peptides produced by digesting it with a site-specific protease and comparing the molecular weights with those predicted from known proteins in a database. (See, for example, Yates, Speicher, et al in Analytical Biochemistry, 1993 vol 214 pp 397-408). However, mass mapping is ineffective if a protein or peptide comprises only a small number of amino acids residues or possible fragments, and is inapplicable if information about the actual amino acid sequences is required. As explained, tandem mass spectrometry (MS/MS) can be used to provide such sequence information. MS/MS spectra usually contain enough detail to allow a peptide to be at least partially, and often completely sequenced without reference to any database of known sequences (See copending application GB 9907810.7, filed Apr. 6, 1999). There are, however, many circumstances where it is adequate, or even preferred, to establish sequences by reference to an existing database. Such methods were pioneered by Yates, et al, see, for example, PCT application 95/25281, Yates (J. Mass Spectrom 1998 vol 33 pp 1-19), Yates, Eng et al (Anal. Chem. 1995 vol 67 pp 1426-33). Other workers, including Mortz et al (Proc. Nat. Acad. Sci. USA, 1996 vol 93 pp 8264-7), Figeys, et al (Rapid Commun. Mass Spectrom. 1998 vol 12 pp 1435-44), Jaffe, et al, (Biochemistry, 1998 vol 37 pp 16211-24), Amot et al (Electrophoresis, 1998 vol 19 pp 968-980) and Shevchenko et al (J. Protein Chem. 1997 vol 16 (5) pp 481-490) report similar approaches.
As explained, it is generally easier to predict a fragmentation mass spectrum from a given amino acid sequence than to carry out the reverse procedure when comparing experimental MS data with sequence databases. A "fragmentation model" that describes the various ways in which a given amino acid sequence may fragment is therefore required. The chemical processes which result in fragmentation are fairly well understood, but because the number of possible routes increases very rapidly with the number of amino acid residues in a sequence it is difficult to build this knowledge into a definite model. The fragmentation models so far proposed (for example Eng et al, J. Am. Soc. Mass Spectrom, 1994 vol 5 pp 976-89) typically incorporate only a small number of possible fragmentation routes and typically produce a predicted spectrum in which all the mass peaks have equal probability. This constrained approach compromises the accuracy of the comparison with an experimental spectrum, which is likely to represent the sum of many different fragmentation pathways operating simultaneously with different degrees of importance. Consequently the degree of confidence that can be placed in the identification of a sequence on the basis of the prior fragmentation models is reduced and the chance of an incorrect identification is increased.
As explained in our copending application (GB 9907810.7, filed Apr. 6, 1999) a realistic fragmentation model is also required to predict spectra from pseudo-randomly generated trial sequences (as opposed to existing sequences comprised in a database). The fragmentation models described in the present application are applicable to both approaches.
It is an object of the present invention to provide an improved method of modelling the fragmentation of a peptide or protein in a tandem mass spectrometer to facilitate comparison with an experimentally determined spectrum. It is another object of the invention to provide such a fragmentation model which takes account of all possible fragmentation pathways which a particular sequence of amino acids may undergo. A further object of the invention is to provide methods of identifying a peptide or protein by comparing an experimentally determined mass spectrum with spectra predicted using such a fragmentation model from a library of known peptides or proteins. It is yet another object of the invention to provide a de novo method of determining the amino acid sequence of an unknown peptide using such a fragmentation model.
In accordance with these objectives the invention provides a method of identifying the most probable amino acid sequences which would account for the mass spectrum of a protein or peptide, said method comprising the steps of:
a) producing a processable mass spectrum from said peptide; and
b) using a fragmentation model to calculate the likelihood that any given trial amino acid sequence would account for said processable spectrum, said fragmentation model comprising the step of summing probabilistically a plurality of fragmentation routes which together represent the possible ways that said trial sequence might fragment in accordance with a set of predefined rules, each said fragmentation route being assigned a prior probability appropriate to the chemical processes involved.
In preferred methods, said plurality of fragmentation routes represent all the possible ways that a said trial sequence might fragment.
Preferably the fragmentation model is based on the production of at least two series of ions, the b series (which comprises ions representing the N-terminal residue of the trial sequence and the loss of C-terminal amino acid residues), and the y″ series (which comprises ions representing the C-terminal residue and the loss of N-terminal amino acid residues). Each family of ions behaves as a coherent series, with neighbouring ions likely to be either both present or both absent. This behaviour may be described by a Markov chain, in which the probability of an ion being observed is influenced by whether or not its predecessor was observed. The parameters of the chain may be adjusted to take account of the proton affinities of the residues and their physical bond strengths. The fragmentation model may be refined by including other ion series, particularly the a series (b ions which have lost CO), the z″ series (y″ ions which have lost NH3), and the more general loss of NH3 or H2O, again taking account of the probability of the chemical processes involved. Immonium ions equivalent to the loss of CO and H from the various amino acid residues may also be included. Further, the fragmentation model may comprise the generation of sub-sequences of amino acids, that is, sequences that begin and end at amino acid residues internal to the unknown peptide. It will be appreciated that the more realistic the fragmentation model, the better will be the accuracy and fidelity of the computation of the most probable sequences. It is therefore envisaged that different fragmentation models may be employed if advances are made in understanding the chemical mechanism by which the mass spectrum of the peptide is produced.
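The b and y″ series described above may be illustrated by the following minimal sketch, which computes the mass-to-charge ratios of the singly charged ions of each series for a trial sequence. It is not part of the claimed method. The residue masses are standard monoisotopic reference values rather than values taken from this specification, and the function names and the example sequence are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard reference values, assumed here).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton, daltons
WATER = 18.01056   # monoisotopic mass of H2O, daltons


def b_ions(seq):
    """Singly charged b-series m/z values b1..b(n-1), built up from the N-terminus."""
    values, total = [], 0.0
    for residue in seq[:-1]:
        total += RESIDUE_MASS[residue]
        values.append(total + PROTON)
    return values


def y_ions(seq):
    """Singly charged y''-series m/z values y1..y(n-1), built up from the C-terminus."""
    values, total = [], 0.0
    for residue in reversed(seq[1:]):
        total += RESIDUE_MASS[residue]
        values.append(total + WATER + PROTON)
    return values


# Hypothetical trial sequence, used only to exercise the functions.
trial = "PEPTIDE"
for i, (b, y) in enumerate(zip(b_ions(trial), y_ions(trial)), start=1):
    print(f"b{i} = {b:9.4f}   y{i} = {y:9.4f}")
```

In this sketch leucine and isoleucine carry the same mass, which reflects the ambiguity between isobaric residues noted earlier. Complementary b and y″ ions sum to the neutral peptide mass plus two protons, which gives a convenient consistency check.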
Each of the chemical processes described above may be assigned a prior probability on the basis of the physical strength of the bonds broken in the proposed fragmentation step and the proton affinities of the various amino acid residues, thereby enabling the prior probability of each complete fragmentation route to be calculated. However, using Markov chains to model each of the ion series produced (e.g., the b or y″ series) means that it is unnecessary to compute an explicit spectrum for every possible fragmentation route for comparison with the processable spectrum. Instead, the method of the invention arrives at the same result by using the Markov chain representation of the various ion series to factorize the comparison, so that the likelihood summed over all the fragmentation routes can be computed in polynomial time (in the most preferred embodiment, linear time). This summed likelihood is a better basis for comparison with the processable spectrum than the likelihood or other score derived from a single fragmentation route, such as would be produced by prior fragmentation models, because the fragmentation of a real peptide involves many simultaneous routes. By the use of a fully probabilistic fragmentation model, therefore, the method of the invention automatically accounts, in a quantitative sense, for this multiplicity of routes.
As explained, using Markov chains to model the fragmentation process allows the sum over all the possible fragmentation patterns to be calculated in linear time (ie, in a time proportional to the number of amino acid residues in the peptide) rather than in a time proportional to the exponentially large number of fragmentation patterns themselves. However, it will be appreciated that the invention is not limited to the particular fragmentation model described above, but includes any probabilistic fragmentation model that can be integrated computationally in polynomial time.
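The factorization referred to above may be illustrated by the following minimal sketch, which assumes a single ion series modelled as a two-state Markov chain (ion absent or ion present at each cleavage position). The start probabilities, transition probabilities and per-position evidence terms are hypothetical numbers chosen for illustration; they are not parameters disclosed in this specification. The recursion sums over every presence/absence pattern in a time proportional to the number of positions, and a brute-force enumeration is included only to confirm that both give the same result.

```python
from itertools import product


def summed_likelihood(evidence, start, trans):
    """Sum the likelihood over all presence/absence patterns in linear time.

    evidence[i][s] : likelihood of the spectral data at position i given state s
                     (s = 0 ion absent, s = 1 ion present)
    start[s]       : prior probability of state s at the first position
    trans[p][s]    : probability of state s given the previous state p
    """
    alpha = [start[s] * evidence[0][s] for s in (0, 1)]
    for e in evidence[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in (0, 1)) * e[s] for s in (0, 1)]
    return sum(alpha)


def brute_force_likelihood(evidence, start, trans):
    """Same quantity by explicit enumeration of all 2^n patterns (exponential time)."""
    total = 0.0
    for states in product((0, 1), repeat=len(evidence)):
        p = start[states[0]] * evidence[0][states[0]]
        for i in range(1, len(states)):
            p *= trans[states[i - 1]][states[i]] * evidence[i][states[i]]
        total += p
    return total


# Hypothetical parameters: neighbouring ions tend to be both present or both absent.
start = [0.4, 0.6]
trans = [[0.7, 0.3], [0.2, 0.8]]
evidence = [[0.1, 0.9], [0.2, 0.7], [0.8, 0.1], [0.3, 0.6], [0.5, 0.5]]

print(summed_likelihood(evidence, start, trans))
print(brute_force_likelihood(evidence, start, trans))
```

The recursion carries only two running totals from one position to the next, which is why its cost grows linearly with the number of residues while the enumeration doubles with each additional position.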
It will be appreciated that trial sequences used in the method of the invention may be obtained from one or more libraries or databases containing sequences or partial sequences of known peptides and proteins, or may be generated pseudo-randomly in a de-novo sequencing method, as described in our co-pending patent application (GB 9907810.7, filed Apr. 6, 1999). For example, a fragmentation model according to the invention may be used to calculate the likelihood of amino acid sequences comprised in an existing protein or peptide database accounting for an experimentally observed mass spectrum of a peptide. In this way the peptide, and/or the protein from which it is derived, may be identified. Conveniently, in such a method, only sequences or partial sequences having a molecular weight in a given range are selected from the database for input to the fragmentation model.
The method of the invention assigns a likelihood factor to each trial amino acid sequence considered. The most probable amino acid sequences in the database (or pseudo-randomly generated sequences) which would account for the processable spectrum may then be identified as the trial sequences with the highest likelihood factors. However, a more precise method that is particularly appropriate in the case of de novo sequencing, is to use a Bayesian approach. Each trial sequence is assigned a prior probability on the basis of whatever information is known about it, including its relationship to the sample from which the processable spectrum is obtained. For example, in true de novo sequencing the prior probability of a trial sequence may be based on the average natural abundances of the amino acid residues it comprises. In the case of database searches, it may be known, for example, that the sample is derived from a yeast protein, in which case, sequences in the database derived from yeasts may be assigned a higher prior probability.
The probability of a trial sequence accounting for the processable spectrum is then calculated by Bayes' theorem, that is:
Probability (trial sequence AND processable spectrum) = Prior probability (trial sequence) × likelihood factor
In Bayesian terminology, the likelihood factor is:
Probability (processable spectrum GIVEN trial sequence).
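The Bayesian calculation above may be sketched as follows. The joint probability for each trial sequence is the product of its prior probability and its likelihood factor, and dividing by the sum over all trial sequences considered gives a normalized posterior probability. The normalization step, the function name and the numerical values are assumptions made for the purpose of illustration and are not taken from this specification.

```python
def posterior_probabilities(priors, likelihoods):
    """Return the normalized posterior probability of each trial sequence.

    priors[seq]      : prior probability assigned to the trial sequence
    likelihoods[seq] : Probability (processable spectrum GIVEN trial sequence)
    """
    joint = {seq: priors[seq] * likelihoods[seq] for seq in priors}
    evidence = sum(joint.values())
    if evidence == 0.0:
        raise ValueError("no trial sequence accounts for the spectrum")
    return {seq: value / evidence for seq, value in joint.items()}


# Hypothetical trial sequences with hypothetical priors and likelihood factors.
priors = {"SEQ_A": 0.5, "SEQ_B": 0.5}
likelihoods = {"SEQ_A": 0.2, "SEQ_B": 0.6}

ranked = sorted(posterior_probabilities(priors, likelihoods).items(),
                key=lambda item: item[1], reverse=True)
for seq, prob in ranked:
    print(seq, round(prob, 3))
```

With equal priors the ranking follows the likelihood factors alone; raising the prior of one sequence, for example because it is known to derive from the organism under study, shifts the posterior in its favour.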
Although in certain simple cases the processable mass spectrum may simply be the observed mass spectrum, it is generally preferable to convert the observed spectrum into a more suitable form before attempting to sequence the peptide. Preferably, the processable spectrum is obtained by converting multiply-charged ions and isotopic clusters of ions to a single intensity value at the mass-to-charge ratio corresponding to a singly-charged ion of the lowest mass isotope, and calculating an uncertainty value for the actual mass and the probability that a peak at that mass-to-charge ratio has actually been observed. Conveniently, the uncertainty value of a peak may be based on the standard deviation of a Gaussian peak representing the processed peak and the probability that a peak is actually observed may be related to the signal-to-noise ratio of the peak in the observed spectrum. The program "MaxEnt3™" available from Micromass UK Ltd. may be used to produce the processable spectrum from an observed spectrum.
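One part of this conversion, the mapping of a multiply-charged ion to the mass-to-charge ratio of the corresponding singly-charged ion, may be illustrated by the following minimal sketch. It assumes that each charge is carried by an added proton, as is usual in positive-ion electrospray. It does not reproduce the MaxEnt3 program, and it does not address isotopic clusters, uncertainty values or peak probabilities. The function name and example values are hypothetical.

```python
PROTON = 1.00728  # mass of a proton, daltons


def to_singly_charged(mz, charge):
    """Convert the m/z of an ion carrying `charge` protons to the m/z of its singly charged form."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral_mass = charge * (mz - PROTON)  # remove every charge-carrying proton
    return neutral_mass + PROTON           # add one proton back


# Hypothetical doubly charged ion observed at m/z 501.00728.
print(round(to_singly_charged(501.00728, 2), 5))
```

An ion that is already singly charged is returned unchanged, so the same function may be applied uniformly to every peak once its charge state has been assigned.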
In order to carry out the methods of the invention a sample comprising one or more unknown peptides may be introduced into a tandem mass spectrometer and ionized using electrospray ionization. The molecular weights of the unknown peptides may typically be determined by observing the molecular ion groups of peaks in a mass spectrum of the sample. The first analyzer of the tandem mass spectrometer may then be set to transmit the molecular ion group of peaks corresponding to one of the unknown peptides to a collision cell, in which the molecular ions are fragmented by collision with neutral gas molecules. The second mass analyzer of the tandem mass spectrometer may then be used to record an observed fragmentation mass spectrum of the peptide. A processable mass spectrum may then be derived from the observed spectrum using suitable computer software, as explained. If the sample comprises a mixture of peptides, for example as might be produced by a tryptic digest of a protein, further peptides may be analyzed by selecting the appropriate molecular ion group using the first mass analyzer.
Viewed from another aspect the invention provides apparatus for identifying the most likely sequences of amino acids in an unknown peptide, said apparatus comprising a mass spectrometer for generating a mass spectrum of a said unknown peptide and data processing means programmed to:
a) Process data generated by said mass spectrometer to produce a processable mass spectrum; and
b) Calculate the likelihood that any given trial amino-acid sequence would account for said processable spectrum using a fragmentation model which sums probabilistically over a plurality of fragmentation routes which together represent the possible ways that said trial sequence might fragment in accordance with a set of predefined rules, each said fragmentation route being assigned a prior probability appropriate to the chemical processes involved.
In preferred embodiments, apparatus according to the invention comprises a tandem mass spectrometer, and most preferably a tandem mass spectrometer that comprises a Time-of-Flight mass analyzer at least as its final stage. A Time-of-Flight mass analyzer is preferred because it is generally capable of greater mass measurement accuracy than a quadrupole analyzer. Preferably also the mass spectrometer comprises an electrospray ionization source into which an unknown peptide sample may be introduced.