1. Field of the Invention
This invention relates to methods of determining the sequence of amino acids that constitute peptides, polypeptides or proteins by mass spectrometry and especially by tandem mass spectrometry or MS/MS. In particular it relates to methods whereby the sequence can be determined from the mass spectral data alone and which do not require the use of existing libraries of protein sequence information. Methods according to the invention require no information concerning the nature of the peptide other than a library of the amino acid residues that may occur in proteins weighted according to natural abundance.
2. Discussion of the Prior Art
Although several well-established chemical methods for the sequencing of peptides, polypeptides and proteins are known (for example, the Edman degradation), mass spectrometric methods are becoming increasingly important in view of their speed and ease of use. Mass spectrometric methods have been developed to the point at which they are capable of sequencing peptides in a mixture without any prior chemical purification or separation, typically using electrospray ionization and tandem mass spectrometry (MS/MS). For example, see Yates III (J. Mass Spectrom, 1998 vol. 33 pp. 1-19), Papayannopoulos (Mass Spectrom. Rev. 1995, vol. 14 pp. 49-73), and Yates III, McCormack, and Eng (Anal. Chem. 1996 vol. 68 (17) pp. 534A-540A). Thus, in a typical MS/MS sequencing experiment, molecular ions of a particular peptide are selected by the first mass analyzer and fragmented by collisions with neutral gas molecules in a collision cell. The second mass analyzer is then used to record the fragment ion spectrum that generally contains enough information to allow at least a partial, and often the complete, sequence to be determined.
Unfortunately, however, the interpretation of the fragment spectra is not straightforward. Manual interpretation (see, for example, Hunt, Yates III, et al, Proc. Nat. Acad. Sci. USA, 1986, vol. 83 pp 6233-6237 and Papayannopoulos, ibid) requires considerable experience and is time consuming. Consequently, many workers have developed algorithms and computer programs to automate the process, at least in part. The nature of the problem, however, is such that none of those so far developed are able to provide in reasonable time complete sequence information without either requiring some prior knowledge of the chemical structure of the peptide or merely identifying likely candidate sequences in existing protein structure databases. The reason for this will be understood from the following discussion of the nature of the fragment spectra produced.
Typically, the fragment spectrum of a peptide comprises peaks belonging to about half a dozen different ion series, each of which corresponds to a different mode of fragmentation of the peptide parent ion. Each typically (but not invariably) comprises peaks representing the loss of successive amino acid residues from the original peptide ion. Because all but two of the 20 amino acids from which most naturally occurring proteins are comprised have different masses, it is therefore possible to establish the sequence of amino acids from the difference in mass of peaks in any given series which correspond to the successive loss of an amino acid residue from the original peptide. However, difficulties arise in identifying to which series an ion belongs and from a variety of ambiguities that can arise in assigning the peaks, particularly when certain peaks are either missing or unrecognized. Moreover, other peaks are typically present in a spectrum due to various more complicated fragmentation or rearrangement routes, so that direct assignment of ions is fraught with difficulty. Further, electrospray ionization tends to produce multiply charged ions that appear at correspondingly rescaled masses, which further complicates the interpretation of the spectra. Isotopic clusters also lead to proliferation of peaks in the observed spectra. Thus, the direct transformation of a mass spectrum to a sequence is only possible in trivially small peptides.
The reverse route, transforming trial sequences to predicted spectra for comparison with the observed spectrum, should be easier, but has not been fully developed. The number of possible sequences for any peptide (20^n, where n is the number of amino acids comprised in the peptide) is very large, so the difficulty of finding the correct sequence for, say, a peptide of a mere 10 amino acids (20^10 ≈ 10^13 possible sequences) will be appreciated. The number of potential sequences increases very rapidly both with the size of the peptide and with the number (at least 20) of the residues being considered.
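The count quoted above can be checked by simple arithmetic. The following worked figures assume exactly 20 candidate residues at every position and no molecular weight constraint; the 20-residue case is added here only to illustrate the growth rate.

```latex
\[
20^{10} = 2^{10}\times 10^{10} = 1.024\times 10^{13},
\qquad
20^{20} = 2^{20}\times 10^{20} \approx 1.05\times 10^{26}.
\]
```

Doubling the peptide length therefore squares the number of candidate sequences.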
Details of the first computer programs for predicting probable amino acid sequences from mass spectral data appeared in 1984 (Sakurai, Matsuo, Matsuda, Katakuse, Biomed. Mass Spectrom, 1984, vol. 11 (8) pp 397-399). This program (PAAS3) searched through all the amino acid sequences whose molecular weights coincided with that of the peptide being examined and identified the sequences most consistent with the experimentally observed spectra. Hamm, Wilson and Harvan (CABIOS, 1986 vol. 2 (2) pp 115-118) also developed a similar program.
However, as pointed out by Ishikawa and Niwa (Biomed. and Environ. Mass Spectrom. 1986, vol. 13 pp 373-380), this approach is limited to peptides not exceeding 800 daltons in view of the computer time required to carry out the search. Parekh et al in UK patent application 2,325,465 (published November 1998) have resurrected this idea and give an example of the sequencing of a peptide of 1000 daltons which required 2×10^6 possible sequences to be searched, but do not specify the computer time required. Nevertheless, despite the increase in the processing speed of computers between 1984 and 1999, a simple search of all possible sequences for a peptide of molecular weight greater than 1200 daltons is still impractical in a reasonable time using the personal computer typically supplied for data processing with most commercial mass spectrometers.
This problem has long been recognized and many attempts have been made to render the problem more tractable. For example, the MS/MS spectrum may be correlated with amino acid sequences derived from a protein database rather than every possible sequence. Such methods are taught in PCT patent application 95/25281, by Taylor and Johnson (Rapid Commun. in Mass Spectrom. 1997 vol. 11 pp 1067-1075), by Eng, McCormack and Yates (J. Am. Soc. Mass Spectrom. 1994 vol. 5 pp 976-989), by Figeys, Lock et al. (Rapid Commun. in Mass Spectrom. 1998 vol. 12 pp 1435-1444), and by Mortz, O'Connor et al (Proc. Nat. Acad. Sci. USA 1996 vol. 93 pp 8264-8267). Alternatively, MS/MS experiments can be carried out on both the original peptide and a derivative of it, and the results from both experiments combined to establish at least a partial sequence without reference to a database. (See, for example, the isotopic labeling method taught by Shevchenko, Chernushevich et al in Rapid Commun. in Mass Spectrom, 1997 vol. 11 pp 1015-24, the esterification method taught by Yates III, Griffin and Hood in Techniques in Protein Chem. II, ch 46 (1991) pp 477-485, and the H2/D2 exchange method taught by Septov, Issakova et al in Rapid Commun. in Mass Spectrom. 1993 vol. 7 pp 58-62.) Johnson and Walsh (Protein Science, 1992 vol. 1 pp 1083-1091) teach a similar method, combining Edman degradation data and MS/MS data.
Of the prior programs which attempt to predict sequence information using only MS/MS data and without reference to existing databases, a variety of methods have been suggested to facilitate the prediction of sequence information. Siegel and Bauman (Biomed. Environ. Mass Spectrom. 1988 vol. 15 pp 333-343) describe an algorithm which builds up the sequence information stepwise from the mass difference between neighbouring ions in ion series recognized in the spectrum, but good results were obtained only with peptides of a few amino acids. Zidarov, Thibault et al. (Biomed. and Environ. Mass Spectrom, 1990 vol. 19 pp 13-26) proposed an algorithm which first attempted to derive the amino acid composition of the peptide from molecular weight and isotopic ratio data, and subsequently to sequence the peptide using a stepwise approach considering all possible sequences for the amino acids so identified. The program SEQPEP (Johnson and Biemann, Biomed. and Environ. Mass Spectrom. 1989 vol. 18 pp 945-957) identified short sub-sequences of amino acids in a peptide and then extended the sequence outwards from the ends of the sequence, attempting to correlate other peaks in the spectra with more amino acid residues, until the molecular weight of the peptide was reached. Bartels (Biomed. and Environ. Mass Spectrom, 1990 vol. 19 pp 363-368) recognized this search strategy as a problem in graph theory, and the method was further developed by Fernandez-de-Cossio et al (CABIOS, 1995 vol. 11 (4) pp 427-434). These methods calculated a score for trial sequences based on the number of peaks in the experimental spectrum that they fit. Unfortunately, peptides fragment in idiosyncratic fashion, and global scores such as theirs do not perform well. Hines, Falick, et al (J. Am. Soc. Mass Spectrom 1992 vol. 3 pp 326-336) have described a sequencing program which uses pattern recognition techniques to identify groups of peaks in an observed spectrum and subsequently to predict the amino acid sequence. Delgada and Pulfer (J. Chem. Inf. Computer Sci. 1993 vol. 33 pp 332-337) describe a similar pattern recognition algorithm which uses learning machine techniques, also applied to observed spectra. Scarberry, Zhang and Knapp (J. Am. Soc. Mass Spectrom, 1995 vol. 6 pp 936-946) report the application of artificial neural networks to classify the peaks in observed peptide MS/MS spectra followed by sequence determination of the series of peaks so identified.
The following difficulties are inherent in these prior sequencing algorithms. Those that are limited to searching existing databases to identify a peptide or protein will clearly fail if the sequence is in fact unknown at the time. Those that attempt to sequence in a stepwise manner will fail if the spectrum does not contain a significant peak at a mass corresponding to a particular amino acid loss, and the likelihood of this increases rapidly as the number of amino acids comprised in the peptide increases. Those that require the analysis of derivatives of the peptides to resolve ambiguities are clearly less desirable than those which purport to provide the sequence without such derivatives. Those that eliminate groups of possible sequences early on in the sequencing process on the basis of a single test in order to rapidly reduce the number of possibilities to a more manageable level frequently fail to suggest even a low probability for the correct sequence because it has been incorrectly eliminated due to failure of that test. This may arise due to an incorrect assignment of a peak to a series, a smaller than expected peak intensity, or slightly inaccurate mass measurement. Those that require additional information, such as a partial sequence, will fail if that information is in fact incorrect or unavailable. Those that attempt to recognize patterns in the observed data are heavily dependent on a precise understanding of the fragmentation mechanisms which determine the nature of the spectrum, and the complexity of the processes involved is such that universally applicable rules cannot at present be formulated. Thus, the resurrection in GB 2,325,465 of the “de-novo” approach of Sakurai et al, Ishikawa, et al and Hamm et al (ibid.) whereby all possible sequences are compared with the observed data without eliminating any possibilities or relying on a machine interpretation of chemical rules is clearly desirable. However, GB 2,325,465 does not advance the art in practice and merely restates the earlier techniques.
Thus, there is no prior teaching of a “de-novo” peptide sequencing method for MS/MS spectra which is capable of handling the data from peptides of more than about ten amino acids. Full searches take too long on the computer typically used to process data generated by the mass spectrometer used to obtain the MS/MS data.
It is an object of the present invention to provide a method of sequencing a peptide either individually or comprised in a mixture of peptides, by tandem mass spectrometry without the use of any additional data concerning the nature of the peptide and without any limit to the number of possible sequences considered. It is a further object to provide such a method which can be implemented on a personal computer typically used for data acquisition on the tandem mass spectrometer, even in the case of peptides comprising 10 or more amino acids. It is another object to provide such a method which does not rely on exhaustive comparison of the spectra predicted from every possible amino acid sequence consistent with any molecular weight constraint, but instead uses mathematical techniques to simulate the effect of such a complete search without actually carrying it out.
In accordance with these objectives the invention provides a method of identifying the most likely amino acid sequences which would account for a mass spectrum obtained from a peptide of unknown sequence, said method comprising the steps of:
a) Producing a processable mass spectrum from said peptide;
b) Choosing a limited number of trial amino-acid sequences which are consistent with a prior probability distribution;
c) Iteratively modifying said trial sequences through a terminated Markov Chain Monte Carlo (MCMC) algorithm to generate new trial sequences, using at each stage modifications which lie within said prior probability distribution, calculating the probability of each of said trial sequences accounting for said processable mass spectrum, and accepting or rejecting each of said trial sequences according to said probability and the mathematical principle of detailed balance.
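The accept-or-reject decision in step c) can be illustrated with a Metropolis-style rule. This is a minimal sketch, not the claimed implementation: it assumes symmetric proposals (a modification and its reverse are equally likely to be proposed), and the function name `metropolis_accept` is hypothetical.

```python
import math
import random


def metropolis_accept(log_p_current, log_p_proposed, rng):
    """Return True if the proposed trial sequence should replace the current one.

    Assumes a symmetric proposal, in which case accepting with probability
    min(1, p_proposed / p_current) satisfies detailed balance.
    """
    if log_p_proposed >= log_p_current:
        return True
    return rng.random() < math.exp(log_p_proposed - log_p_current)


rng = random.Random(1)
# A proposal half as probable as the current sequence is accepted
# roughly half of the time.
accepted = sum(metropolis_accept(0.0, math.log(0.5), rng) for _ in range(10_000))
acceptance_rate = accepted / 10_000
```

Working with log probabilities avoids numerical underflow when the probabilities of long sequences become very small.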
In preferred methods, the probability of a particular trial sequence accounting for the processable mass spectrum is estimated using Bayes' theorem. A prior probability is assigned to the sequence and is multiplied by a likelihood factor that reflects the degree of agreement between a spectrum predicted for that sequence and the processable mass spectrum. This process is represented by the equation
Probability (trial sequence AND processable spectrum) = Prior (trial sequence) × Probability (processable spectrum GIVEN trial sequence)
Conveniently, the term
Prior (trial sequence)
may be determined from the natural (or other) abundance of each of the amino acid residues comprised in the trial sequence. The term
Probability (processable spectrum GIVEN trial sequence)
is the likelihood factor and may be determined using a fragmentation model that sums probabilistically over all the ways in which a trial sequence might fragment and give rise to peaks in the processable mass spectrum.
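The product of prior and likelihood described above may be sketched in log space as follows. The abundance table is a hypothetical three-residue placeholder, not measured data, and the likelihood value is simply passed in; treating residues as independent draws is an assumption made for this illustration.

```python
import math

# Hypothetical relative abundances for a toy three-residue library
# (placeholders only); a real library would cover at least the 20
# common residues.
TOY_ABUNDANCE = {"G": 0.5, "A": 0.3, "K": 0.2}


def log_prior(sequence, abundance):
    """Sum of log abundances, treating residues as independent draws."""
    return sum(math.log(abundance[residue]) for residue in sequence)


def log_joint(sequence, abundance, log_likelihood):
    """log of Prior(sequence) x Probability(spectrum GIVEN sequence)."""
    return log_prior(sequence, abundance) + log_likelihood


# The log-likelihood here is an arbitrary illustrative number.
score = log_joint("GAK", TOY_ABUNDANCE, log_likelihood=-4.2)
```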
In one preferred embodiment, the limited number of trial amino-acid sequences chosen in step b) may comprise about 100 members chosen pseudo-randomly from the prior probability distribution. This distribution may comprise sequences based on a library of the 20 most common amino acid residues, but it is within the scope of the invention to include less common or presently unknown residues. The distribution embodies rough preliminary information about the nature of the unknown peptide sample, but its determination may require only minimal information about the sample. For example, it may be sufficient that trial sequences chosen from it are chemically plausible and not of such length that they obviously could not represent the sample. The amino acid composition of the sample, if known, may also suffice. In preferred methods, however, the distribution may be constrained by the approximate molecular weight of the sample, for example within ±5 daltons, or most preferably within ±0.5 daltons if it is known sufficiently accurately. In general, the more constraints that can be placed on the prior probability distribution, the faster will be the computation and the more tightly constrained will be the most probable sequences for the unknown peptide.
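One crude way to obtain about 100 starting sequences that respect a molecular weight window is build-and-reject sampling, sketched below. This is an illustration under stated assumptions: residues are drawn uniformly rather than by abundance, masses are monoisotopic residue masses rounded to two decimals, and the names `peptide_mass` and `draw_trial_sequence` are hypothetical.

```python
import random

# Monoisotopic residue masses in daltons, rounded to two decimals.
RESIDUE_MASS = {
    "G": 57.02, "A": 71.04, "S": 87.03, "P": 97.05, "V": 99.07,
    "T": 101.05, "C": 103.01, "L": 113.08, "I": 113.08, "N": 114.04,
    "D": 115.03, "Q": 128.06, "K": 128.09, "E": 129.04, "M": 131.04,
    "H": 137.06, "F": 147.07, "R": 156.10, "Y": 163.06, "W": 186.08,
}
WATER = 18.01  # added once per peptide for the terminal H and OH


def peptide_mass(sequence):
    """Residue masses plus one water."""
    return sum(RESIDUE_MASS[r] for r in sequence) + WATER


def draw_trial_sequence(target_mass, tolerance, rng, max_attempts=100_000):
    """Grow random sequences until one lands inside the mass window."""
    residues = sorted(RESIDUE_MASS)
    for _ in range(max_attempts):
        sequence, mass = [], WATER
        while mass < target_mass - tolerance:
            residue = rng.choice(residues)
            sequence.append(residue)
            mass += RESIDUE_MASS[residue]
        if mass <= target_mass + tolerance:
            return "".join(sequence)
    raise RuntimeError("no trial sequence found within the mass window")


rng = random.Random(0)
trial_sequences = [draw_trial_sequence(1000.0, 5.0, rng) for _ in range(100)]
```

A narrower window such as ±0.5 daltons rejects far more candidates per accepted sequence, so a production implementation would likely need a more directed construction.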
It will be understood that in the initial stages of the process the trial sequences may bear little resemblance to the actual sequence of the unknown peptide. In order to ensure a gentle convergence to the most probable sequences, in further preferred methods the contribution of the likelihood factor to the probability score may be controlled by simulated annealing. Typically, the likelihood factor may be raised to a fractional power which is initially zero and is gradually increased as the algorithm progresses so that the experimental data is given gradually increasing significance.
A further advantage in the use of simulated annealing is that the algorithm employed can indicate when a sufficient number of trial sequences have been tested, so that the generation of trial sequences may be terminated automatically. The simulated annealing algorithm may itself, on the basis of the probabilities assigned to previously tested sequences, determine the fractional power to be currently applied to the likelihood factors of the current trial sequences. Thus in further preferred embodiments of the invention the generation and testing of new trial sequences is continued until the simulated annealing algorithm sets to the correct value (unity) the power to which the likelihood factors are raised.
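The effect of raising the likelihood factor to a fractional power can be sketched as below. The linear schedule is an assumption made purely for illustration, since the text describes the annealing algorithm choosing the power itself from earlier results; the function name `tempered_log_score` and the numerical values are hypothetical.

```python
def tempered_log_score(log_prior, log_likelihood, power):
    """log of Prior x Likelihood**power, for an annealing power in [0, 1]."""
    if not 0.0 <= power <= 1.0:
        raise ValueError("annealing power must lie in [0, 1]")
    return log_prior + power * log_likelihood


# Fixed linear schedule used only for illustration.
schedule = [step / 10 for step in range(11)]
scores = [tempered_log_score(-12.0, -30.0, power) for power in schedule]
```

At power zero the score is the prior alone, so all chemically plausible sequences compete equally; at unity the experimental data carry their full weight and generation of trial sequences may stop.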
According to the invention, a Markov Chain Monte Carlo algorithm generates new trial amino-acid sequences. Use of such an algorithm allows the most probable sequences to be identified without the need to test every possible sequence of amino acids that might, for example, account for the observed molecular weight range of the unknown peptide. In order to achieve maximum efficiency, the changes made to the trial sequences should preferably be made in a chemically meaningful manner, rather than purely randomly. Thus, in further preferred embodiments of the invention, the Markov Chain Monte Carlo algorithm may modify a trial sequence in at least some, and preferably all, of the following ways:
a) Reversing a contiguous sub-sequence with randomly chosen end points, for example a sequence . . . ARQEIK . . . may be changed to . . . KIEQRA . . .
b) Cycling a contiguous sub-sequence with randomly chosen end points, for example . . . ARQEIK . . . may be changed to . . . QEIKAR . . .
c) Permuting a contiguous sub-sequence with randomly chosen end points, for example a sequence . . . ARQEIK . . . may be changed to . . . IQRKAE . . .
d) Replacing a contiguous sub-sequence with randomly chosen end points with another sub-sequence of approximately the same nominal mass, for example . . . NEQ . . . may be replaced by . . . EKGG . . .
e) Exchanging the C-terminus and N-terminus ends of two sequences to preserve nominal mass, for example the sequences EKGG-DQCYKR and NEH-YKDQCR may be changed to NEH-DQCYKR and EKGG-YKDQCR.
It will be appreciated that this list of possible mutations is not exclusive and many others may be included in the Markov Chain Monte Carlo algorithm. However, to minimize the danger of the algorithm failing to explore all the regions of high probability of the trial sequences accounting for the processable mass spectrum, it is desirable that at least one “genetic algorithm”, as exemplified by the mutation e) above, is included. In accordance with the Markov Chain Monte Carlo method, the choice of which mutations to make to a particular sequence may be determined by a pseudo-random number generator.
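Mutations a), b), c) and e) listed above can be sketched as simple string operations. The function names are hypothetical, sequences are written in one-letter codes without the hyphens used in example e), and mutation d) is omitted because it requires a table of residue masses.

```python
import random


def reverse_segment(seq, i, j):
    """Mutation a): reverse the sub-sequence seq[i:j]."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def cycle_segment(seq, i, j, shift):
    """Mutation b): rotate seq[i:j] left by `shift` positions."""
    segment = seq[i:j]
    if not segment:
        return seq
    shift %= len(segment)
    return seq[:i] + segment[shift:] + segment[:shift] + seq[j:]


def permute_segment(seq, i, j, rng):
    """Mutation c): shuffle seq[i:j] with the supplied random generator."""
    segment = list(seq[i:j])
    rng.shuffle(segment)
    return seq[:i] + "".join(segment) + seq[j:]


def crossover(seq_a, cut_a, seq_b, cut_b):
    """Mutation e): swap the N-terminal parts of two sequences at the cuts."""
    return seq_b[:cut_b] + seq_a[cut_a:], seq_a[:cut_a] + seq_b[cut_b:]


def random_end_points(length, rng):
    """Pick 0 <= i < j <= length so that seq[i:j] is non-empty."""
    i, j = sorted(rng.sample(range(length + 1), 2))
    return i, j
```

For example, `crossover("EKGGDQCYKR", 4, "NEHYKDQCR", 3)` reproduces the pair given in e). Note that this sketch does not itself check that the two exchanged parts have matching nominal mass.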
In still further preferred methods, a novel fragmentation model, which sums probabilistically over all the ways in which a trial sequence might fragment to give rise to peaks in the processable mass spectrum, is employed. Such a model may be based on the production of at least two series of ions, the b series (which comprises ions representing the N-terminal residue of the trial sequence and the loss of C-terminal amino acid residues), and the y″ series (which comprises ions representing the C-terminal residue and the loss of N-terminal amino acid residues). Each family of ions behaves as a coherent series, with neighbouring ions likely to be either both present or both absent. This behaviour may be described by a Markov chain, in which the probability of an ion being observed is influenced by whether or not its predecessor was observed. The parameters of the chain may be adjusted to take account of the proton affinities of the residues and their physical bond strengths. The fragmentation model may be refined by including other ion series, particularly the a series (b ions which have lost CO), the z″ series (y″ ions which have lost NH3), and the more general loss of NH3 or H2O, again taking account of the probability of the chemical processes involved. Immonium ions equivalent to the loss of CO and H from the various amino acid residues may also be included. Further, the fragmentation model may comprise the generation of sub-sequences of amino acids, that is, sequences that begin and end at amino acid residues internal to the unknown peptide. It will be appreciated that the more realistic the fragmentation model is, the better will be the accuracy and speed of the computation of the most probable sequences. It is therefore envisaged that different fragmentation models may be employed if advances are made in understanding the chemical mechanism by which the mass spectrum of the peptide is produced.
Using Markov chains to model the fragmentation process allows the sum over all the possible fragmentation patterns to be calculated in linear time (i.e., in a time proportional to the number of amino acid residues in the peptide) rather than in a time proportional to the exponentially large number of fragmentation patterns themselves. This allows the time taken for the prediction of the most probable sequences to be reduced to a practical value (that is, a minute or so), even for peptides of 10 or more amino acids, using a typical personal computer. However, it will be appreciated that the invention is not limited to the particular fragmentation model described above, but includes any probabilistic fragmentation model that can be integrated computationally in polynomial time. The result of applying such a model is a probabilistic likelihood factor
Probability(processable spectrum GIVEN trial sequence)
that can be used in the Markov Chain Monte Carlo algorithm.
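The linear-time summation can be illustrated for a single ion series with a two-state Markov chain (ion absent or present at each cleavage site), evaluated by the standard forward recursion. This is a sketch under simplifying assumptions: one series only, fixed transition probabilities, and per-site emission probabilities supplied by the caller; all names and numerical values are hypothetical. A brute-force enumeration is included only to show that both routes give the same sum.

```python
from itertools import product


def forward_sum(emit_absent, emit_present, p_first, p_stay_present, p_stay_absent):
    """Sum over all presence/absence patterns in time linear in the sites.

    emit_absent[k] / emit_present[k]: probability of the data at site k if the
    ion is absent / present. Lists must be non-empty and of equal length.
    """
    f_absent = (1 - p_first) * emit_absent[0]
    f_present = p_first * emit_present[0]
    for e_abs, e_pre in zip(emit_absent[1:], emit_present[1:]):
        new_absent = (f_absent * p_stay_absent + f_present * (1 - p_stay_present)) * e_abs
        new_present = (f_absent * (1 - p_stay_absent) + f_present * p_stay_present) * e_pre
        f_absent, f_present = new_absent, new_present
    return f_absent + f_present


def brute_force_sum(emit_absent, emit_present, p_first, p_stay_present, p_stay_absent):
    """Same sum by explicit enumeration of all 2**n patterns (exponential time)."""
    n = len(emit_absent)
    total = 0.0
    for pattern in product((0, 1), repeat=n):
        p = p_first if pattern[0] else 1 - p_first
        p *= emit_present[0] if pattern[0] else emit_absent[0]
        for k in range(1, n):
            if pattern[k - 1]:
                trans = p_stay_present if pattern[k] else 1 - p_stay_present
            else:
                trans = 1 - p_stay_absent if pattern[k] else p_stay_absent
            p *= trans * (emit_present[k] if pattern[k] else emit_absent[k])
        total += p
    return total


# Illustrative numbers only.
absent = [0.9, 0.2, 0.1, 0.8]
present = [0.1, 0.7, 0.9, 0.3]
fast = forward_sum(absent, present, 0.5, 0.8, 0.7)
slow = brute_force_sum(absent, present, 0.5, 0.8, 0.7)
```

The recursion keeps only two running totals per site, which is why the cost grows linearly with peptide length while the enumeration doubles with every added site.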
Although in certain simple cases the processable mass spectrum may simply be the observed mass spectrum, it is generally preferable to convert the observed spectrum into a more suitable form before attempting to sequence the peptide. Preferably, the processable spectrum is obtained by converting multiply-charged ions and isotopic clusters of ions to a single intensity value at the mass-to-charge ratio corresponding to a singly-charged ion of the lowest mass isotope, and calculating an uncertainty value for the actual mass and the probability that a peak at that mass-to-charge ratio has actually been observed. Conveniently, the uncertainty value of a peak may be based on the standard deviation of a Gaussian peak representing the processed peak and the probability that a peak is actually observed may be related to the signal-to-noise ratio of the peak in the observed spectrum. The program “MaxEnt3™” available from Micromass UK Ltd. may be used to produce the processable spectrum from an observed spectrum.
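The charge-rescaling part of this conversion follows from simple arithmetic and can be sketched as below. The sketch assumes protonated ions of known charge and covers only the rescaling step; it does not collapse isotopic clusters or estimate uncertainties, and it is not a description of how MaxEnt3 works. The function name `to_singly_charged` is hypothetical.

```python
PROTON_MASS = 1.007276  # daltons


def to_singly_charged(mz, charge):
    """m/z at which the same species would appear carrying a single proton.

    Assumes the ion is M plus `charge` protons, so the neutral mass is
    charge * mz - charge * PROTON_MASS and one proton is then added back.
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * mz - (charge - 1) * PROTON_MASS


# A doubly charged ion observed at m/z 500.5.
doubly = to_singly_charged(500.5, 2)
```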
It will be appreciated that a fragmentation model as described may be used to calculate the probability of any trial sequence of amino acids accounting for a given mass spectrum, irrespective of how that trial sequence has been derived. Viewed from another aspect, therefore, the invention comprises a method of calculating the probability that an experimentally determined mass spectrum of a peptide or a similar molecule may be accounted for by a given sequence of amino acids by the use of a fragmentation model which sums probabilistically over all the ways that said given sequence might fragment. Preferably, the fragmentation model may model the fragmentation of the sequence by means of Markov chains in the manner described above. Also preferably, the experimentally determined mass spectrum is a processable spectrum, obtained in the manner described above. For example, a fragmentation model according to the invention may be used to calculate the probability of amino acid sequences comprised in an existing protein or peptide database accounting for an experimentally observed mass spectrum of a peptide. In this way the peptide, and/or the protein from which it is derived, may be identified. Conveniently, in such a method, only sequences or partial sequences having a molecular weight in a given range are selected from the database for input to the fragmentation model.
In order to carry out the methods of the invention a sample comprising one or more unknown peptides may be introduced into a tandem mass spectrometer and ionized using electrospray ionization. The molecular weights of the unknown peptides may typically be determined by observing the molecular ion groups of peaks in a mass spectrum of the sample. The first analyzer of the tandem mass spectrometer may then be set to transmit the molecular ion group of peaks corresponding to one of the unknown peptides to a collision cell, in which the molecular ions are fragmented by collision with neutral gas molecules. The second mass analyzer of the tandem mass spectrometer may then be used to record an observed fragmentation mass spectrum of the peptide. A processable mass spectrum may then be derived from the observed spectrum using suitable computer software, as explained. If the sample comprises a mixture of peptides, for example as might be produced by a tryptic digest of a protein, further peptides may be analyzed by selecting the appropriate molecular ion group using the first mass analyzer.
Viewed from another aspect the invention provides apparatus for identifying the most likely sequences of amino acids in an unknown peptide, said apparatus comprising a mass spectrometer for generating a mass spectrum of a said unknown peptide and data processing means programmed to:
a) Process data generated by said mass spectrometer to produce a processable mass spectrum;
b) Choose a limited number of trial amino acid sequences that are consistent with a prior probability distribution;
c) Iteratively modify said trial sequences through a terminated Markov Chain Monte Carlo algorithm to generate further trial sequences which are consistent with said prior probability distribution, to calculate the probability of each of said trial sequences accounting for said processable mass spectrum and to accept or reject each of said trial sequences according to said probability and the mathematical principle of detailed balance.
In preferred embodiments, apparatus according to the invention comprises a tandem mass spectrometer, and most preferably a tandem mass spectrometer that comprises a Time-of-Flight mass analyzer at least as its final stage. A Time-of-Flight mass analyzer is preferred because it is generally capable of greater mass measurement accuracy than a quadrupole analyzer. Preferably also the mass spectrometer comprises an electrospray ionization source into which an unknown peptide sample may be introduced.