This application claims the priority benefit under 35 U.S.C. § 119(a)-(d) of Great Britain Patent Application No. GB 9710582.9, filed on May 22, 1997.
The invention relates to a method for the determination of the precise linear sequence of amino acids in a peptide, polypeptide, or protein without recourse or reference to either a known pre-defined data base or to sequential amino acid residue analysis. As such, the method of the invention is a true, de novo peptide sequence determination method.
The composition of a peptide, polypeptide, or protein as a sequence of amino acids is well understood. Each peptide, polypeptide, and protein is uniquely defined by a precise linear sequence of amino acids. Knowledge of the precise linear arrangement or sequence of amino acids in a peptide, polypeptide, or protein is required for various purposes, including DNA cloning in which the sequence of amino acids provides information required for oligonucleotide probes and polymerase chain reaction (“PCR”) primers. Knowledge of the exact sequence also allows the synthesis of peptides for antibody production, provides identification of peptides, polypeptides, and proteins, aids in the characterization of recombinant products, and is useful in the study of post-translational modifications.
A variety of sequencing methods are available to obtain the amino acid sequence information. For example, a series of chemical reactions, e.g., Edman reactions, or enzymatic reactions, e.g., exo-peptidase reactions, are used to prepare sequential fragments of the unknown peptide. Either an analysis of the sequential fragments or a sequential analysis of the removed amino acids is used to determine the linear amino acid sequence of the unknown peptide. Typically, the Edman degradation chemistry is used in modern automated protein sequencers.
In the Edman degradation, a peptide, polypeptide, or protein is sequenced by degradation from the N-terminus using the Edman reagent, phenylisothiocyanate (“PITC”). The degradation process involves three steps, i.e., coupling, cleavage, and conversion. In the coupling step, PITC modifies the N-terminal residue of the peptide, polypeptide, or protein. An acid cleavage then cleaves the N-terminal amino acid in the form of an unstable anilinothiazolinone (“ATZ”) derivative, and leaves the peptide, polypeptide, or protein with a reactive N-terminus and shortened by one amino acid. The ATZ derivative is converted to a stable phenylthiohydantoin in the conversion step for identification, typically with reverse phase high performance liquid chromatography (“RP-HPLC”). The shortened peptide, polypeptide, or protein is left with a free N-terminus that can undergo another cycle of the degradation reaction. Repetition of the cycle results in the sequential identification of each amino acid in the peptide, polypeptide, or protein. Because of the sequential nature of amino acid release, only one molecular substance can be sequenced at a time. Therefore, peptide, polypeptide, or protein samples must be extremely pure for accurate and efficient sequencing. Typically, samples must be purified with HPLC or SDS-PAGE techniques.
Although many peptide, polypeptide, and protein sequences have been determined by Edman degradation, currently, most peptide, polypeptide, and protein sequences are deduced from DNA sequences determined from the corresponding gene or cDNA. However, the determination of a protein sequence using a DNA sequencing technique requires knowledge of the specific nucleotide sequence used to synthesize the protein. DNA sequencing cannot be used where the nature of the protein or the specific DNA sequence used to synthesize the protein is unknown.
A peptide, polypeptide, or protein sequence may also be determined from experimental fragmentation spectra of the unknown peptide, polypeptide, or protein, typically obtained using activation or collision-induced fragmentation in a mass spectrometer. Tandem mass spectrometry (“MS/MS”) techniques have been particularly useful. In MS/MS, a peptide is first purified, and then injected into a first mass spectrometer. This first mass spectrometer serves as a selection device, and selects a target peptide of a particular molecular mass from a mixture of peptides and polypeptides or proteins, and eliminates most contaminants from the analysis. The target molecule is then activated or fragmented, yielding a mixture of lower-mass peptides that are fragments of the target, or parent, peptide. The mixture is then analyzed in a second mass spectrometer (i.e., a second mass spectrometry step), generating a fragment spectrum.
Typically, in the past, the analysis of fragmentation spectra to determine peptide sequences has involved hypothesizing one or more amino acid sequences based on the fragmentation spectrum. In certain favorable cases, an expert researcher can interpret the fragmentation spectra to determine the linear amino acid sequence of an unknown peptide. The candidate sequences may then be compared with known amino acid sequences in protein sequence libraries.
In one strategy, the mass of each amino acid is subtracted from the molecular mass of the parent peptide to determine the possible molecular mass of a fragment, assuming that each amino acid is in a terminal position. The experimental fragment spectrum is then examined to determine if a fragment with such a mass is present. A score is generated for each amino acid, and the scores are sorted to generate a list of partial sequences for the next subtraction cycle. The subtraction cycle is repeated until subtraction of the mass of an amino acid leaves a difference of between −0.5 and 0.5, resulting in one or more candidate amino acid sequences. The highest scoring candidate sequences are then compared to sequences in a library of known protein sequences in an attempt to identify a protein having a sub-sequence similar or identical to the candidate sequence that generated the fragment spectrum.
Although useful in certain contexts, there are difficulties related to hypothesizing candidate amino acid sequences based on fragmentation spectra. The interpretation of fragmentation spectra is time consuming, can generally be performed only in a few laboratories that have extensive experience with mass spectrometry, and is highly technical and often inaccurate. Human interpretation is relatively slow, and may be highly subjective. Moreover, methods based on peptide mass mapping are limited to peptide masses derived from an intact homogeneous peptide, polypeptide, or protein generated by specific, known proteolytic cleavage, and, thus, are not applicable in general to a mixture of peptides, polypeptides, or proteins.
U.S. Pat. No. 5,538,897 to Yates, III et al. provides a method of correlating the fragmentation spectrum of an unknown peptide with theoretical spectra calculated from described peptide sequences stored in a database to match the amino acid sequence of the unknown peptide to that of a described peptide. Known amino acid sequences, e.g., in a protein sequence library, are used to calculate or predict one or more candidate fragment spectra. The predicted fragment spectra are then compared with the experimentally-obtained fragment spectrum of the unknown protein to determine the best match or matches. Preferably, the mass of the unknown peptide is known. Sub-sequences of the various sequences in the protein sequence library are analyzed to identify those sub-sequences corresponding to a peptide having a mass equal to or within a given tolerance of the mass of the parent peptide in the fragmentation spectrum. For each sub-sequence having the proper mass, a predicted fragment spectrum can be calculated by calculating masses of various amino acid subsets of the candidate peptide. As a result, a plurality of candidate peptides, each having predicted fragment spectrum, is obtained. The predicted fragment spectra are then compared with the fragment spectrum obtained experimentally for the unknown protein to identify one or more proteins having sub-sequences that are likely to be identical to the sequence of peptides that resulted in the experimentally-derived fragment spectrum. However, this technique cannot be used to derive the sequence of unknown, novel proteins or peptides having no sequence or sub-sequence identity with those pre-described or contained in such databases, and, thus, is not a de novo sequencing method.
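The sub-sequence selection step of such a database method can be illustrated with a minimal sketch. The function name `matching_subsequences`, the 0.5 Da tolerance, and the example protein string are hypothetical and chosen only for illustration; the residue masses are standard monoisotopic values, and the sketch is not taken from the cited patent.

```python
# Standard monoisotopic residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one water is retained by the intact (neutral) peptide


def matching_subsequences(protein, parent_mass, tol=0.5):
    """Return contiguous sub-sequences of `protein` whose neutral mass
    lies within `tol` Da of `parent_mass` (tolerance is an assumption)."""
    hits = []
    for start in range(len(protein)):
        mass = WATER
        for end in range(start, len(protein)):
            mass += RESIDUE_MASS[protein[end]]
            if mass > parent_mass + tol:
                break  # residue masses are positive, so longer windows only get heavier
            if abs(mass - parent_mass) <= tol:
                hits.append(protein[start:end + 1])
    return hits


# Hypothetical library entry and a parent mass equal to that of Gly-Ala-Ser.
print(matching_subsequences("MKGASLW", 233.101165))
```

A sequence absent from the library can never appear among the hits, which is the limitation noted above.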
Therefore, there remains a need for a true de novo sequencing method of determining the amino acid sequence of a peptide using mass spectrometry.
The present invention is directed to a method for generating a library of peptides, wherein each peptide in the library has a molecular mass corresponding to the same predetermined molecular mass. Typically, the library of peptides is then used to determine the amino acid sequence of an unknown peptide having the predetermined molecular mass. Preferably, the predetermined molecular mass used to generate the library is the molecular mass of the unknown peptide. Most preferably, the molecular mass of the unknown peptide is determined prior to the generation of the library using a mass spectrometer, such as a time-of-flight mass spectrometer.
The library is synthetic, i.e., not pre-described, and is typically generated each time a peptide is analyzed, based on the predetermined molecular mass of the unknown peptide. The library is generated by defining a set of all allowed combinations of amino acids that can be present in the unknown peptide, where the molecular mass of each combination corresponds to the predetermined molecular mass within the experimental accuracy of the device used to determine the molecular mass, allowing for water lost in peptide bond formation and for protonation, and generating an allowed library of all possible permutations of the linear sequence of amino acids in each combination in the set.
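The mass bookkeeping referred to above (water lost in peptide bond formation, and protonation) can be sketched as follows. This is an illustrative sketch only: the helper names `peptide_mass` and `protonated_mz` are hypothetical, and the table holds standard monoisotopic residue masses, i.e., the mass of each amino acid less one water.

```python
# Standard monoisotopic residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # the one water not lost on forming the peptide bonds
PROTON = 1.007276   # mass added per protonation


def peptide_mass(sequence):
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def protonated_mz(sequence, charge=1):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


print(peptide_mass("GA"), protonated_mz("GA"))
```

Because the mass depends only on the composition, every permutation of one combination has the same mass, which is why the library is built by first finding combinations and only then permuting them.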
Generally, the present invention is directed to a method for determining the amino acid sequence of an unknown peptide, which comprises determining a molecular mass and an experimental fragmentation spectrum for the unknown peptide, comparing the experimental fragmentation spectrum of the unknown peptide to theoretical fragmentation spectra calculated for each individual member of an allowed synthetic peptide library, where the allowed peptide library is of the type described above, and identifying a peptide in the peptide library having a theoretical fragmentation spectrum that matches most closely the fragmentation spectrum of the unknown peptide, from which it is inferred that the amino acid sequence of the identified peptide in the allowed library represents the amino acid sequence of the unknown peptide.
The molecular mass for the unknown peptide may be determined by any means known in the art, but is preferably determined with a mass spectrometer. Allowed combinations of amino acids are chosen from a set of allowed amino acids that typically comprises the natural amino acids, i.e., tryptophan, arginine, histidine, glutamic acid, glutamine, aspartic acid, leucine, threonine, proline, alanine, tyrosine, phenylalanine, methionine, lysine, asparagine, isoleucine, cysteine, valine, serine, and glycine, but may also include other amino acids, including, but not limited to, non-natural amino acids and chemically modified derivatives of the natural amino acids, e.g., carbamidocysteine and deoxymethionine. Allowed combinations of amino acids are then calculated using one or more individual members of this set of amino acids, allowing for known mass changes associated with peptide bond formation, such that the total mass of each allowed combination corresponds to the predetermined mass of the unknown peptide to within the experimental accuracy to which this molecular mass of the unknown peptide was calculated, typically about 30 ppm. The set of allowed combinations is most easily calculated using an appropriately programmed computer. The allowed peptide library is assembled by permutation in all possible linear combinations of each allowed amino acid composition, and is also most easily constructed using an appropriately programmed computer. It should be noted that the term “allowed” with respect to amino acid combinations and libraries of peptides refers to combinations and libraries specific to the unknown peptide under investigation. The peptide library is constructed from the amino acid combinations, which in turn are calculated from the experimentally determined molecular mass.
As unknown peptides of different mass are investigated, so different combinations of amino acids are allowed, and hence each unknown peptide of unique molecular mass gives rise to a unique peptide library.
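One way an appropriately programmed computer might carry out the two steps, finding the allowed combinations and then permuting them, is sketched below. The function names, the recursive search, and the five-residue alphabet are assumptions made to keep the example small; a real set of allowed amino acids would be larger, and the library grows very quickly with peptide mass.

```python
from itertools import permutations

# Illustrative subset of monoisotopic residue masses in Da (an assumption
# for brevity; the full set of allowed amino acids would be used in practice).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "P": 97.05276, "V": 99.06841}
WATER = 18.010565


def allowed_compositions(target_mass, tol_ppm=30.0, masses=RESIDUE_MASS):
    """All combinations (multisets) of residues whose neutral peptide mass
    is within `tol_ppm` parts per million of `target_mass`."""
    tol = target_mass * tol_ppm * 1e-6
    residues = sorted(masses, key=masses.get)  # lightest first, enables pruning
    found = []

    def extend(start, remaining, chosen):
        if chosen and abs(remaining) <= tol:
            found.append(tuple(chosen))
        for i in range(start, len(residues)):
            m = masses[residues[i]]
            if m > remaining + tol:
                break  # this and all heavier residues overshoot the target
            chosen.append(residues[i])
            extend(i, remaining - m, chosen)  # reuse index i: repeats allowed
            chosen.pop()

    extend(0, target_mass - WATER, [])
    return found


def allowed_library(target_mass, tol_ppm=30.0, masses=RESIDUE_MASS):
    """All distinct linear sequences obtainable from the allowed combinations."""
    library = set()
    for composition in allowed_compositions(target_mass, tol_ppm, masses):
        library.update("".join(p) for p in permutations(composition))
    return sorted(library)


print(allowed_library(233.101165))  # target mass equal to that of Gly-Ala-Ser
```

Collecting the permutations in a set removes the duplicates that arise when a combination contains the same residue more than once.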
The nature of the fragmentation process from which the theoretical fragmentation spectrum is calculated for every peptide in the allowed library may be of any type known in the art, such as a mass spectrum or a protease or chemical fragmentation spectrum. Preferably, both the molecular mass and the fragmentation spectrum for the unknown peptide are obtained from a tandem mass spectrometer. The immonium ion region of the mass spectrum used to determine the molecular mass may also be used to identify amino acids contained in the unknown peptide. The identity of these amino acids is then used to constrain the allowed library. The amino acid sequence of the peptide from the allowed library of peptides, having a calculated fragmentation spectrum that best fits the experimental fragmentation spectrum of the unknown peptide, corresponds to the amino acid sequence of the unknown peptide.
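As an illustration of calculating a theoretical fragmentation spectrum and of constraining the library, the sketch below assumes singly protonated b- and y-type fragment ions, a common convention for collision-induced fragmentation that the text above does not prescribe. The function names and the three-residue mass table are hypothetical.

```python
# Illustrative subset of monoisotopic residue masses in Da.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203}
WATER = 18.010565
PROTON = 1.007276


def theoretical_fragments(sequence):
    """Singly charged b-ion and y-ion m/z values for every backbone cleavage.
    (The choice of b/y ion series is an assumption of this sketch.)"""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def filter_by_required_residues(library, required):
    """Keep only sequences containing every residue in `required`, e.g.
    residues identified from the immonium ion region."""
    return [seq for seq in library if all(aa in seq for aa in required)]


b, y = theoretical_fragments("GAS")
print(b, y)
print(filter_by_required_residues(["GAS", "GGA", "SAG"], ["S"]))
```

Applying the residue filter before the spectra are calculated reduces the number of theoretical spectra that must be compared.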
Although not required, the experimental fragmentation spectrum is generally normalized. A factor that is an indication of closeness-of-fit between the experimental fragmentation spectrum of the unknown peptide, polypeptide, or protein and each of the theoretical fragmentation spectra calculated for the peptide library may then be calculated to determine which of the theoretical fragmentation spectra best fits the experimental fragmentation spectrum. Preferably, peak values in the fragmentation spectra having an intensity greater than a predetermined threshold value are selected when calculating the indication of closeness-of-fit. The theoretical fragmentation spectrum that best fits the experimental fragmentation spectrum corresponds to the amino acid sequence in the allowed library that matches that of the unknown peptide, polypeptide, or protein.
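The normalization, thresholding, and closeness-of-fit steps can be sketched as follows. The particular score used here, the fraction of above-threshold experimental intensity that lies within a mass tolerance of some theoretical peak, is an assumption chosen for simplicity and is not the factor defined by the invention; the function names, the threshold, the tolerance, and the peak list are likewise hypothetical.

```python
def normalize(spectrum):
    """Scale intensities so the most intense peak equals 1.0.
    `spectrum` is a list of (m/z, intensity) pairs."""
    top = max(intensity for _, intensity in spectrum)
    return [(mz, intensity / top) for mz, intensity in spectrum]


def closeness_of_fit(experimental, theoretical_mz, threshold=0.05, tol=0.5):
    """Fraction of above-threshold experimental intensity explained by a
    theoretical peak within `tol` m/z units (an illustrative score only)."""
    peaks = [(mz, i) for mz, i in normalize(experimental) if i > threshold]
    total = sum(i for _, i in peaks)
    matched = sum(i for mz, i in peaks
                  if any(abs(mz - t) <= tol for t in theoretical_mz))
    return matched / total if total else 0.0


def best_match(experimental, candidates):
    """`candidates` maps each library sequence to its theoretical m/z list."""
    return max(candidates,
               key=lambda seq: closeness_of_fit(experimental, candidates[seq]))


# Made-up experimental peak list and two candidate theoretical spectra.
experimental = [(58.03, 50.0), (129.07, 100.0), (106.05, 80.0),
                (177.09, 40.0), (90.0, 2.0)]
candidates = {
    "GAS": [58.0287, 129.0658, 177.0870, 106.0499],
    "SAG": [88.0393, 159.0764, 147.0764, 76.0393],
}
print(best_match(experimental, candidates))
```

The weak peak at m/z 90.0 falls below the threshold after normalization and therefore does not penalize either candidate.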