This invention relates generally to proteomics and, more specifically, to de novo sequencing of polypeptides using mass spectrometry.
Proteomics can be described as the study of proteins expressed by a given cellular state, and like genomics, it is a global rather than a hypothesis-driven science. Questions for study are not asked in series, such as which protein causes a given biological activity or effect, but rather in parallel, such as how all of the expressed proteins in a given cell describe that cell. Mass spectrometry has been employed in proteomic studies as part of a global comparison of proteins that seeks to define the proteins characteristic of a state or to determine differences between states. An example would be the comparison of proteomes from cancerous versus normal cells with the intent of discovering a protein or proteins that are associated with cancer.
Mass spectrometry methods have been employed as a descriptive science to catalogue or compare proteins that represent a given cellular condition. Additionally, mass spectrometric methods have also been employed for determining the relative abundance of proteins expressed between two different biological samples. These methods allow the changes in protein expression between cells in different conditions or environments to be studied on a global scale so that information on protein expression can be gathered on multiple proteins in a single experiment. Assessing the relative abundance of proteins between different conditions has been based on differential mass labeling of proteins with stable isotopes either in vitro or in vivo. Mass spectrometry data from these experiments can also be used to search protein databases in hopes of identifying proteins within the sample. However, additional information about the samples, such as the correct sequence of proteins within the sample, is not available.
Numerous drawbacks exist which hinder the accuracy or efficiency of sequence identification using database searching. For example, protein identity cannot be determined for proteins whose sequence is not in a database, as is the case when the genome from which the protein is derived has not yet been sequenced. In addition, the increasing complexity of these databases can lead to several possible protein identifications for each polypeptide fragment, making it difficult to determine the true protein identity with confidence. Furthermore, database searching is limited in that this method cannot accurately detect mutations or post-translational modifications in proteins. Almost all protein sequences are post-translationally modified, and as many as 200 types of covalent modifications of amino acid residues are known. Post-translational modifications of proteins are often important for biological activity.
Mass spectrometry has been used to determine the amino acid sequence of proteins of interest without searching a database through a method called de novo sequencing. In this method, the difference in mass between mass spectrometry peaks is correlated to the mass of the amino acids that make up the polypeptide sequence. One limitation of mass spectrometry de novo sequencing methods is that the mass spectrometry data needs to be of high quality so that polypeptide mass spectrometry signals can be distinguished from non-peptide signals. High-throughput proteomics experiments, and experiments determining the relative mass of polypeptides between two samples, have not generated mass spectrometry data of sufficient quality for de novo sequence determination. Also, instruments with this capability are currently available in only a few laboratories since they are expensive and need highly skilled operators. Another limitation of mass spectrometry de novo sequencing methods is that polypeptides must be labeled in such a way that directionality can be assigned to the sequence. It is important to know whether a given fragment ion results from charge retention on the amino- or carboxyl-terminus in order to determine orientation of the sequence.
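The correlation between peak spacing and amino acid mass described above can be illustrated with a minimal sketch. It assumes an idealized, noise-free ladder of singly charged fragment masses and a fixed matching tolerance; the function names, the tolerance value, and the sample masses are hypothetical and chosen for illustration only. The residue masses are standard monoisotopic values, with leucine and isoleucine listed together because they share the same mass.

```python
# Monoisotopic amino acid residue masses in daltons.
# Leucine and isoleucine are isobaric, so they share one entry.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_for_difference(delta_mass, tol=0.02):
    """Return the residues whose mass matches delta_mass within tol (Da)."""
    return sorted(
        name for name, mass in RESIDUE_MASS.items()
        if abs(mass - delta_mass) <= tol
    )


def read_ladder(peaks, tol=0.02):
    """Assign candidate residues to each gap between consecutive peaks.

    peaks is a list of fragment masses assumed to belong to one ion
    series; real spectra also contain noise and missing fragments,
    which this sketch does not handle.
    """
    ordered = sorted(peaks)
    return [
        residues_for_difference(heavier - lighter, tol)
        for lighter, heavier in zip(ordered, ordered[1:])
    ]


if __name__ == "__main__":
    # Hypothetical ladder whose gaps correspond to G, A, and V.
    print(read_ladder([100.0, 157.02146, 228.05857, 327.12698]))
```

An empty list for a gap means that no single residue explains the spacing within the tolerance, which is how low-quality data or interfering non-peptide signals would show up in this simplified setting.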
Thus, there exists a need for efficient and reliable de novo sequencing from mass spectrometry data. The present invention satisfies this need and provides related advantages as well.
The invention provides a method of determining an amino acid sequence of a parent polypeptide. The method consists of (a) obtaining mass spectra of two or more differentially labeled polypeptide fragments of a parent polypeptide; (b) assigning a mass and a weighting characteristic to two or more paired signals having a difference in mass corresponding to an integer value of said differential label, the weighting characteristic combining properties of each signal within said paired signals; (c) selecting from the mass spectra a paired signal having the assigned mass and a weighting characteristic distinguishable from non-peptide signals, the assigned mass indicating the mass of a polypeptide fragment within the spectra; (d) determining the difference in mass of the polypeptide fragments; (e) assigning the mass differences a satisfying amino acid name, and (f) orienting the assigned amino acid names. Also provided is a method of determining the amino acid sequence of a polypeptide. The method consists of: (a) constructing a graph from mass spectra of two or more differentially labeled polypeptides, the graph comprising a node with mass m, number of labels n, intensity i, and mass differential of labels δ; (b) creating a node corresponding to a paired signal having masses of about m and about m+nδ, and (c) adding a labeled weighted directed edge to the graph between any two nodes corresponding to a mass of an amino acid, the labeled weighted directed edge combining properties of the paired signals.
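The graph construction summarized above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the label mass differential of 4.0 Da, the matching tolerance, the use of summed intensities as the weighting characteristic, and all function and class names are hypothetical. Only a small subset of amino acid residue masses is included for brevity.

```python
from dataclasses import dataclass

# Monoisotopic residue masses (Da); a small subset for brevity.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}


@dataclass(frozen=True)
class Node:
    mass: float       # m: mass of the lighter signal of the pair
    n_labels: int     # n: number of labels implied by the pair spacing
    intensity: float  # i: summed intensity of both signals (assumption)
    delta: float      # mass differential of the label


def find_paired_nodes(peaks, delta, max_labels=2, tol=0.02):
    """Create one node per pair of signals at about m and m + n*delta.

    peaks is a list of (mass, intensity) tuples. Signals without a
    partner at the expected spacing are treated as non-peptide signals
    and produce no node.
    """
    nodes = []
    for mass, intensity in peaks:
        for n in range(1, max_labels + 1):
            for other_mass, other_intensity in peaks:
                if abs(other_mass - (mass + n * delta)) <= tol:
                    nodes.append(
                        Node(mass, n, intensity + other_intensity, delta)
                    )
    return sorted(nodes, key=lambda node: node.mass)


def build_edges(nodes, tol=0.02):
    """Add a directed edge between nodes that differ by one residue mass.

    Each edge is (source, target, residue_name, weight). The weight
    combines the two nodes' intensities by summation (assumption).
    """
    edges = []
    for source in nodes:
        for target in nodes:
            diff = target.mass - source.mass
            for name, residue_mass in RESIDUE_MASS.items():
                if abs(diff - residue_mass) <= tol:
                    weight = source.intensity + target.intensity
                    edges.append((source, target, name, weight))
    return edges


if __name__ == "__main__":
    # Hypothetical spectrum: two labeled pairs plus one unpaired signal.
    spectrum = [(200.0, 10.0), (204.0, 12.0),
                (257.02146, 8.0), (261.02146, 9.0), (300.0, 50.0)]
    nodes = find_paired_nodes(spectrum, delta=4.0)
    for source, target, name, weight in build_edges(nodes):
        print(source.mass, "->", target.mass, name, weight)
```

In this sketch the intense signal at 300.0 has no partner at the label spacing and therefore never enters the graph, while the two paired signals become nodes joined by a single edge labeled with the residue that accounts for their mass difference. A sequence would then correspond to a path through such a graph, with the edge weights available for ranking competing paths.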