1. Field
This technology relates to the field of identifying a complex molecule having a structure that includes molecular subunits bound together at cleavage sites through the analysis of fragmentation spectra of the complex molecule.
2. Background
There are a number of techniques that can be used to identify complex molecules. Some of these techniques use fragmentation spectra of the complex molecules. Such fragmentation spectra can be generated by Tandem Mass Spectrometry (“MS/MS”) techniques as is well known in the art. Analysis of the fragmentation spectra can provide clues to the structure and the sequence of molecular subunits that make up the complex molecule.
An ideal spectrum would contain a complete “ladder” of b-ions or y-ions from which we could simply read off the molecular subunit sequence or its reversal. Hence generation of candidate sequences is usually formulated as a longest (or best) path problem in a peak graph, which has a vertex for each peak in the spectrum and an edge connecting two peaks if they differ by the mass of a possible molecular subunit. Due to missed cleavages, peak graphs might use vertices corresponding to small mass ranges rather than peaks, and/or include edges between vertices differing by the mass of a pair of residues.
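The peak-graph formulation described above can be illustrated with a minimal sketch. It assumes idealized peaks that are already expressed as prefix masses (no ion-type offsets, no charge correction), a small illustrative subset of monoisotopic residue masses, and a fixed mass tolerance. The names `RESIDUE_MASSES`, `build_peak_graph`, `longest_path`, and `tol` are hypothetical and are not taken from any cited work.

```python
# Monoisotopic residue masses in Daltons (illustrative subset only).
RESIDUE_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}


def build_peak_graph(peaks, tol=0.01):
    """Return (sorted_peaks, edges); an edge (i, j, residue) joins two peaks
    whose mass difference matches a residue mass within `tol` Daltons."""
    peaks = sorted(peaks)
    edges = []
    for i, low in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - low
            for name, mass in RESIDUE_MASSES.items():
                if abs(diff - mass) <= tol:
                    edges.append((i, j, name))
    return peaks, edges


def longest_path(peaks, edges):
    """Longest path by edge count in the peak graph (a DAG, since every
    edge runs from a lower mass to a higher mass)."""
    best = [""] * len(peaks)  # best[j]: longest residue string ending at vertex j
    # Edges are ordered by source index, so best[i] is final when it is read.
    for i, j, name in edges:
        if len(best[i]) + 1 > len(best[j]):
            best[j] = best[i] + name
    return max(best, key=len)


# A complete ladder for G-A-V plus one noise peak at 150.0.
spectrum = [0.0, 57.02146, 128.05857, 227.12698, 150.0]
sorted_peaks, graph_edges = build_peak_graph(spectrum)
sequence = longest_path(sorted_peaks, graph_edges)
```

In this sketch the noise peak at 150.0 acquires no edges and is simply skipped. A practical sequencer would score edges by peak intensity and would also admit edges spanning two residues to bridge missed cleavages.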
One way to identify the structure of a complex molecule is to identify and score candidate sequences for the complex molecule in light of the fragmentation spectra. In one approach, the candidates can be found in a database of known molecules. U.S. Pat. No. 5,538,897 to Yates, III et al. (hereby incorporated by reference in its entirety) teaches such a method where the complex molecule is a protein or peptide. The difficulty with this approach is that the complex molecule needs to have been previously identified or predicted and its characteristics stored in the database. Thus, a fragmentation spectrum for a complex molecule that has not been previously characterized and entered into the database will not be identified.
Another way to identify the structure of a complex molecule is to use the de novo sequencing approach. A method for sequencing peptides or proteins taking this approach is disclosed by U.S. Pat. No. 6,582,965 to Townsend et al. (hereby incorporated by reference in its entirety). In the de novo sequencing approach, the method attempts to generate all possible sequences of molecular subunits that could be consistent with the fragmentation spectrum. One problem with the de novo sequencing approach for proteins and peptides is that it is difficult to determine which peaks are y-ions (C-terminus fragments) and which peaks are b-ions (N-terminus fragments). The term “b/y ambiguity” identifies this difficulty. This difficulty is compounded because the fragmentation spectrum includes spectral peaks for noise (ions that are not fragments of the parent molecule) as well as spectral peaks for signals (ions that are fragments of the parent molecule). Noise in the fragmentation spectra decreases the probability of correctly sequencing a complex molecule from fragmentation spectra.
Furthermore, it is very rare for a peptide spectrum to have a complete ladder of b- or y-ions, and hence most sequencers attempt to form a ladder from a mixture of the two types. Lutefisk (see Implementation and Uses of Automated De Novo Peptide Sequencing by Tandem Mass Spectrometry, by Taylor et al. (Anal. Chem., 73 (2001), 2594-2604)) turns each peak into two vertices, one at the observed mass and the other at the complementary mass. Such complementation has two drawbacks: in effect it adds many noise peaks, and it allows the use of a single peak as both a b- and a y-ion. A paper entitled A Dynamic Programming Approach to De Novo Peptide Sequencing by Mass Spectrometry, by Chen et al. (J. Computational Biology, 8 (2001), 325-337) showed how to correct the latter drawback with a longest-path algorithm that disallows the simultaneous use of both vertices from a single peak. A paper entitled An Effective Algorithm for the Peptide De Novo Sequencing from MS/MS Spectrum, by Ma et al. (Symp. Comb. Pattern Matching, 2003, 266-278) teaches a more subtle fix; they allow the simultaneous use of both vertices, but do not score the peak twice. None of these solutions, however, addresses the larger drawback: doubling the number of noise peaks.
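The effect of complementation can be seen in a minimal sketch. It assumes a simplified complement of `parent_mass - m`, ignoring the ion-type and charge offsets a real sequencer would apply, and it is not the code of Lutefisk or of any cited paper. The names `complement_peaks` and `uses_peak_twice` are hypothetical.

```python
def complement_peaks(peaks, parent_mass):
    """Turn each peak into two vertices: one at the observed mass and one at
    the complementary mass. Each vertex keeps the index of its source peak."""
    vertices = []
    for idx, mass in enumerate(peaks):
        vertices.append((mass, idx, "observed"))
        vertices.append((parent_mass - mass, idx, "complement"))
    return sorted(vertices)


def uses_peak_twice(path):
    """True if a path contains both vertices derived from one peak, i.e. the
    same peak is being read as both a b-ion and a y-ion."""
    source_ids = [idx for _, idx, _ in path]
    return len(source_ids) != len(set(source_ids))


observed = [100.0, 250.0, 300.0]
vertices = complement_peaks(observed, parent_mass=400.0)

# Both vertices below originate from peak 0, so this path double-counts it.
bad_path = [(100.0, 0, "observed"), (300.0, 0, "complement")]
ok_path = [(100.0, 0, "observed"), (250.0, 1, "observed")]
```

Every peak, signal or noise, contributes two vertices, so a spectrum with n noise peaks yields 2n noise vertices. A check such as `uses_peak_twice` corresponds to the restriction on using both vertices from a single peak, but it does nothing about the doubled noise.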
A paper entitled New Computational Approaches for De Novo Peptide Sequencing from MS/MS Experiments, by Lubeck et al. (Proc. IEEE, 90 (2002), 1868-1874) proposed the idea of classifying peaks as b-, y-, or “other” prior to running the longest-path algorithm. The longest-path algorithm would use the peak classifications and would avoid complementing every peak. Lubeck et al. did not disclose algorithms or results.
A paper entitled PPM-Chain—De Novo Peptide Identification Program Comparable in Performance to Sequest, by Day et al., (Proc. IEEE Computational Systems Bioinformatics, 2004, 505-508) independently proposed peak classification (as an alternative rather than as an adjunct to the longest-path algorithm), and built a three-class classifier keying on b/y pairs and the “neutral loss neighborhood” (about 30 Daltons) around each peak.
A paper entitled Separation of Ion Types in Tandem Mass Spectrometry Data Interpretation—A Graph-Theoretic Approach, by Yan et al., (Proc. IEEE Computational Systems Bioinformatics, 2004, 236-244) disclosed the same idea and formulated the classification problem as a graph tripartition problem. They disclosed an exponential-time algorithm for an exact solution to the problem.
Thus, it would be advantageous to find a method that classifies spectral peaks in the fragmentation spectra as b-ions and y-ions, with weights for classification confidence, that does not increase the noise in the fragmentation spectra, and that uses a less computationally demanding algorithm than that of Yan et al.