1. Technological Field
The disclosed technology relates to the field of bio-informatics.
2. Background Art
The technology disclosed herein relates to the problems of identifying a macromolecule made up of molecular subunits that are bound at cleavage sites. The identification can be accomplished through the analysis of fragmentation spectra of the macromolecule or of portions of the macromolecule. Such fragmentation spectra can be generated by Tandem Mass Spectrometry (“MS/MS”) techniques as are well known in the art.
One skilled in the art will understand that a tandem mass spectrometer generates a fragmentation spectrum containing dissociation spectrum data by selecting charged molecules (the parent ions) that have approximately the same mass-to-charge-ratio “m/z” (generally within a narrow tolerance) in a first stage of the tandem mass spectrometer, causing the selected parent ions to be fragmented at cleavage sites in a second stage, and accumulating the count of the resulting fragments in m/z histogram bins. A number of these bins can represent a single spectral peak. The height, the area, or a combination of the height and area of the spectral peak can be used to calculate the “intensity” of the spectral peak. The dissociation spectrum data making up the fragmentation spectrum from the tandem mass spectrometer can also include the m/z used at the first stage to select the parent ion. The z for the parent ion m/z is often 2 or 3 (thus requiring additional computational overhead for search techniques that use the parent ion mass); the z for fragments of the parent ion generally is 1, which simplifies the determination of the fragment's mass.
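The accumulation of fragment counts into m/z histogram bins, and the two ways of measuring a spectral peak (height and area), can be illustrated with a minimal sketch. The function names `bin_fragments` and `peak_intensity`, the fixed bin width, and the sample values are assumptions made for illustration only; they do not describe any particular instrument.

```python
from collections import Counter


def bin_fragments(mz_values, bin_width=1.0):
    """Accumulate fragment counts in fixed-width m/z histogram bins.

    Returns a Counter mapping bin index -> number of fragments in that bin.
    """
    counts = Counter()
    for mz in mz_values:
        counts[int(mz // bin_width)] += 1
    return counts


def peak_intensity(counts, first_bin, last_bin):
    """Return (height, area) for a peak spanning first_bin..last_bin inclusive.

    Height is the largest single-bin count; area is the sum over the bins.
    """
    heights = [counts.get(b, 0) for b in range(first_bin, last_bin + 1)]
    return max(heights), sum(heights)


# Hypothetical fragment m/z readings; the first six form one peak.
readings = [100.1, 100.4, 101.2, 101.3, 101.9, 102.5, 250.0]
histogram = bin_fragments(readings)
height, area = peak_intensity(histogram, 100, 102)
```

Either value, or a combination of the two, could serve as the "intensity" of the peak.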
The parent ion's mass along with the dissociation spectrum data can be used by well-known sequencing techniques to identify the parent ion. One skilled in the art will understand that if a molecular fragment is singly ionized, its m/z represents the real mass of the molecular fragment. If the same molecular fragment is doubly ionized, the m/z for that molecular fragment will be approximately ½ the real mass of the fragment.
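The relationship between m/z and mass described above can be sketched as follows. This is a minimal sketch, assuming positive-mode ionization by protonation, so that each unit of charge adds one proton mass. That assumption, the constant `PROTON_MASS`, and the function names `neutral_mass` and `mz_from_mass` are illustrative and are not taken from the disclosure, which treats m/z simply as mass divided by charge.

```python
PROTON_MASS = 1.007276  # Daltons; assumes each charge is carried by one added proton


def mz_from_mass(mass, z):
    """m/z observed for a molecule of the given neutral mass carrying z protons."""
    return mass / z + PROTON_MASS


def neutral_mass(mz, z):
    """Neutral mass recovered from an observed m/z at charge state z."""
    return z * (mz - PROTON_MASS)


# A hypothetical 1000 Da fragment observed at charge states 1 and 2.
singly = mz_from_mass(1000.0, 1)
doubly = mz_from_mass(1000.0, 2)
```

Under this assumption the doubly ionized m/z is close to, but not exactly, half the singly ionized value. When z is not known in advance (for example, a parent ion with z of 2 or 3), each plausible charge state yields a different candidate mass, which is one source of the additional search overhead.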
By identifying parent ions in a database of molecule descriptions one can select descriptions of macromolecules that contain the parent ions.
However, if the tandem mass spectrometer is operated in a “wide-window” mode (thus allowing molecules having significantly different masses to enter the second stage of the tandem mass spectrometer) the resulting dissociation spectrum data will include contributions from fragments of parent ions having different masses. In addition, the masses of the parent ions will be less accurately known. Thus, prior art molecular sequencing techniques that require a substantially exact mass for the parent ion will fail.
All identification techniques use some amount of de novo processing (which processes the dissociation spectrum data without reference to a database of known macromolecules), followed by some amount of database search that compares information gathered from one or more spectra with entries from a database of molecule descriptions. U.S. Pat. No. 5,538,897 to Yates and Eng teaches a nearly pure database search method where the macromolecule is a protein or peptide. Yates computes only a mass for the parent ion from the dissociation spectrum data before referencing the database of molecule descriptions.
The ‘sequence tag’ approach of Mann and Wilm (see: Error-Tolerant Identification of Peptides in Sequence Databases by Peptide Sequence Tags, Anal. Chem., 1994, 66, 4390-4399) makes greater use of de novo processing than does Yates. In this approach, one or more short subsequences of molecular subunits are computed from the fragmentation spectrum (for example and in the case of a peptide, a subsequence of three consecutive amino acids) and these ‘sequence tags’ are used to filter entries to find candidates for the parent ion from the database of molecule descriptions. One skilled in the art will understand that candidate entries can be found in the database of molecule descriptions either by a linear search or by an indexed search. The candidate entries found in the database of molecule descriptions can then be scored in detail against the fragmentation spectrum to determine the probability that the entry actually represents the parent ion.
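The use of a sequence tag to filter database entries, by either a linear search or an indexed search, can be sketched as follows. This is a simplified illustration, not the method of Mann and Wilm: it matches only the tag substring and ignores any mass information. The function names and the three sample entries are hypothetical.

```python
from collections import defaultdict


def linear_tag_search(database, tag):
    """Scan every entry and keep those whose sequence contains the tag."""
    return [name for name, seq in database.items() if tag in seq]


def build_tag_index(database, tag_length=3):
    """Map every tag_length-letter subsequence to the entries containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - tag_length + 1):
            index[seq[i:i + tag_length]].add(name)
    return index


# Hypothetical database of molecule descriptions (one-letter amino acid codes).
database = {
    "entry_1": "MKWVTFISLLLLFSSAYSR",
    "entry_2": "GVFRRDTHKSEIAHRFK",
    "entry_3": "DLGEEHFKGLVLIAFSQYLQ",
}

index = build_tag_index(database)
linear_hits = linear_tag_search(database, "HFK")
indexed_hits = index.get("HFK", set())
```

The linear search touches every entry on every query, whereas the index is built once and then answers each tag query with a single lookup. In both cases the returned candidates would still need to be scored against the fragmentation spectrum.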
De novo sequencing (see: C. Bartels, Fast algorithm for peptide sequencing by mass spectrometry, Biomedical and Environmental Mass Spectrometry 19 (1990), 363-368; and J. Taylor and R. Johnson, Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry, Anal. Chem. 73 (2001), 2594-2605) makes still greater use of de novo processing. It computes one or more hypothetical sequences of molecular subunits that match a fragmentation spectrum. These hypothetical sequences can then be used to filter the database of molecule descriptions, in a style similar to the well-known “BLAST search”, to return descriptions of parent ion candidates from the database of molecule descriptions.
Generally, a method using more de novo processing requires a higher quality fragmentation spectrum than does a method using less de novo processing. In particular, de novo processing works very poorly with mixture spectra, that is, fragmentation spectra resulting from fragments of more than one parent ion. On the other hand, a method using more de novo processing is generally faster, because it returns fewer descriptions of candidates for the parent ion, and is generally more robust to discrepancies between the macromolecules represented by the fragmentation spectrum and the descriptions of known molecules in the database. Discrepancies can include database errors, polymorphic molecules, modified molecules, molecules bound to salt ions, and many other possibilities.
In all three approaches (database search, sequence tag search and de novo sequencing) the database of molecule descriptions is filtered to return descriptions of macromolecules that could represent the parent ion. This reduces the number of candidate descriptions that need to be processed by a computationally expensive scoring procedure.
The mass of a parent ion is a very weak filter for a database of peptides. For ion-trap instruments, the parent mass is typically known to within a range of about 3 Daltons (for more accurate instruments, due to the clustering of peptide masses, this value may still be known only to the closest integer). With a 3-Dalton range, each residue in a peptide has about a 3% chance of completing a peptide that fits the parent ion's mass (because residues average about 100 Daltons). Thus, accessing a 1-billion-residue database of peptide descriptions by the mass of the parent ion will return about 30 million candidates, each of which needs to be scored. Consequently, the processing time available severely limits the complexity of the scorer.
A three-letter sequence tag (for example, in a peptide, a sequence of three amino acids) is a much stronger filter for a database of peptides than the mass of a parent ion. Each residue in a peptide has about a 0.013% chance of completing a given three-letter tag (about 1 chance in 20 for each of the three letters, so 1/(20*20*20) chance overall). Thus, using a sequence tag as a filter returns about 130,000 candidates from the 1-billion-residue database instead of the 30 million candidates returned using the mass of a parent ion as a filter. However, it is difficult to compute a three-letter ‘sequence tag’ (especially if the provided spectrum is of poor quality, or if the provided spectrum is of a mixture of parent ions).
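The candidate counts quoted for the two filters follow from simple arithmetic, reproduced in the sketch below. The constants are the round figures given in the text; the variable names are illustrative, and the model assumes all 20 amino acids are equally likely.

```python
DATABASE_RESIDUES = 1_000_000_000  # 1-billion-residue database
AVERAGE_RESIDUE_MASS = 100.0       # Daltons, approximate average per residue
MASS_WINDOW = 3.0                  # Daltons, parent mass uncertainty (ion trap)
ALPHABET_SIZE = 20                 # amino acids, assumed equally likely
TAG_LENGTH = 3                     # letters in the sequence tag

# Chance that a given residue completes a peptide fitting the parent mass.
mass_hit_rate = MASS_WINDOW / AVERAGE_RESIDUE_MASS

# Chance that a given residue completes a specific three-letter tag.
tag_hit_rate = (1 / ALPHABET_SIZE) ** TAG_LENGTH

mass_candidates = DATABASE_RESIDUES * mass_hit_rate
tag_candidates = DATABASE_RESIDUES * tag_hit_rate
selectivity_gain = mass_candidates / tag_candidates
```

The mass filter yields roughly 30 million candidates. The unrounded tag hit rate of 1/8000 (0.0125%) yields roughly 125,000 candidates; the figure of 130,000 in the text results from rounding the rate to 0.013%. Either way, the tag filter is more than two hundred times as selective as the mass filter.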
An article by Tang et al. (Discovering known and unanticipated protein modifications using MS/MS database searching, Analytical Chem. 77 (2005), 3931-3946) teaches an indexing method using single predicted peaks along with the parent ion mass. This approach does not provide a sufficiently powerful filter for spectra acquired in wide-window mode.
There exists a need for a faster, more sensitive and more robust way to select descriptions of candidate parent ions from a database of molecule descriptions.