Most protein identifications today are performed by matching spectra against databases using programs such as SEQUEST or MASCOT. While these tools are invaluable, they are already too slow for matching large MS/MS datasets against large protein databases. Recent progress in mass spectrometry instrumentation (a single LTQ-FT mass spectrometer can generate 100,000 spectra per day) may soon make them obsolete. Because SEQUEST compares every spectrum against every database peptide, a cluster of about 60 processors would be needed to analyze the spectra produced by a single such instrument in real time when searching the Swiss-Prot database. A time-consuming search for post-translational modifications may increase the running time by further orders of magnitude. New solutions are needed to deal with the stream of data produced by shotgun proteomics projects. Algorithms have recently been developed that prune (X!Tandem) and filter (InsPecT) databases to speed up the search (see Tanner et al., Anal Chem., 77(14):4626-39, Jul. 15, 2005, incorporated herein by reference). However, these tools still require comparison of every spectrum against the smaller database.
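The throughput figures above imply a per-spectrum compute budget, which the following minimal sketch makes explicit. It assumes only the two numbers stated in the text (about 60 processors and 100,000 spectra per day). The function names are hypothetical and the calculation is an illustration, not a benchmark of SEQUEST.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def cpu_seconds_per_spectrum(processors, spectra_per_day):
    """Processor-seconds available per spectrum when keeping pace in real time."""
    return processors * SECONDS_PER_DAY / spectra_per_day


def processors_needed(spectra_per_day, cpu_seconds_each):
    """Smallest whole number of processors that keeps up with the instrument."""
    return math.ceil(spectra_per_day * cpu_seconds_each / SECONDS_PER_DAY)


# Figures taken from the text: ~60 processors for 100,000 spectra per day.
budget = cpu_seconds_per_spectrum(60, 100_000)
print(f"{budget:.2f} processor-seconds per spectrum")
```

Under these figures, each spectrum consumes roughly 52 processor-seconds. Any search mode that multiplies the per-spectrum cost, such as a search for post-translational modifications, multiplies the required cluster size by the same factor.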
Moreover, the common assumption that all proteins of interest are present in the database is often refuted by the limited availability of sequenced genomes and multiple mechanisms of protein variation. Well-known mechanisms of protein diversity include variable recombination and somatic hypermutation of immunoglobulin genes. The vital importance of some of these novel proteins is directly reflected by the success of monoclonal antibody drugs such as Rituxan™, Herceptin™ and Avastin™, all derived from proteins that are not directly inscribed in any genome. Similarly, multiple commercial drugs have been developed from proteins obtained from species whose genomes are not known. In particular, peptides and proteins isolated from venom have provided essential clues for drug design; examples include drugs for controlling blood coagulation and drugs for breast and ovarian cancer treatment. Nevertheless, the genomes of the venomous snakes, scorpions, and snails are unlikely to become available anytime soon. Despite the vital importance of novel proteins, the mainstream method for protein sequencing is still the restrictive and low-throughput Edman degradation, a task made difficult by protein purification procedures, post-translational modifications and blocked protein N-termini. These problems gain additional relevance when one considers the unusually high level of variability and post-translational modification in venom proteins. The primary function of venom is to immobilize prey, and prey animals vary in their susceptibility to venom. As a result, venom composition within a snake species shows considerable geographical variation, an important consideration because snake bites (even by snakes of the same species) may require different treatments. Moreover, the amount and number of different proteins and isoforms vary with gender, diet, etc.
Mass spectrometry provides detailed information about the molecules being analyzed, including high mass accuracy. It is also a process that can be easily automated. However, high-resolution MS alone fails to perform against unknown or bioengineered agents, or in environments where there is a high background level of bioagents (“cluttered” background). Low-resolution MS can fail to detect some known agents, if their spectral lines are sufficiently weak or sufficiently close to those from other peptides in the sample. DNA chips with specific probes can only determine the presence or absence of specifically anticipated peptides. Because there are hundreds of thousands of species of benign bacteria, some very similar in sequence to threat organisms, even arrays with 10,000 probes lack the breadth needed to detect a particular organism.
Antibodies face more severe diversity limitations than arrays. If antibodies are designed against highly conserved targets to increase diversity, the false alarm problem will dominate, again because threat organisms are very similar to benign ones. Antibodies are only capable of detecting known agents in relatively uncluttered environments.
Reports have described detection of PCR products using high-resolution electrospray ionization-Fourier transform-ion cyclotron resonance mass spectrometry (ESI-FT-ICR MS). Accurate measurement of exact mass, combined with knowledge of the number of at least one nucleotide, allowed calculation of the total base composition for PCR duplex products of approximately 100 base pairs (Aaserud et al., J. Am. Soc. Mass Spec. 7:1266-1269, 1996; Muddiman et al., Anal. Chem. 69:1543-1549, 1997; Wunschel et al., Anal. Chem. 70:1203-1207, 1998; Muddiman et al., Rev. Anal. Chem. 17:1-68, 1998). ESI-FT-ICR MS has also been used to determine the mass of double-stranded, 500 base-pair PCR products via the average molecular mass (Hurst et al., Rapid Commun. Mass Spec. 10:377-382, 1996). The use of matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) mass spectrometry for characterization of PCR products has also been described (Muddiman et al., Rapid Commun. Mass Spec. 13:1201-1204, 1999). However, the degradation of DNAs over about 75 nucleotides observed with MALDI limited the utility of this method.
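The base-composition calculation mentioned above, in which a measured mass together with the known count of one nucleotide constrains the remaining counts, can be sketched as follows. This is a simplified illustration and not the procedure of the cited reports. It treats a single strand rather than a duplex, assumes the strand length is known, and uses approximate average residue masses with a conventional terminal correction. All function names and the tolerance parameter are hypothetical.

```python
# Approximate average masses (Da) of DNA nucleotide residues within a chain.
# These values and the terminal correction are assumptions for illustration.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
TERMINAL_CORRECTION = -61.96  # assumed adjustment for an unphosphorylated strand


def strand_mass(composition):
    """Approximate average mass of a single strand with the given base counts."""
    total = sum(RESIDUE_MASS[base] * n for base, n in composition.items())
    return total + TERMINAL_CORRECTION


def candidate_compositions(measured_mass, length, known_base, known_count, tol):
    """List base compositions consistent with the mass and one known base count."""
    others = [b for b in "ACGT" if b != known_base]
    remaining = length - known_count
    hits = []
    for n0 in range(remaining + 1):
        for n1 in range(remaining - n0 + 1):
            n2 = remaining - n0 - n1  # third count is fixed by the length
            comp = {known_base: known_count,
                    others[0]: n0, others[1]: n1, others[2]: n2}
            if abs(strand_mass(comp) - measured_mass) <= tol:
                hits.append(comp)
    return hits


# Hypothetical 100-nucleotide strand in which the adenine count (30) is known.
true_comp = {"A": 30, "C": 20, "G": 25, "T": 25}
measured = strand_mass(true_comp)
for comp in candidate_compositions(measured, 100, "A", 30, tol=0.5):
    print(comp)
```

With a loose tolerance, more than one composition may fall within the mass window. Tightening `tol` narrows the candidate list, which is why high mass accuracy matters for this kind of calculation.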
U.S. Pat. No. 5,849,492 describes a method for retrieval of phylogenetically informative DNA sequences, which comprises searching for a highly divergent segment of genomic DNA surrounded by two highly conserved segments, designing universal primers for PCR amplification of the highly divergent region, amplifying the genomic DNA by PCR using the universal primers, and then sequencing the gene to determine the identity of the organism.
U.S. Pat. No. 5,965,363 discloses methods for screening nucleic acids for polymorphisms by analyzing amplified target nucleic acids using mass spectrometric techniques, and procedures for improving the mass resolution and mass accuracy of these methods.
WO 99/14375 describes methods, PCR primers and kits for use in analyzing preselected DNA tandem nucleotide repeat alleles by mass spectrometry.
WO 98/12355 discloses methods of determining the mass of a target nucleic acid by mass spectrometric analysis, by cleaving the target nucleic acid to reduce its length, making the target single-stranded and using MS to determine the mass of the single-stranded shortened target. Also disclosed are methods of preparing a double-stranded target nucleic acid for MS analysis comprising amplification of the target nucleic acid, binding one of the strands to a solid support, releasing the second strand and then releasing the first strand which is then analyzed by MS. Kits for target nucleic acid preparation are also provided.
PCT WO97/33000 discloses methods for detecting mutations in a target nucleic acid by non-randomly fragmenting the target into a set of single-stranded nonrandom length fragments and determining their masses by MS.
U.S. Pat. Nos. 5,547,835, 5,605,798, 6,043,031, 6,197,498, 6,221,601, 6,221,605, 6,277,573, 6,235,478, 6,258,538, 6,300,076, 6,428,955 and 6,500,621 describe fast and highly accurate mass spectrometer-based processes for detecting the presence of a particular nucleic acid in a biological sample for diagnostic purposes.
WO 98/20166 describes processes for determining the sequence of a particular target nucleic acid by mass spectrometry. Processes for detecting a target nucleic acid present in a biological sample by PCR amplification and mass spectrometry detection are disclosed, as are methods for detecting a target nucleic acid in a sample by amplifying the target with primers that contain restriction sites and tags, extending and cleaving the amplified nucleic acid, and detecting the presence of extended product, wherein the presence of a DNA fragment of a mass different from wild-type is indicative of a mutation. Each of the publications and patent documents cited herein is incorporated herein by reference.
One algorithmic approach recognized the conserved regions of genomic space. Regions of variability flanked these conserved regions. Although the nucleotide sequence of the variable region was unknown, the understanding of the conserved regions, together with the absolute limitation on nucleotide options (A,C,T,G,U) simplified the list of potential sequences, based on molecular weight.
Each of the foregoing requires substantial understanding of the peptide of concern. In many cases, specific PCR primers and/or molecular tags are required. There is clearly a need for an algorithmic method for identifying a peptide without the foregoing limitations.