1. Field of the Invention
The present invention generally relates to data processing in the field of data mining and, more particularly, to methods, systems, and computer program products for mining mass spectral data for further analysis.
2. Description of the Background
Mass spectrometry (MS) instruments generate and analyze ions from chemical substances. These analyses yield mass spectra, which reflect the chemical nature of the substances analyzed. MS instruments can generate full-scan mass spectra, which represent all ions generated from chemical substances entering the MS instrument at any particular point in time. MS instruments can also generate tandem mass spectra (MS-MS spectra) by a process in which specific ions (precursor ions) are selected and then subjected to energetic dissociation, which produces fragment ions (product ions). The MS-MS spectrum records the distribution of product ions produced from a specific precursor ion, and specific structural features of the precursor species can be deduced from this information. Modern MS instruments are capable of automated acquisition of large numbers of full-scan mass spectra or MS-MS spectra. The automated, high-throughput evaluation of these spectra represents a significant challenge to the utilization of data generated by MS instruments.
Application of modern MS techniques to protein and peptide analysis has made feasible the large-scale analysis of cellular proteomes, which comprise the collection of all proteins in an organism or any subset thereof. Protein components of even highly complex proteomes have been identified by digestion of the proteins to peptides, followed by MS analysis of the peptides. A widely used MS analysis is liquid chromatography coupled to tandem MS (LC-MS-MS) with triple quadrupole, quadrupole-ion trap, quadrupole-time-of-flight or tandem time-of-flight MS instruments, which provide useful information in the form of collision-induced dissociation (CID) spectra for peptides. Peptide precursor ions subjected to CID undergo fragmentation to yield product ions, which are recorded in the MS-MS spectra. These spectra contain signals for a variety of product ions, including y-ions, b-ions and related species arising from fragmentation of the peptide backbone. In addition, these MS-MS spectra contain signals indicating the presence and sequence location of peptide modifications.
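By way of illustration only, the relationship between a peptide sequence and its b-ion and y-ion series can be sketched as follows. The sketch assumes singly charged product ions and standard monoisotopic residue masses; the function name fragment_ions and the example peptide are hypothetical and are not taken from any particular instrument or program.

```python
# Monoisotopic amino acid residue masses in daltons (standard published values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, added to the C-terminal (y-ion) fragments
PROTON = 1.00728   # charge carrier for a singly charged ion


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i - 1] holds the first i residues (N-terminal fragment);
    y_ions[i - 1] holds the last i residues (C-terminal fragment).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


# Hypothetical example peptide, used only to exercise the function.
b_ions, y_ions = fragment_ions("PEPTIDE")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:.4f}   y{i} = {y:.4f}")
```

Because each backbone cleavage yields a complementary b-ion and y-ion pair, the spacing between successive ions in either series corresponds to one residue mass, which is what allows sequence features to be read from the spectrum. A modification on a residue would shift every ion containing that residue by the mass of the modification.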
Identification of peptide sequences from MS-MS spectra may be done by direct interpretation (de novo sequence analysis). Once a peptide sequence has been determined, the source protein may be identified by comparing the peptide sequence to a database of protein sequences. However, typical LC-MS-MS analyses generate hundreds to thousands of MS-MS spectra. The sheer volume of data thus precludes proteome analysis involving de novo sequence interpretation.
Yates, III et al. (U.S. Pat. No. 5,538,897) implemented a computer program to correlate MS-MS data with protein and nucleotide sequences stored in databases. This program correlates MS-MS spectra with database sequences that match the measured mass of the peptide precursor ion. It thus obviates de novo sequence interpretation and greatly speeds protein identification from MS-MS data.
However, a major problem in proteome analysis is the heterogeneity of proteins due to numerous posttranslational modifications, splice variants, gene polymorphisms and mutations. Indeed, any gene may give rise to multiple protein products. Although the program of Yates, III et al. can allow for the presence of certain anticipated modifications, the unpredictable and diverse nature of protein modifications often yields peptides whose masses differ from those of the peptides in sequence databases. Such unanticipated protein modifications prevent correct protein identification by this program. These circumstances illustrate the need for data evaluation tools that can detect MS-MS data that correspond to variant peptide forms.
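The effect of an unanticipated modification on a search that filters candidates by precursor mass can be illustrated with a simplified sketch. This is not the program of U.S. Pat. No. 5,538,897; the function names, the three-entry peptide list, the 0.01 Da tolerance and the choice of modifications are all assumptions made for illustration.

```python
# Monoisotopic amino acid residue masses in daltons (standard published values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PHOSPHO = 79.96633   # phosphorylation (HPO3), treated here as anticipated
ACETYL = 42.01057    # acetylation (C2H2O), treated here as unanticipated


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_by_precursor_mass(measured_mass, database, tolerance=0.01,
                            anticipated_shifts=(0.0,)):
    """Return database peptides whose mass, plus any anticipated
    modification shift, lies within `tolerance` of the measured mass."""
    hits = []
    for peptide in database:
        base = peptide_mass(peptide)
        if any(abs(base + shift - measured_mass) <= tolerance
               for shift in anticipated_shifts):
            hits.append(peptide)
    return hits


# Hypothetical database and measurements.
database = ["PEPTIDE", "SAMPLER", "GLYCINE"]
unmodified = peptide_mass("PEPTIDE")
shifts = (0.0, PHOSPHO)

print(match_by_precursor_mass(unmodified, database))
print(match_by_precursor_mass(unmodified + PHOSPHO, database,
                              anticipated_shifts=shifts))
print(match_by_precursor_mass(unmodified + ACETYL, database,
                              anticipated_shifts=shifts))
```

In this sketch the unmodified and phosphorylated forms are retrieved because their mass shifts were listed in advance, whereas the acetylated form falls outside every mass window and yields no candidates, even though its MS-MS spectrum would share many product ions with the unmodified peptide.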
The general problem of detecting and characterizing unanticipated peptide variants remains a significant barrier to comprehensive characterization of complex peptide mixtures.