1. Field of the Invention
The present invention relates to chemical analysis systems. More particularly, it relates to systems that are useful for the analysis of complex mixtures of molecules, including large organic molecules such as proteins, environmental pollutants, and petrochemical compounds, to methods of analysis used therein, and to a computer program product having computer code embodied therein for causing a computer, or a computer and a mass spectrometer in combination, to effect such analysis. Still more particularly, it relates to such systems that have mass spectrometer portions.
2. Prior Art
The race to map the human genome in the past several years has created a new scientific field and industry named genomics, which studies DNA sequences to search for genes and gene mutations that are responsible for genetic diseases through their expressions in messenger RNAs (mRNA) and the subsequent coding of peptides which give rise to proteins. It has been well established in the field that, while the genes are at the root of many diseases including many forms of cancer, the proteins to which these genes translate are the ones that carry out the real biological functions. The identification and quantification of these proteins and their interactions thus serve as the key to the understanding of disease states and the development of new therapeutics. It is therefore not surprising to see the rapid shift in both commercial investment and academic research from genes (genomics) to proteins (proteomics), after the successful completion of the human genome project and the identification of some 35,000 human genes in the summer of 2000. Unlike genomics, which has a more definable end for each species, proteomics is much more open-ended, as any change in gene expression level, environmental factors, and protein-protein interactions can contribute to protein variations. In addition, the genetic makeup of an individual is relatively stable whereas the protein expressions can be much more dynamic depending on various disease states and many other factors. In this “post genomics era,” the challenges are to analyze the complex proteins (i.e., the proteome) expressed by an organism in tissues, cells, or other biological samples to aid in the understanding of the complex cellular pathways, networks, and “modules” under various physiological conditions. The identification and quantitation of the proteins expressed in both normal and diseased states play a critical role in the discovery of biomarkers or target proteins.
The challenges presented by the fast-developing field of proteomics have brought an impressive array of highly sophisticated scientific instrumentation to bear, from sample preparation, sample separation, imaging, and isotope labeling to mass spectral detection. Large data arrays of higher and higher dimensions are being routinely generated in both industry and academia around the world in the race to reap the fruits of genomics and proteomics. Due to the complexities and the sheer number of proteins (easily reaching into thousands) typically involved in proteomics studies, complicated, lengthy, and painstaking physical separations are performed in order to identify and sometimes quantify individual proteins in a complex sample. These physical separations create tremendous challenges for sample handling and information tracking, not to mention the days, weeks, and even months it typically takes to fully elucidate the content of a single sample.
While there are only about 35,000 genes in the human genome, there are an estimated 500,000 to 2,000,000 proteins in the human proteome that could be studied both for the general population and for individuals under treatment or other clinical conditions. A typical sample taken from cells, blood, or urine, for example, usually contains up to several thousand different proteins in vastly different abundances. Over the past decade, the industry has popularized a process that includes multiple stages in order to analyze the many proteins existing in a sample. This process is summarized in Table 1 with the following notable features:
TABLE 1
A Typical Proteomics Process: Time, Cost, and Informatics Needs

Steps: Proteomics Process

Sample collection:
    Isolate proteins from biological samples such as blood, tissue, urine, etc.
    Instrument cost: minimal; Time: 1-3 hours
    Mostly liquid phase sample
    Need to track sample source/preparation conditions

Gel separation:
    Separate proteins spatially through gel electrophoresis to generate up to several thousand protein spots
    Instrument cost: $150K; Time: 24 hours
    Liquid into solid phase
    Need to track protein separation conditions and gel calibration information

Imaging and spot cutting:
    Image, analyze, identify protein spots on the gel with MW/pI calibration, and spot cutting.
    Instrument cost: $150K; Time: 30 sec/spot
    Solid phase
    Track protein spot images, image processing parameters, gel calibration parameters, molecular weights (MW) and pI's, and cutting records

Protein digestion:
    Chemically break down proteins into peptides
    Instrument cost: $50K; Time: 3 hours
    Solid to liquid phase
    Track digestion chemistry & reaction conditions

Protein spotting or sample preparation:
    Mix each digested sample with mass spectral matrix, spot on sample targets, and dry (MALDI) or sample preparation for LC/MS(/MS)
    Instrument cost: $50K; Time: 30 sec/spot
    Liquid to solid phase
    Track volumes & concentrations for samples/reagents

Mass spectral analysis:
    Measure peptide(s) in each gel spot directly (MALDI) or via LC/MS(/MS)
    Instrument: $200K-650K; Time: 1-10 sec/spot on MALDI or 30 min/spot on LC/MS(/MS)
    Solid phase on MALDI or liquid phase on LC/MS(/MS)
    Track mass spectrometer operation, analysis, and peak processing parameters

Protein database search:
    Search private/public protein databases to identify proteins based on unique peptides
    Instrument cost: minimal; Time: 1-60 sec/spot

Summary:
    Instrument cost: $600K-$1M
    Time/sample: several days minimal

    a. It could take up to several days or weeks or even months to complete the analysis of a single sample.
    b. The bulky hardware system costs $600,000 to $1M with significant operating (labor and consumables), maintenance, and lab space cost associated with it.
    c. This is an extremely tedious and complex process that includes several different robots and a few different types of instruments to essentially separate one liquid sample into hundreds to thousands of individual solid spots, each of which needs to be analyzed one-at-a-time through another cycle of solid-liquid-solid chemical processing.
    d. It is not a small challenge to integrate these pieces/steps together for a rapidly changing industry, and as a result, there is not yet a commercial system that fully integrates and automates all these steps. Consequently, this process is fraught with human as well as machine errors.
    e. This process also calls for sample and data tracking from all the steps along the way, which is not a small challenge even for today's informatics.
    f. Even for a fully automated process with a complete sample and data tracking informatics system, it is not clear how these data ought to be managed, navigated, and most importantly, analyzed.
    g. At this early stage of proteomics, many researchers are content with qualitative identification of proteins. The holy grail of proteomics is, however, both identification and quantification, which would open doors to exciting applications not only in the area of biomarker identification for the purpose of drug discovery but also for clinical diagnostics, as evidenced by the intense interest generated from a recent publication (Petricoin, E. F. III et al., Lancet, Vol. 359, pp. 573-77, (2002)) on using protein profiles from blood samples for ovarian cancer diagnostics.
The current process cannot be easily adapted for quantitative analysis due to protein loss, sample contamination, or lack of gel solubility, although attempts have been made at quantitative proteomics with the use of complex chemical processes such as ICAT (isotope-coded affinity tags), a general approach to quantitation wherein proteins or protein digests from two different sample sources are labeled by a pair of isotope atoms, and subsequently mixed in one mass spectrometry analysis (Gygi, S. P. et al. Nat. Biotechnol. 17, 994-999 (1999)).
Isotope-coded affinity tags (ICAT) is a commercialized version of the approach introduced recently by Applied Biosystems of Foster City, Calif. In this technique, proteins from two different cell pools are labeled with regular reagent (light) and deuterium-substituted reagent (heavy), and combined into one mixture. After trypsin digestion, the combined digest mixtures are separated by biotin-affinity chromatography to yield a cysteine-containing peptide mixture. This mixture is further separated by reverse phase HPLC and analyzed by data-dependent mass spectrometry followed by database search.
This method significantly simplifies a complex peptide mixture into a cysteine-containing peptide mixture and allows simultaneous protein identification by SEQUEST database search and quantitation by the ratio of light peptides to heavy peptides. Similar to LC/LC/MS/MS, ICAT also circumvents the insolubility problem, since both techniques digest the whole protein mixture into peptide fragments before separation and analysis.
While very powerful, the ICAT technique requires a multi-step labeling and pre-separation process, resulting in the loss of low-abundance proteins, adding reagent cost, and further reducing the throughput of the already slow proteomic analysis. Since only cysteine-containing peptides are analyzed, the sequence coverage is typically quite low with ICAT. As is the case in a typical LC/MS/MS experiment, the protein identification is achieved through a limited number of MS/MS analyses on what are hoped to be signature peptides, resulting in only one and at most a few labeled peptides for ratio quantitation.
Liquid chromatography interfaced with tandem mass spectrometry (LC/MS/MS) has become a method of choice for protein sequencing (Yates Jr. et al., Anal. Chem. 67, 1426-1436 (1995)). This method involves a few processes including digestion of proteins, LC separation of peptide mixtures generated from the protein digests, MS/MS analysis of the resulting peptides, and database search for protein identification. The key to effectively identifying proteins with LC/MS/MS is to produce as many high quality MS/MS spectra as possible to allow for reliable matching during database search. This is achieved by a data-dependent scanning technique in a quadrupole or an ion trap instrument. With this technique, the mass spectrometer checks the intensities and signal to noise ratios of the most abundant ion(s) in a full scan MS spectrum and performs MS/MS experiments when the intensities and signal to noise ratios of the most abundant ions exceed a preset threshold. Usually the three most abundant ions are selected for the product ion scans to maximize the sequence information and minimize the time required, as the selection of more than three ions for MS/MS experiments would possibly result in missing other qualified peptides currently eluting from the LC to the mass spectrometer.
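The data-dependent selection rule described above can be illustrated with a minimal sketch. The function name select_precursors, the tuple layout of a peak, and the threshold values are hypothetical and chosen only for illustration; actual instrument control software may apply additional criteria not described here.

```python
from typing import List, Tuple

# Hypothetical peak record: (m/z, intensity, signal-to-noise ratio).
Peak = Tuple[float, float, float]


def select_precursors(
    peaks: List[Peak],
    min_intensity: float,
    min_snr: float,
    top_n: int = 3,
) -> List[float]:
    """Return the m/z values of up to top_n ions that qualify for MS/MS."""
    # Keep only ions whose intensity and signal-to-noise ratio both
    # exceed the preset thresholds.
    qualified = [p for p in peaks if p[1] > min_intensity and p[2] > min_snr]
    # Rank the qualifying ions from most to least abundant.
    qualified.sort(key=lambda p: p[1], reverse=True)
    # Only the top_n most abundant ions are sent on to product ion scans.
    return [mz for mz, _, _ in qualified[:top_n]]


# Illustrative full-scan peak list (made-up values).
full_scan = [
    (445.12, 1200.0, 15.0),
    (512.30, 8500.0, 90.0),
    (623.85, 300.0, 4.0),
    (701.44, 4300.0, 55.0),
    (812.90, 2600.0, 30.0),
    (930.01, 5100.0, 2.0),
]
selected = select_precursors(full_scan, min_intensity=1000.0, min_snr=10.0)
print(selected)  # [512.3, 701.44, 812.9]
```

In this example the ion at m/z 445.12 passes both thresholds but is still not fragmented, which illustrates how qualified peptides can be missed when only three ions per full scan are selected.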
The success of LC/MS/MS for identification of proteins is largely due to its many outstanding analytical characteristics. Firstly, it is quite a robust technique with excellent reproducibility. It has been demonstrated to be reliable for high throughput analysis for protein identification. Secondly, when using nanospray ionization, the technique delivers quality MS/MS spectra of peptides at sub-femtomole levels. Thirdly, the MS/MS spectra carry sequence information of both C-terminal and N-terminal ions. This valuable information can be used not only for identification of proteins, but also for pinpointing what post translational modifications (PTM) have occurred to the protein and at which amino acid residue the PTMs take place.
For the total protein digest from an organism, a cell line, or a tissue type, LC/MS/MS alone is not sufficient to produce a sufficient number of good quality MS/MS spectra for the identification of the proteins. Therefore, LC/MS/MS is usually employed to analyze digests of a single protein or a simple mixture of proteins, such as the proteins separated by two dimensional electrophoresis (2DE), adding a minimum of a few days to the total analysis time, to the instrument and equipment cost, and to the complexity of sample handling and the informatics need for sample tracking. While a full MS scan can and typically does contain rich information about the sample, the current LC/MS/MS methodology relies on the MS/MS analysis that can be afforded for only a few ions in the full MS scan. Moreover, electrospray ionization (ESI) used in LC/MS/MS has less tolerance towards salt concentrations from the sample, requiring rigorous sample clean up steps.
Identification of the proteins in an organism, a cell line, and a tissue type is an extremely challenging task, due to the sheer number of proteins in these systems (estimated at thousands or tens of thousands). The development of LC/LC/MS/MS technology (Link, A. J. et al. Nat. Biotechnol. 17, 676-682 (1999); Washburn, M. P. et al, Nat. Biotechnol. 19, 242-247 (2001)) is one attempt to meet this challenge by going after one extra dimension of LC separation. This approach begins with the digestion of the whole protein mixture and employs a strong cation exchange (SCX) LC to separate protein digests by a stepped gradient of salt concentrations. This separation usually takes 10-20 steps to turn an extremely complex protein mixture into a relatively simplified mixture. The mixtures eluted from the SCX column are further introduced into a reverse phase LC and subsequently analyzed by mass spectrometry. This method has been demonstrated to identify a large number of proteins from yeast and the microsome of human myeloid leukemia cells.
One of the obvious advantages of this technique is that it avoids insolubility problems in 2DE, as all the proteins are digested into peptide fragments which are usually much more soluble than proteins. As a result, more proteins can be detected and a wider dynamic range achieved with LC/LC/MS/MS. Another advantage is that chromatographic resolution increases tremendously through the extensive 2D LC separation so that more high quality MS/MS spectra of peptides can be generated for more complete and reliable protein identification. The third advantage is that this approach is readily automated within the framework of current LC/MS systems for potentially high throughput proteomic analysis.
The extensive 2D LC separation in LC/LC/MS/MS, however, could take 1-2 days to complete. In addition, this technique alone is not able to provide quantitative information on the proteins identified, and a quantitative scheme such as ICAT would require extra time and effort with sample loss and extra complications. In spite of the extensive 2D LC separation, there are still a significant number of peptide ions not selected for MS/MS experiments due to the time constraint between the MS/MS data acquisition and the continuous LC elution, resulting in low sequence coverage (25% coverage is already considered very good). While recent development in depositing LC traces onto a solid support for later MS/MS analysis can potentially address the limited MS/MS coverage issue, it would introduce significantly more sample handling and protein loss and further complicate the sample tracking and information management tasks.
Matrix-Assisted Laser Desorption Ionization (MALDI) utilizes a focused laser beam to irradiate the target sample that is co-crystallized with a matrix compound on a conductive sample plate. The ionized molecules are usually detected by a time of flight (TOF) mass spectrometer, because MALDI and TOF are both pulsed techniques.
MALDI/TOF is commonly used to detect 2DE separated intact proteins because of its excellent speed, high sensitivity, wide mass range, high resolution, and tolerance of contaminants. MALDI/TOF with delayed extraction and reflecting ion optics can achieve impressive mass accuracy of 1-10 ppm and mass resolution (m/Δm) of 10000-15000 for the accurate analysis of peptides. However, the lack of MS/MS capability in MALDI/TOF is one of the major limitations for its use in proteomics applications. Post Source Decay (PSD) in MALDI/TOF does generate sequence-like MS/MS information for peptides, but the operation of PSD often is not as robust as that of a triple quadrupole or an ion trap mass spectrometer. Furthermore, PSD data acquisition is difficult to automate as it can be peptide-dependent.
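For readers less familiar with these figures of merit, the following minimal sketch shows how mass accuracy in ppm and mass resolution m/Δm are commonly computed. The function names are hypothetical, and taking Δm as the full width at half maximum (FWHM) of the peak is an assumption; the text above does not state which peak-width definition is meant.

```python
def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def resolving_power(mz: float, peak_width: float) -> float:
    """Mass resolution m/dm, with dm assumed here to be the peak FWHM."""
    return mz / peak_width


def ppm_window(mz: float, ppm: float) -> float:
    """Absolute m/z half-width corresponding to a ppm tolerance."""
    return mz * ppm * 1e-6


# Illustrative values: a peptide ion expected at m/z 1000.0000 that is
# measured at m/z 1000.0050 with a peak 0.1 m/z units wide.
error = ppm_error(1000.0050, 1000.0000)
resolution = resolving_power(1000.0, 0.1)
tolerance = ppm_window(1000.0, 10.0)
print(f"error = {error:.2f} ppm")
print(f"resolution = {resolution:.0f}")
print(f"10 ppm at m/z 1000 = +/-{tolerance:.3f} m/z units")
```

At m/z 1000, a 10 ppm tolerance corresponds to only 0.01 m/z units, which is an order of magnitude tighter than the 0.1 amu figure often associated with unit mass resolution instruments.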
The newly developed MALDI TOF/TOF system (Rejtar, T. et al., J. Proteome Res. 1(2) 171-179 (2002)) delivers many attractive features. The system consists of two TOFs and a collision cell, which is similar to the configuration of a tandem quadrupole system. The first TOF is used to select precursor ions that undergo collision-induced dissociation (CID) in the cell to generate fragment ions. Subsequently, the fragment ions are detected by the second TOF. One of the attractive features is that TOF/TOF is able to perform as many data dependent MS/MS experiments as necessary, while a typical LC/MS/MS system selects only a few abundant ions for the experiments. This unique development makes it possible for TOF/TOF to perform industry scale proteomic analysis. The proposed solution is to collect fractions from 2D LC experiments and spot the fractions onto a MALDI plate for MS/MS. As a result, more MS/MS spectra can be acquired for more reliable protein identification by database search, as the quality of MS/MS spectra generated by high-energy CID in TOF/TOF is far better than that of PSD spectra.
The major drawbacks of this approach are the high cost of the instrument ($750,000), the lengthy 2D separations, the sample handling complexities with LC fractions, the cumbersome sample preparation processes for MALDI, the intrinsic difficulty in quantification with MALDI, and the huge informatics challenges for data and sample tracking. Due to the LC separation and the sample preparation time required, the analysis of several hundred proteins in one sample would take at least 2 days.
It is well recognized that Fourier-Transform Ion-Cyclotron Resonance (FTICR) MS is a powerful technique that can deliver high sensitivity, high mass resolution, wide mass range, and high mass accuracy. Recently, FTICR/MS coupled with LC showed impressive capabilities for proteomic analysis through Accurate Mass Tags (AMT) (Smith, R. D. et al, Proteomics, 2, 513-523 (2002)). An AMT is an m/z value of a peptide that is accurate enough to be used to exclusively identify a protein. It has been demonstrated that, using the AMT approach, a single LC/FTICR-MS analysis can potentially identify more than 10^5 proteins with mass accuracy of better than 1 ppm. Nonetheless, AMT alone may not be sufficient to pinpoint amino acid residue specific post-translational modifications of peptides. In addition, the instrument is prohibitively expensive at a cost of $750K or more with high maintenance requirements.
Protein arrays and protein chips are emerging technologies (Issaq, H. J. et al, Biochem Biophys Res Commun. 292(3), 587-592 (2002)) similar in design concept to the oligonucleotide chip used in gene expression profiling. Protein arrays consist of protein chips which contain chemically (cationic, anionic, hydrophobic, hydrophilic, etc.) or biochemically (antibody, receptor, DNA, etc.) treated surfaces for specific interaction with the proteins of interest. These technologies take advantage of the specificity provided by affinity chemistry and the high sensitivity of MALDI/TOF and offer high throughput detection of proteins. In a typical protein array experiment, a large number of protein samples can be simultaneously applied to an array of chips treated with specific surface chemistries. By washing away undesired chemical and biomolecular background, the proteins of interest are docked on the chips due to affinity capturing and hence “purified”. Further analysis of individual chips by MALDI-TOF yields the protein profiles in the samples. These technologies are ideal for the investigation of protein-protein interactions, since proteins can be used as affinity reagents to treat the surface to monitor their interaction with other specific proteins. Another useful application of these technologies is to generate comparative patterns between normal and diseased tissue samples as a potential tool for disease diagnostics.
Due to the complicated surface chemistries involved and the added complications with proteins or other protein-like binding agents such as denaturing, folding, and solubility issues, protein arrays and chips are not expected to have as wide an application as gene chips or gene expression arrays.
Thus, the past 100 years have witnessed tremendous strides made on the MS instrumentation with many different types of instruments designed and built for high throughput, high resolution, and high sensitivity work. The instrumentation has been developed to a stage where single ion detection can be routinely accomplished on most commercial MS systems with unit mass resolution allowing for the observation of ion fragments coming from different isotopes. In stark contrast to the sophistication in hardware, very little has been done to systematically and effectively analyze the massive amount of MS data generated by modern MS instrumentation.
In a typical mass spectrometer, the user is usually required to use, or is supplied with, a standard material having several fragment ions covering the mass spectral m/z range of interest. Subject to baseline effects, isotope interferences, mass resolution, and resolution dependence on mass, peak positions of a few ion fragments are determined either in terms of centroids or peak maxima through a low order polynomial fit at the peak top. These peak positions are then fit to the known peak positions for these ions through either a 1st or higher order polynomial fit to calibrate the mass (m/z) axis.
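The conventional calibration just described can be sketched as follows, under stated assumptions: a parabola through the three samples around the peak maximum stands in for the "low order polynomial fit at the peak top", a 1st order least-squares fit maps measured positions onto known positions, and the function names and numerical values are hypothetical illustrations rather than any vendor's actual procedure.

```python
from typing import Sequence, Tuple


def parabolic_peak_top(mz: Sequence[float], intensity: Sequence[float]) -> float:
    """Peak position from a parabola through the maximum and its two neighbors.

    Assumes uniformly spaced samples and at least three points.
    """
    # Highest interior sample, so that both neighbors exist.
    i = max(range(1, len(intensity) - 1), key=lambda k: intensity[k])
    left, mid, right = intensity[i - 1], intensity[i], intensity[i + 1]
    curvature = left - 2.0 * mid + right
    if curvature == 0.0:
        return mz[i]  # flat top: fall back to the sampled maximum
    # Vertex offset in units of the sample spacing.
    offset = 0.5 * (left - right) / curvature
    return mz[i] + offset * (mz[i + 1] - mz[i])


def fit_linear(measured: Sequence[float], known: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept mapping measured m/z onto known m/z."""
    n = len(measured)
    mean_x = sum(measured) / n
    mean_y = sum(known) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured, known))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# A synthetic parabolic peak whose true apex lies between two samples.
profile_mz = [99.8, 99.9, 100.0, 100.1, 100.2]
profile_int = [1.0 - (m - 100.03) ** 2 for m in profile_mz]
apex = parabolic_peak_top(profile_mz, profile_int)

# Made-up calibrant ions whose measured positions are shifted and stretched.
known_mz = [69.0, 219.0, 502.0]
measured_mz = [1.0002 * k + 0.05 for k in known_mz]
slope, intercept = fit_linear(measured_mz, known_mz)
calibrated = [slope * m + intercept for m in measured_mz]
print(apex, calibrated)
```

Note that the sketch treats each peak in isolation; it does nothing about baselines, overlapping isotopes, or mass-dependent peak shape, which are the interferences named in the paragraph above.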
After the mass axis calibration, a typical mass spectral data trace would then be subjected to peak analysis where peaks (ions) are identified. This peak detection routine is a highly empirical and compounded process where peak shoulders, noise in data trace, baselines due to chemical backgrounds or contamination, isotope peak interferences, etc., are considered.
For the peaks identified, a process called centroiding is typically applied to attempt to calculate the integrated peak areas and peak positions. Due to the many interfering factors outlined above and the intrinsic difficulties in determining peak areas in the presence of other peaks and/or baselines, this is a process plagued by many adjustable parameters that can make an isotope peak appear or disappear with no objective measures of the centroiding quality.
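The sensitivity of centroiding to its adjustable parameters can be seen in a minimal sketch. Here the centroid is taken as the intensity-weighted mean m/z of all points at or above a threshold, and the summed intensity serves as a simple proxy for the peak area; both choices, the function name, and the data are assumptions made for illustration, since commercial centroiding routines differ in detail.

```python
from typing import Sequence, Tuple


def centroid(
    mz: Sequence[float],
    intensity: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (centroid position, summed intensity) for points >= threshold."""
    kept = [(m, y) for m, y in zip(mz, intensity) if y >= threshold]
    total = sum(y for _, y in kept)
    if total == 0.0:
        raise ValueError("no points at or above the threshold")
    # Intensity-weighted mean m/z of the retained points.
    position = sum(m * y for m, y in kept) / total
    return position, total


# A made-up asymmetric peak with a tail on the high-mass side.
peak_mz = [499.6, 499.8, 500.0, 500.2, 500.4, 500.6, 500.8]
peak_int = [2.0, 30.0, 100.0, 60.0, 25.0, 12.0, 3.0]

pos_low, area_low = centroid(peak_mz, peak_int, threshold=5.0)
pos_high, area_high = centroid(peak_mz, peak_int, threshold=50.0)
print(f"threshold  5: position {pos_low:.3f}, area {area_low:.0f}")
print(f"threshold 50: position {pos_high:.3f}, area {area_high:.0f}")
```

Changing nothing but the threshold moves the reported position by roughly 0.03 m/z units and the reported area by about 30% on the same data, with no internal measure indicating which answer is better.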
Thus, despite their apparent sophistication, current approaches have several pronounced disadvantages. These include:
Lack of Mass Accuracy. The mass calibration currently in use usually does not provide better than 0.1 amu (m/z unit) in mass determination accuracy on a conventional MS system with unit mass resolution (ability to visualize the presence or absence of a significant isotope peak).
In order to achieve higher mass accuracy and reduce ambiguity in molecular fingerprinting such as peptide mapping for protein identification, one has to switch to an MS system with higher resolution such as quadrupole TOF (qTOF) or FT ICR MS which come at significantly higher cost.
Large Peak Integration Error. Due to the contribution of mass spectral peak shape, its variability, the isotope peaks, the baseline and other background signals, and the random noise, current peak area integration has large errors (both systematic and random errors) for either strong or weak mass spectral peaks.
Difficulties with Isotope Peaks. Current approaches do not have a good way to separate the contributions from various isotopes, which usually give partially overlapped mass spectral peaks on conventional MS systems with unit mass resolution. The empirical approaches used either ignore the contributions from neighboring isotope peaks or over-estimate them, resulting in errors for dominating isotope peaks and large biases for weak isotope peaks, or even in the weaker peaks being ignored completely. When ions of multiple charges are concerned, the situation becomes even worse, due to the reduced separation in m/z between neighboring isotope peaks.
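The reduced isotope spacing for multiply charged ions follows from simple arithmetic, sketched below. The sketch assumes positive ions formed by protonation and neighboring isotope peaks that differ by one 13C-for-12C substitution; the function names are hypothetical.

```python
C13_MINUS_C12 = 1.003355  # mass difference between 13C and 12C, in Da
PROTON_MASS = 1.007276    # mass of a proton, in Da


def mz_of(neutral_mass: float, charge: int) -> float:
    """m/z of a molecule of the given neutral mass carrying `charge` protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def isotope_spacing(charge: int) -> float:
    """m/z distance between neighboring isotope peaks at a given charge."""
    return C13_MINUS_C12 / charge


# The same 2000 Da peptide observed at charge states 1, 2, and 3.
for z in (1, 2, 3):
    mono = mz_of(2000.0, z)
    print(f"z={z}: monoisotopic m/z {mono:.4f}, isotope spacing {isotope_spacing(z):.4f}")
```

At charge state 3 the neighboring isotope peaks are only about 0.33 m/z units apart, so on an instrument with unit mass resolution they merge into a single unresolved envelope.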
Nonlinear Operation. The current approaches use a multi-stage disjointed process with many empirically adjustable parameters during each stage. Systematic errors (biases) are generated at each stage and propagated down to the later stages in an uncontrolled, unpredictable, and nonlinear manner, making it impossible for the algorithms to report meaningful statistics as measures of data processing quality and reliability.
Dominating Systematic Errors. In most MS applications, ranging from industrial process control and environmental monitoring to protein identification or biomarker discovery, instrument sensitivity or detection limit has always been a focus, and great efforts have been made in many instrument systems to minimize measurement error or noise contribution in the signal. Unfortunately, the peak processing approaches currently in use create a source of systematic error even larger than the random noise in the raw data, thus becoming the limiting factor in instrument sensitivity or reliability.
Mathematical and Statistical Inconsistency. The many empirical approaches used currently make the whole mass spectral peak processing inconsistent either mathematically or statistically. The peak processing results can change dramatically on slightly different data without any random noise or on the same synthetic data with slightly different noise. In other words, the results of the peak processing are not robust and can be unstable depending on the particular experiment or data collection.
Instrument-To-Instrument Variations. It has usually been difficult to directly compare raw mass spectral data from different MS instruments due to variations in the mechanical, electromagnetic, or environmental tolerances. The current ad hoc peak processing applied to the raw data only adds to the difficulty of quantitatively comparing results from different MS instruments. On the other hand, the need to compare either raw mass spectral data directly or peak processing results from different instruments or different types of instruments has grown steadily for the purpose of impurity detection or protein identification through searches in established MS libraries.
A second order instrument generates a matrix of data for each sample and can have a higher analytical power than first order instruments if the data matrix is properly structured. The most widely used proteomics instrument, LC/MS, is a typical example of a second order instrument capable of potentially much higher analytical power than what is currently achieved. Other second order proteomics instruments include LC/LC with single UV wavelength detection, 1D gel with MALDI-TOF MS detection, 1D protein arrays with MALDI MS detection, etc.
Two-dimensional gel electrophoresis (2D gel) has been widely used in the separation of proteins in complex biological samples such as cells or urine. Typically the spots formed by the proteins are stained with silver for easy identification with visible imaging systems. These spots are subsequently excised, dissolved/digested with enzymes, transported onto MALDI targets, dried, and analyzed for peptide signatures using a MALDI time-of-flight mass spectrometer.
Several complications arise from this process:
    1. The protein spots are not guaranteed to contain only single proteins, especially at extreme ends of the separation parameters (pI for charge or MW for molecular weight). This usually makes peptide searching difficult if not impossible. Additional liquid chromatography separation may be required for each excised spot, which further slows down the analysis.
    2. The conversion of the biological sample from liquid phase to solid phase (on the gel), back into liquid phase (for digestion), and finally into solid phase again (for MALDI TOF analysis) is a very cumbersome process prone to errors, carry-overs, and contaminations.
    3. Due to the sample conversion processes involved and the irreproducibility of MALDI-TOF in sampling and ionization, this analysis has been widely recognized as only qualitative and not quantitative.
Thus, in spite of their tremendous potential and clear advantages over first and zeroth order analysis, second order instruments and analysis have so far been limited to academic research where the sample is composed of a few synthetic analytes, with no sign of commercialization. There are several barriers that must be crossed in order for this approach to reach its huge potential. These include:
    a. In second order protein analysis, it is even more important to use raw profile MS scans instead of the centroid data currently used in virtually all MS applications. To maintain the bilinear data structure, successive MS scans of a particular ion eluting from LC need to have the same mass spectral peak shape (obviously at different peak heights), a critical second order structure destroyed by centroiding and de-isotoping (summing all isotope peaks into one integrated area). The sticks from centroiding data appear at different mass locations (up to 0.5 amu error) from successive MS scans of the same ion.
    b. Higher order instruments and analysis require a more robust instrument and measurement process, and artifacts such as shifts in one or two of the dimensions can severely compromise the quantitative and even the qualitative results of the analysis (Wang, Y. et al, Anal. Chem. 63, 2750 (1991); Wang, Y. et al, Anal. Chem., 65, 1174 (1993); Kiers, H. A. L. et al, J. Chemometrics 13, 275 (1999)), in spite of the recent progress made in academia (Bro, R. et al, J. Chemometrics 13, 295 (1999)). Other artifacts such as non-linearity or non-bilinearity could also lead to complications (Wang, Y. et al, J. Chemometrics, 7, 439 (1993)). Standardization and algorithmic corrections need to be developed in order to maintain the bilinearity of second order proteomics data.
    c. In many MS instruments such as quadrupole MS, the mass spectral scan time is not negligible compared to the protein or peptide elution time. Therefore, a significant skew would exist where the ions measured in one mass spectral scan come from different time points during the LC elution, similar to what has been reported for GC/MS (Stein, S. E. et al, J. Am. Soc. Mass Spectrom. 5, 859 (1994)).
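The scan skew noted in item c can be made concrete with a minimal sketch. It assumes, purely for illustration, a scanning instrument that sweeps the m/z axis linearly from low to high mass over the scan duration; the function name and all numerical values are hypothetical.

```python
def acquisition_time(
    mz: float,
    scan_start: float,
    scan_duration: float,
    mz_low: float,
    mz_high: float,
) -> float:
    """Time (s) at which a given m/z is sampled within one linear scan."""
    # Fraction of the sweep completed when this m/z is reached.
    fraction = (mz - mz_low) / (mz_high - mz_low)
    return scan_start + fraction * scan_duration


# One scan covering m/z 100-1100 in 1.0 s, starting 60.0 s into the LC run.
t_light = acquisition_time(200.0, 60.0, 1.0, 100.0, 1100.0)
t_heavy = acquisition_time(1000.0, 60.0, 1.0, 100.0, 1100.0)
skew = t_heavy - t_light
print(f"m/z 200 sampled at {t_light:.2f} s, m/z 1000 at {t_heavy:.2f} s")
print(f"skew within one scan: {skew:.2f} s")
```

Two ions reported in the same scan are thus sampled 0.8 s apart in this example, which is a noticeable fraction of a chromatographic peak that is only a few seconds wide and distorts the bilinear structure unless it is corrected.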
Thus, there exists a significant gap between where proteomics research would like to be and where it is at present.