A major goal of preventive medicine is to diagnose diseases in which early intervention can alter the outcome of the disease. One example of such a disease is pancreatic cancer, a neoplasm which kills 24,000 people in the United States every year, and whose incidence has increased threefold in the last forty years (Brand, 2001). Pancreatic cancer is highly fatal, with only 10% of patients surviving one year past the time of diagnosis (Brand, 2001). An exception to this high mortality rate is found among patients in whom the cancer is detected early; in rare cases, small carcinomas at the head of the pancreas are associated with hepatobiliary obstruction and painless jaundice, prompting radiographic studies that identify the cancer at an early, resectable stage (Brand, 2001). Thus, early detection of pancreatic cancer can have a major impact on the prognosis of the disease. Importantly, early detection of other cancers, such as cervical cancer by the Pap smear and prostate cancer by the PSA blood test, has had a substantial impact on the mortality of these diseases. Early detection is therefore a key strategy in the fight against cancer.
However, the impact of early detection is not limited to cancer. Other diseases, such as Parkinson's disease, can be ameliorated with early intervention. Patients with Parkinson's disease show delayed progression following treatment with monoamine oxidase inhibitors (Koller, 1997). Conceivably, early treatment of Parkinson's disease with this class of drugs can delay the onset of the characteristically debilitating tremor, muscle atrophy, and neurodegeneration associated with this disease.
Ten years ago, a major international effort was initiated to sequence the entire human genome, which amounts to 3.2 billion basepairs. By mid-2000, a draft of the human genome was announced, comprising a majority of the coding sequences. With the completion of the sequencing of the genomes of yeast (Mewes et al., 1997), Caenorhabditis elegans (The C. elegans Sequencing Consortium, 1998), and Drosophila (Adams et al., 2001), and the anticipation of the complete human genome being fully sequenced by 2003, there is no doubt that this vast set of knowledge will have a substantial impact on the diagnosis and treatment of human disease.
In particular, the sequencing of the human genome provides data that can be used to identify markers for early stages of disease, e.g., neoplastic and cardiovascular diseases. Once the identity of every protein in the human proteome is known, the level of each protein can be queried individually and systematically, and compared between blood samples from patients with the disease and blood samples from unafflicted individuals. Epidemiological studies comparing large numbers of afflicted patients with unafflicted individuals can lead to the identification of proteins whose expression level is increased or decreased only in the case of disease. These proteins can then be detected in routine blood assays in the general population, much in the same way that Caucasian men over 50 years and African-American men over 40 years take a yearly PSA blood test to screen for prostate cancer.
A number of methods have been developed to take advantage of the newly generated genomic data, most notably “gene-chip” technology, which provides a way to simultaneously quantitate transcript levels of numerous genes. While this method is useful, quantitation of protein levels is more relevant than quantitation of transcript levels for most experimental purposes. Indeed, in many well-documented cases, changes in mRNA levels do not correlate linearly with changes in protein levels (Gygi et al., 1999; Anderson et al., 1997); moreover, the levels of some proteins are regulated at the level of protein stability. For example, the cell cycle is crucially dependent on the stability of cyclins, which are regulated by cell-stage-specific ubiquitination. Additionally, some samples, such as serum, which are particularly useful because they are easily collected from patients for diagnostic purposes, are composed of proteins whose transcripts reside in a variety of diverse tissues, so that no single tissue's transcript levels reflect the protein content of the sample. Unfortunately, technology to assess protein levels is remarkably underdeveloped compared to DNA technologies.
The principal technique for studying protein levels involves subjecting different samples to two-dimensional gel electrophoresis (2DE) and comparing the patterns of spots that arise following staining. The identities of spots with different intensities can be determined by excising the spot, trypsinizing the eluted protein, and subjecting the peptides to MALDI/MS. One major limitation of this approach is that only proteins that are excised from the gels are examined, leaving the identities of non-excised spots unknown. Another limitation is that many proteins, especially those which are highly charged or have transmembrane segments, resolve poorly on 2DE gels. Most importantly, since only a relatively small amount of protein can be loaded onto the gels, most proteins are present at levels below the detection limit of MALDI/MS.
Mass spectrometry (MS) is a technique that is capable of generating molecular weight data for peptides and other small molecules, and tandem mass spectrometry (MS/MS) is a technique that provides sequence information for peptides. Because full-length proteins do not produce highly accurate data by MS, samples are first digested with proteases, such as trypsin, and the resulting peptide mixture is subjected to MS and MS/MS. Using data from the completed genome sequencing projects, molecular masses and sequence information obtained by MS and MS/MS can be used to identify parent proteins that were present in the original samples.
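The digest-and-match idea described above can be illustrated with a minimal sketch. It is not the method of any cited work: it assumes the commonly used simplified trypsin rule (cleavage after K or R unless the next residue is P), standard monoisotopic residue masses, and a hypothetical mass tolerance; the function names and the example sequence are likewise illustrative only.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(protein):
    """Split a sequence after K or R, except when followed by P
    (a simplified rule; missed cleavages are ignored)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_mass(observed, peptides, tol=0.01):
    """Return the predicted peptides whose mass lies within tol Da
    of an observed mass (tol is an assumed, illustrative value)."""
    return [p for p in peptides if abs(peptide_mass(p) - observed) <= tol]


# Hypothetical sequence, used only to exercise the functions.
peptides = tryptic_digest("MKRPAAKGG")
candidates = match_mass(132.0535, peptides)
```

In practice the predicted peptides would be generated for every protein encoded in a sequenced genome, and an observed mass may match many of them, which is why sequence information from MS/MS is used in addition to mass alone.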
Although an approach which employs MS is useful, the complexity of the peptide mixture increases as the number of proteins in the starting material is increased. For example, digestion of the yeast proteome, which comprises 6,000 proteins, results in 344,000 peptides. Peaks seen by MS analysis of these complex mixtures often cannot be interpreted because they represent numerous different peptides with similar molecular masses. Additionally, low abundance peptides are often invisible or lie within or adjacent to peaks of higher abundance. Typically, a liquid chromatography-electrospray mass spectrometer can resolve approximately 30,000 peptides (Gygi et al., 1999). Thus, because of the large number of peptides that would be generated, MS analysis does not appear to be feasible for higher eukaryotes using current technology.
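The scale of the problem follows from simple arithmetic on the figures quoted above. The short sketch below uses only those figures (6,000 yeast proteins, 344,000 peptides, and approximately 30,000 resolvable peptides); the variable names are illustrative.

```python
yeast_proteins = 6_000        # proteins in the yeast proteome
yeast_peptides = 344_000      # peptides produced by digesting them
resolvable_peptides = 30_000  # approximate LC-electrospray MS capacity

# Average number of peptides generated per protein.
peptides_per_protein = yeast_peptides / yeast_proteins

# Factor by which the digest exceeds what a single run can resolve.
overload_factor = yeast_peptides / resolvable_peptides

print(f"{peptides_per_protein:.1f} peptides per protein")
print(f"{overload_factor:.1f}-fold more peptides than can be resolved")
```

Each yeast protein thus yields about 57 peptides on average, and the digest exceeds the resolving capacity by roughly 11-fold, even before considering the larger proteomes of higher eukaryotes.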
Accordingly, there is a need for a method to simplify complex mixtures so that the proteins they contain can be readily identified.