DNA sequencing of the human genome has profoundly advanced our understanding of the molecular anatomy of mammalian cells. However, knowing the sequence of all the genes in a cell and extrapolating from this the probable products a cell is capable of producing is not enough. It is clear that i) not all genes are expressed to the same degree; ii) the DNA sequence does not always tell you the structure of a protein in the cases of post-transcriptional and post-translational modifications; iii) knowing the sequence of a gene tells you nothing about the control of expression; iv) control of genetic expression is extremely complicated and can vary from protein to protein; v) post-translational modification can occur without de novo protein biosynthesis; and vi) variables other than genomic DNA can be responsible for disease.
In addition, it has recently become apparent that there is a poor correlation between genetic expression of mRNA, generally measured as cDNA, and the amount of protein expressed by that mRNA. Changes in mRNA concentration are not necessarily proportional to changes in protein concentration. There are even many cases where mRNA will be up-regulated and protein concentration will not change at all. The steady state concentration of a protein can depend on the relative degree of expression from multiple genes and the activity of these gene products in the synthesis of a specific protein. Glycoproteins provide a good example. The concentration of a glycoprotein can depend on the level to which the gene coding for the polypeptide backbone is regulated, the presence of all the enzymes responsible for the synthesis and attachment of the oligosaccharide to the polypeptide, and the concentration of glycosidases and proteases that degrade the glycoprotein. For these reasons, analysis of regulation using messenger RNA-based techniques such as “DNA chips” alone is inadequate. It is clear that measuring the concentration of mRNA that codes for the polypeptide backbone may either distort or fail to recognize the total picture of how a protein is regulated. In cases where it is desirable to know how protein expression levels change, direct measurement of those levels may be needed.
Concentration and expression levels of specific proteins vary widely in cells during the life cycle, both in absolute concentration and amount relative to other proteins. Over- and under-expression are known to be indicators of genetic errors, faulty regulation, disease, or a response to drugs. However, the small number of proteins that are up- or down-regulated in response to a particular stimulus is difficult to recognize with current technology. Further, it is frequently difficult to predict which proteins are subject to regulation. The need to examine 20,000 proteins in a cell to find the small number in regulatory flux is a formidable problem. The ability to detect only the small numbers of up- or down-regulated proteins in a complex protein milieu would substantially enhance the value of proteomics.
Proteins in complex mixtures are generally detected by some type of fractionation or immunological assay technique. The advantages of immunological assay methods are their sensitivity, specificity for certain structural features of antigens, low cost, and simplicity of execution. Immunological assays are generally restricted to the determination of single protein analytes. This means that multiple assays must be conducted to determine even a small number of proteins in a sample. Hormone-receptor association, enzyme-inhibitor binding, DNA-protein binding and lectin-glycoprotein association are other types of bioaffinity that have been exploited in protein identification, but are not as widely used as immunorecognition. Although not biospecific, immobilized metal affinity chromatography (IMAC) is yet another affinity method that recognizes a specific structural element of polypeptides (J. Porath et al., Nature 258: 598-599 (1975)).
The fractionation approach to protein identification in mixtures is often more lengthy because analytes must be purified sufficiently to allow a detector to recognize specific features of the protein. Properties ranging from chemical reactivity to spectral characteristics and molecular mass have been exploited for detection. Higher degrees of purification are required to eliminate interfering substances as the detection mode becomes less specific. Since no single purification mode can resolve thousands of proteins, multidimensional fractionation procedures must be used with complex mixtures. Ideally, the various separation modes constituting the multidimensional method should be orthogonal in selectivity. The two-dimensional (2D) gel electrophoresis method of O'Farrell (J. Biol. Chem. 250:4007-4021 (1975)) is a good example. The first dimension exploits isoelectric focusing while the second is based on molecular size discrimination. At the limit, 6000 or more proteins can be resolved. 2D gel electrophoresis is now widely used in proteomics where it is the objective to identify thousands of proteins in complex biological extracts.
The most definitive way to identify proteins in gels is by mass spectral analysis of peptides obtained from a tryptic digest of the excised spot. Digestion of an excised spot with trypsin typically generates about 30-200 peptides. Identification is greatly facilitated when peptide molecular mass can be correlated with tryptic cleavage fragments predicted from a genomic database. Computer-assisted mathematical deconvolution algorithms are used to identify a protein based upon its “composite peptide signature.” Proteins can also be identified by their separation characteristics alone in some cases. The advantage of 2D electrophoresis followed by tryptic mapping is that large numbers of proteins can be identified simultaneously. However, the disadvantages of the technique are (1) it is very slow and requires a large number of either manual or robotic manipulations, (2) charged isoforms are resolved whereas uncharged variants in which no new charge is introduced are not, (3) proteins must be soluble to be examined, and (4) quantification by staining is poor.
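The correlation of measured peptide masses with predicted tryptic cleavage fragments can be illustrated with a minimal sketch. It assumes the commonly used cleavage rule for trypsin (cleavage C-terminal to lysine or arginine, except before proline), no missed cleavages, no modifications, and standard monoisotopic residue masses. The function names, the 0.05 Da tolerance, and the simple count-of-matches score are hypothetical choices for illustration and are not the deconvolution algorithms referred to above.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Split a protein sequence after K or R, unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def match_score(observed_masses, sequence, tolerance=0.05):
    """Count observed masses that fall within tolerance of a predicted peptide."""
    predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(mass - p) <= tolerance for p in predicted)
        for mass in observed_masses
    )


def best_candidate(observed_masses, database):
    """Return the name of the database entry with the most matching peptides."""
    return max(database, key=lambda name: match_score(observed_masses, database[name]))
```

Here `database` is assumed to be a dictionary that maps a protein name to its sequence as predicted from genomic data. A practical search would also have to allow for missed cleavages, modified residues, and the chance of random matches.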
In addition to being used to identify proteins, 2D gel electrophoresis has also been used to assess relative changes in protein levels. The degree to which the concentration of a protein changes can be determined by staining the gel and visually observing those spots that changed. Alternatively, changes in the concentration of a protein can be quantitated with a gel scanner. A control 2D gel is required to determine the concentration of the protein before it was either up- or down-regulated. Tryptic cleavage of the excised spot and mass analysis using mass spectrometry remains necessary to identify the protein whose expression level has changed.
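The comparison of a scanned gel against its control can be sketched as a ratio of spot intensities. This is an illustration only: it assumes the scanner output has already been reduced to one normalized intensity per matched spot, and the spot identifiers, the function name, and the two-fold threshold are hypothetical.

```python
def relative_change(control, treated, threshold=2.0):
    """Flag spots whose treated/control intensity ratio crosses the threshold.

    control, treated: dicts mapping a spot identifier to a positive intensity.
    Returns a dict mapping each flagged spot to ("up" or "down", ratio).
    """
    changes = {}
    for spot, control_intensity in control.items():
        treated_intensity = treated.get(spot)
        # Skip spots that are missing or have non-positive intensities,
        # since no meaningful ratio can be formed for them.
        if treated_intensity is None or control_intensity <= 0 or treated_intensity <= 0:
            continue
        ratio = treated_intensity / control_intensity
        if ratio >= threshold:
            changes[spot] = ("up", ratio)
        elif ratio <= 1.0 / threshold:
            changes[spot] = ("down", ratio)
    return changes
```

A spot flagged in this way would still have to be excised, digested, and mass analyzed to identify the protein it contains.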
Promising new techniques are emerging that replace 2D gel electrophoresis. Most involve some combination of high performance liquid chromatography (HPLC) or capillary electrophoresis (CE) with mass spectrometry (MS) to either create a “virtual 2D gel” or go directly to the peptide level of analysis by tryptically digesting all the proteins in samples as the initial step of analysis. The use of multidimensional chromatography (MDC) to identify proteins in a complex mixture is faster, easier to automate, and couples more readily to MS than 2D gel electrophoresis. One of the more attractive features of chromatographic systems is that they allow many dimensions of analysis to be coupled by analyte transfer between dimensions through automated valve switching. A recent report of an integrated six-dimensional analytical system in which serum hemoglobin was purified and sequenced automatically in <2 hours is an example (F. Hsieh et al., Anal. Chem. 68:455 (1996)). Subsequent to purification on an immunoaffinity column, hemoglobin was desorbed into an ion-exchange column for buffer exchange and then tryptically digested by passage through an immobilized trypsin column. Peptides eluting from the immobilized enzyme column were concentrated and desalted on a small, low-surface-area reversed-phase liquid chromatography (RPLC) column and then transferred to an analytical RPLC column where they were separated and introduced into a mass spectrometer through an electrospray interface. Identification at the primary structure level was achieved by a combination of chromatographic properties and multidimensional mass spectrometry of the tryptic peptides. The ability of the immunosorbent to rapidly select the desired analyte was a great asset to this analysis. Size-exclusion or ion-exchange chromatography coupled to reversed-phase chromatography are other examples of multidimensional systems, albeit of lower selectivity than those using an immunosorbent.
Although the methods described above are highly selective and widely used, they have some attributes that limit their efficacy. One is the need for proteins to be soluble before they can be analyzed. This can be a serious limitation in the case of membrane and structural proteins that are sparingly soluble. A second is that it is desirable or even necessary in some cases for the protein analyte to be of native structure during at least part of the analysis. This is a limitation because it restricts the sample preparation protocol. Native macromolecular structures are notoriously more difficult to analyze than small molecules. The necessity for post-separation proteolysis, as in the 2D gel approach, is another limitation. Large numbers of fractions must be subjected to a 24 hour tryptic digestion protocol in the analysis of a single sample when many proteins are being identified. The tryptic digestion step is necessary because the mass of intact proteins is far less useful in searching DNA databases than that of peptides derived from the protein. And finally, pure proteins are a prerequisite for antibody preparation in all the immunorecognition methods. The preparation of antibodies to an antigen is lengthy, laborious, and costly, and many antigens have never been purified. This is particularly true of proteins predicted by genomic data alone. Purification is complicated by the fact that one does not know the degree to which a protein is expressed, whether it is part of a multisubunit complex, or if it is post-translationally modified.
Additionally, there is the issue of quantification. Measuring either the relative abundance of proteins or changes in protein concentration remains a major challenge in proteomics. Improved methods for protein identification, quantification, and detection of regulatory (or relative) changes in proteins, especially for the identification and quantification of proteins within a complex mixture, are clearly needed to advance the new science of proteomics.