Genomic technology has advanced to a point at which, in principle, it has become possible to determine complete genomic sequences and to quantitatively measure the mRNA levels for each gene expressed in a cell. For some species the complete genomic sequence has now been determined, and for one strain of the yeast Saccharomyces cerevisiae, the mRNA levels for each expressed gene have been precisely quantified under different growth conditions (Velculescu et al., Cell 88:243–251 (1997)). Comparative cDNA array analysis and related technologies have been used to determine induced changes in gene expression at the mRNA level by concurrently monitoring the expression level of a large number of genes (in some cases all the genes) expressed by the investigated cell or tissue (Shalon et al., Genome Res 6:639–645 (1996)). Furthermore, biological and computational techniques have been used to correlate specific function with gene sequences. The interpretation of the data obtained by these techniques in the context of the structure, control and mechanism of biological systems has been recognized as a considerable challenge. In particular, it has been extremely difficult to explain the mechanism of biological processes by genomic analysis alone.
Proteins are essential for the control and execution of virtually every biological process. The rate of synthesis and the half-life of proteins, and thus their expression level, are also controlled post-transcriptionally. Furthermore, the activity of proteins is frequently modulated by post-translational modifications, in particular protein phosphorylation, and depends on the association of the protein with other molecules including DNA and proteins. Neither the level of expression nor the state of activity of proteins is therefore directly apparent from the gene sequence or even the expression level of the corresponding mRNA transcript. It is therefore essential that a complete description of a biological system include measurements that indicate the identity, quantity and the state of activity of the proteins which constitute the system. The large-scale (ultimately global) analysis of proteins expressed in a cell or tissue has been termed proteome analysis (Pennington et al., Trends Cell Bio 7:168–173 (1997)).
At present no protein analytical technology approaches the throughput and level of automation of genomic technology. The most common implementation of proteome analysis is based on the separation of complex protein samples, typically by two-dimensional gel electrophoresis (2DE), and the subsequent sequential identification of the separated protein species (Ducret et al., Prot Sci 7:706–719 (1998); Garrels et al., Electrophoresis 18:1347–1360 (1997); Link et al., Electrophoresis 18:1314–1334 (1997); Shevchenko et al., Proc Natl Acad Sci USA 93:14440–14445 (1996); Gygi et al., Electrophoresis 20:310–319 (1999); Boucherie et al., Electrophoresis 17:1683–1699 (1996)). This approach has been assisted by the development of powerful mass spectrometric techniques and of computer algorithms which correlate protein and peptide mass spectral data with sequence databases and thus rapidly identify proteins (Eng et al., J Am Soc Mass Spectrom 5:976–980 (1994); Mann and Wilm, Anal Chem 66:4390–4399 (1994); Yates et al., Anal Chem 67:1426–1436 (1995)). This technology (two-dimensional mass spectrometry) has reached a level of sensitivity which now permits the identification of essentially any protein which is detectable by conventional protein staining methods including silver staining (Figeys and Aebersold, Electrophoresis 19:885–892 (1998); Figeys et al., Nature Biotech 14:1579–1583 (1996); Figeys et al., Anal Chem 69:3153–3160 (1997); Shevchenko et al., Anal Chem 68:850–858 (1996)). However, the sequential manner in which samples are processed limits the sample throughput, the most sensitive methods have been difficult to automate, and low abundance proteins, such as regulatory proteins, escape detection without prior enrichment, thus effectively limiting the dynamic range of the technique. In the 2DE/(MS)n method, proteins are quantified by densitometry of stained spots in the 2DE gels.
The development of methods and instrumentation for automated, data-dependent electrospray ionization (ESI) tandem mass spectrometry (MS)n in conjunction with microcapillary liquid chromatography (μLC) and database searching has significantly increased the sensitivity and speed of the identification of gel-separated proteins. As an alternative to the 2DE/(MS)n approach to proteome analysis, the direct analysis by tandem mass spectrometry of peptide mixtures generated by the digestion of complex protein mixtures has been proposed (Dongré et al., Trends Biotechnol 15:418–425 (1997)). μLC-MS/MS has also been used successfully for the large-scale identification of individual proteins directly from mixtures without gel electrophoretic separation (Link et al., Nat Biotech 17:676–682 (1999); Opitek et al., Anal Chem 69:1518–1524 (1997)). While these approaches accelerate protein identification, the quantities of the analyzed proteins cannot be easily determined, and these methods have not been shown to substantially alleviate the dynamic range problem also encountered by the 2DE/(MS)n approach. Therefore, low abundance proteins in complex samples are also difficult to analyze by the μLC-MS/MS method without their prior enrichment.
It is therefore apparent that current technologies, while suitable for identifying a portion of the components of protein mixtures, are capable of measuring neither the quantity nor the state of activity of the proteins in a mixture. Even improvements of the current approaches are unlikely to advance their performance sufficiently to make routine quantitative and functional proteome analysis a reality.
This invention provides methods and reagents that can be employed in proteome analysis and which overcome the limitations inherent in traditional techniques. The basic approach described can be employed for the quantitative analysis of protein expression in complex samples (such as cells, tissues, and fractions thereof), the detection and quantitation of specific proteins in complex samples, and the quantitative measurement of specific enzymatic activities in complex samples.
In this regard, a multitude of analytical techniques are presently available for clinical and diagnostic assays which detect the presence, absence, deficiency or excess of a protein or protein function associable with a normal or disease state. While these techniques are quite sensitive, they do not necessarily provide chemical separation of products and may, as a result, be difficult to use for assaying several proteins or enzymes simultaneously in a single sample. Current methods may not distinguish among aberrant expression of different enzymes or their malfunctions which lead to a common set of clinical symptoms. The methods and reagents herein can be employed in clinical and diagnostic assays for simultaneous (multiplex) monitoring of multiple proteins and protein reactions.
Complex mixtures of proteins give rise to even more complex mixtures of peptides after proteolytic digestion. One way to reduce this complexity is to label a particular amino acid and then enrich for only those peptides containing the labeled amino acid. One good example of a selective peptide label is the use of iodoacetamido functional groups to specifically react with cysteine residues. Approximately 85–90% of all proteins contain at least one cysteine residue, which makes the labeling method applicable to almost all proteins present in a complex mixture. We have designed trifunctional synthetic peptide-based reagents that can be used for reducing the complexity of peptide mixtures by labeling peptides with iodoacetamido groups and then selectively enriching only those peptides containing labeled cysteine residues.
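The complexity reduction described above can be illustrated with a short in silico sketch. It is not the method of the invention, only a simplified model under stated assumptions: the protein sequence is a made-up example, digestion follows a simplified trypsin rule (cleavage after K or R unless the next residue is P), and "enrichment" is modeled simply as keeping the peptides that contain a cysteine (C). The function names are hypothetical.

```python
def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when the next residue is P.

    This is a simplified trypsin rule with no missed cleavages (an assumption).
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal peptide when the sequence does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


def cysteine_peptides(peptides):
    """Model enrichment: keep only peptides containing at least one cysteine."""
    return [p for p in peptides if "C" in p]


# Hypothetical sequence chosen for illustration only.
example_sequence = "MKTCAYRGLLKPEWCDKSSR"
all_peptides = tryptic_digest(example_sequence)
enriched = cysteine_peptides(all_peptides)

print(all_peptides)  # ['MK', 'TCAYR', 'GLLKPEWCDK', 'SSR']
print(enriched)      # ['TCAYR', 'GLLKPEWCDK']
```

In this toy case two of the four peptides are retained while the protein remains represented in the enriched pool, which is the sense in which cysteine-selective labeling simplifies a digest without losing coverage of cysteine-containing proteins.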