In the post-genomics era, the direct analysis of large numbers of proteins in complex biological samples (proteomics) is developing very quickly and becoming a favored method of augmenting, and in some cases replacing, mRNA expression profiling. The traditional approach to proteomics follows these general steps: (i) extract proteins from samples, (ii) separate proteins, (iii) enzymatically digest proteins to produce tryptic fragments, (iv) analyze the resulting peptide mixture with mass spectrometry (MS), liquid chromatography (LC)/MS or LC/MS/MS to identify the mass and, if possible, the sequence of the peptides, and (v) identify proteins by comparing the identified peptides with a database of hypothetically generated digests.
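Steps (iii) through (v) can be illustrated with a minimal sketch. It assumes the common trypsin rule (cleavage after lysine or arginine unless the next residue is proline), standard monoisotopic residue masses, and no missed cleavages or modifications. The function names, the toy sequences and the 0.01 amu tolerance are hypothetical and serve only for illustration; they are not taken from any of the methods described here.

```python
# Monoisotopic residue masses (amu) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(seq):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def identify(observed_masses, database, tol=0.01):
    """Count, per protein, how many observed masses match its theoretical digest."""
    hits = {}
    for name, seq in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(seq)]
        n = sum(any(abs(m - t) <= tol for t in theoretical)
                for m in observed_masses)
        if n:
            hits[name] = n
    return hits


# Toy database with made-up sequences (hypothetical, for illustration only).
database = {"protA": "MKAPRGKPLR", "protB": "GGGKAAAR"}
observed = [peptide_mass("APR"), peptide_mass("GKPLR")]
print(tryptic_digest(database["protA"]))
print(identify(observed, database))
```

In practice, database search tools also score fragment-ion (MS/MS) spectra rather than relying on peptide masses alone; the sketch only shows the matching principle.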
The scope of a comprehensive characterization of all the proteins in a given sample is overwhelming. For the human genome, which contains approximately 35,000 genes, there are 100,000–500,000 potentially expressed gene products (proteins), because each gene can give rise to multiple proteins via alternative gene splicing and post-translational modifications. In general these never all appear in the same sample, so the absolute complexity of realistic proteomics samples may be reduced to 30,000–50,000 proteins in a given sample. This, coupled with the fact that protein identification currently depends on the identification of a number of specific and unique proteolytic digest fragments for each protein (approximately 50 peptides per protein in mammals), further complicates the analysis and leads to potentially hundreds of thousands of peptides that need to be separated and analyzed by mass spectrometry. This daunting sample complexity has led to a number of strategies that are based on reducing the number of analytes which need to be characterized in a given sample by (i) selectively fractionating the sample to a specific subset of interest or (ii) minimizing the number of proteolytically generated peptides required to identify a specific protein.
In genomic analysis, DNA microarrays are used extensively to quantitatively profile the transcription of mRNA, with mRNA serving as a surrogate for protein expression. Because of the relative chemical and structural homogeneity of nucleic acids, it is much simpler to develop analytical approaches to look at large numbers of different sequences simultaneously. Additionally, amplification techniques such as PCR allow extremely sensitive detection. This has led to the availability of DNA microarrays spanning whole genomes. Furthermore, the use of mRNA ignores the complexities introduced by post-translational modifications of proteins, vastly reducing the number of analytes to be characterized. In some research contexts, the post-translational modifications can be considered extraneous and uninformative. However, the use of mRNA expression profiling has a number of intrinsic disadvantages. Use of mRNA as a surrogate for protein expression disregards the lack of correlation between mRNA and protein concentrations, alternative splicing and mRNA modification, protein post-translational modifications and protein degradation.
The conventional approach to proteomics is the separation of all proteins in a given proteome by two-dimensional gel electrophoresis (2DGE), followed by spot excision, digestion and identification of the proteins by MS or MS/MS. This approach offers extremely high separation power, the high sensitivity of MS and a well-established technological base. Sufficient research has also been done to validate methods for dealing with a wide range of samples and biological contexts. Two-dimensional gel electrophoresis has practical disadvantages because it is relatively slow and labor intensive and shows poor quantitative performance in terms of reproducibility and linearity. Furthermore, the amount of coverage is limited by instrument capabilities and required MS throughput. Automation and improved instrument design can potentially overcome these problems. There is, however, a more fundamental, intrinsic limitation in 2DGE analysis: because of the imaging sensitivity and loading capacity of the gel media, the analysis is biased towards identification of the most highly expressed proteins. If enough total protein sample is loaded onto a gel to allow adequate representation of the proteins expressed at the lowest levels, either the more highly expressed proteins will precipitate in the gel due to overloading, or the signal from the high-level proteins will be so high that fainter spots are undetectable. This has resulted in a trend towards alternative technologies.
Multidimensional Liquid Chromatography (MDLC) combined with MS can be considered very similar to 2DGE-MS, with two significant differences. The separation power of MDLC is probably not as high as that of 2DGE, although improvements in technology may improve this situation. More significant is the fact that MDLC does not face the same biasing effect as discussed above for 2DGE. When combined with techniques for isolating specific proteome fractions, MDLC is a very promising approach; however, in the absence of pre-fractionation, MDLC will suffer from the problems of dealing with extremely complex samples, which require a large amount of data analysis to extract information relevant to a specific research objective.
Aebersold et al. have developed a method called Isotope Coded Affinity Tags (ICAT) in which samples are derivatized with a cysteine-specific reagent that exists in a heavy and a light form. See, e.g., Gygi et al., Nat. Biotech., 17 (10): 994–99 (1999); PCT Publication No. WO 00/11208 (Aug. 25, 1999); see also U.S. Pat. No. 5,721,099 (Jun. 7, 1995). In ICAT, after tagging, the samples are pooled and proteolytically digested, and the tagged fragments are then isolated by a biotin/streptavidin affinity interaction using a biotin functionality that is also part of the ICAT reagent. The resulting peptides are analyzed by LC-MS or matrix assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOFMS), and the relative quantitative expression levels of the two samples can be determined from the ratio of the abundances of the heavy and light forms of each peptide. This approach has generated a great deal of interest in the research community and has a number of potential advantages compared to the brute force methods described above. Specifically, the approach leads to a significant simplification of the proteome by only requiring the analysis of cysteine-containing peptides. The method also allows accurate relative quantitative characterization by using a control sample as the internal standard for every peptide.
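The heavy/light ratio calculation can be sketched as follows. This is a simplified illustration, not the published ICAT data processing: it assumes one tagged cysteine per peptide, deconvoluted neutral masses, and a heavy/light mass difference of 8 amu. That value is an assumption not stated in the text above and should be replaced with the actual shift of the reagent in use. The function name, peak list and tolerance are hypothetical.

```python
def pair_ratios(peaks, shift=8.0, tol=0.02):
    """Pair light and heavy peaks that differ by `shift` amu.

    peaks: list of (neutral_mass, intensity) tuples.
    shift: assumed heavy/light mass difference per tag (hypothetical default).
    Returns a list of (light_mass, heavy_intensity / light_intensity).
    """
    ratios = []
    for light_mass, light_int in peaks:
        for heavy_mass, heavy_int in peaks:
            matches_shift = abs((heavy_mass - light_mass) - shift) <= tol
            if matches_shift and light_int > 0:
                ratios.append((light_mass, heavy_int / light_int))
    return ratios


# Made-up peak list: one light/heavy pair and one unpaired peak.
peaks = [(1000.0, 200.0), (1008.0, 100.0), (1500.0, 50.0)]
print(pair_ratios(peaks))
```

Because both forms of a peptide are measured in the same run, the ratio is largely independent of run-to-run variation, which is the sense in which the control sample acts as an internal standard.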
However, the ICAT method makes a number of assumptions that may or may not be justified. The first assumption is that every protein includes a cysteine that can be derivatized. In S. cerevisiae, for example, based on genomic sequences, 8% of the proteins do not contain cysteine residues. A second assumption is that information about a single peptide (mass and partial sequence information) is sufficient to identify a protein by comparison with a database. There are some practical limitations as well. Users have reported problems with, or dissatisfaction with, the reproducibility, linearity, cost and ease of use of the ICAT approach. Moreover, the ICAT approach does not introduce any selectivity with respect to targeted analysis. Although the technique simplifies the proteome, it does so on the basis of selecting only peptides with a specific residue rather than a specific characteristic of interest, such as function. Another potential disadvantage of the ICAT approach is that the coupling of the isotope coding with the affinity tagging limits the flexibility of this technology to adapt to new applications.
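The first assumption can be checked against any set of predicted protein sequences. The following is a minimal sketch; the function name and the toy sequences are hypothetical, and the 8% figure quoted above would require the full set of S. cerevisiae sequences, which is not reproduced here.

```python
def fraction_without_cysteine(proteins):
    """Fraction of proteins whose sequence contains no cysteine ('C').

    proteins: dict mapping a protein name to its one-letter amino acid sequence.
    """
    if not proteins:
        raise ValueError("no sequences supplied")
    missing = sum(1 for seq in proteins.values() if "C" not in seq.upper())
    return missing / len(proteins)


# Made-up sequences for illustration only.
toy = {"p1": "MKCR", "p2": "MKAR", "p3": "ACDK", "p4": "CCCC"}
print(fraction_without_cysteine(toy))
```

Proteins counted by this function yield no cysteine-containing peptides on digestion and would therefore be invisible to a cysteine-specific tag.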
Smith et al. have developed a similar technique termed Phosphoprotein Isotope Coded Affinity Tag (PhIAT). See Goshe et al., Anal. Chem., 73: 2578–86 (2001); see also Weckwerth et al., Rapid Commun. Mass Spectrom. 14: 1677–81 (2000). Briefly, PhIAT is closely related to the ICAT cysteine-labeling approach, differing from ICAT in that PhIAT is designed to enrich and quantify differences in the O-phosphorylation states of proteins.
Phosphorylation is a major protein post-translational modification, which is involved in the modulation of protein activity and the propagation of signals within cellular pathways and networks. Serine, threonine and tyrosine are the hydroxy amino acids that can typically undergo phosphorylation. Lysine, arginine and cysteine can also be phosphorylated, but to a much smaller degree. PhIAT does not currently work for tyrosyl phosphorylation. Although 99% of the phosphorylated peptides from the yeast proteome are modified at serine or threonine residues, this does not diminish the importance of tyrosine phosphorylation. As such, it is important to expand PhIAT to include tyrosyl phosphorylation.
A more general approach to expression proteomics has been described by Fenselau et al. in which two 18O labels are introduced universally into the carboxyl termini of each peptide by carrying out proteolytic digestion in 18O-enriched water. See Yao et al., Anal. Chem., 73: 2836–42 (2001). In a similar manner to ICAT and PhIAT, the resulting peptides are quantitated by comparison with a control sample digested in normal water. The “heavy” sample will show a 4 amu mass shift over the “light” sample. This is a very attractive and simple approach. Initially, the only major disadvantage compared to the current proposed invention is that a mass difference of more than 4 amu is desirable to avoid interference from the natural isotope distribution and resolution issues with doubly charged peptides.
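The resolution concern can be made concrete with a short calculation. The sketch below uses the nominal 4 amu shift stated above and an approximate spacing of 1.00335 amu between successive natural isotope peaks (the 13C/12C mass difference); the function names are hypothetical.

```python
C13_SPACING = 1.00335  # approximate 13C - 12C mass difference, in amu


def mz_separation(mass_shift_amu, charge):
    """Separation on the m/z axis between the light and heavy forms."""
    return mass_shift_amu / charge


def overlapping_isotope_peak(mass_shift_amu):
    """Index n of the light peptide's natural isotope peak (M+n) that falls
    closest to the monoisotopic peak of the heavy form."""
    return round(mass_shift_amu / C13_SPACING)


for z in (1, 2, 3):
    print(f"charge {z}: light/heavy separation = {mz_separation(4.0, z):.2f} m/z")
print("heavy monoisotopic peak lies near light M+", overlapping_isotope_peak(4.0))
```

For a doubly charged peptide the two forms are only 2 m/z units apart, and the heavy monoisotopic peak sits near the M+4 isotope peak of the light form, whose relative abundance grows with peptide size. A larger mass shift moves the heavy form clear of the light isotope envelope.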
Regnier et al. described an approach termed “Signature Peptides.” See Geng, J. Chrom. A, 870: 295–313 (2000); see also U.S. Publication No. U.S. 2002/0037532 A1 (Mar. 28, 2002). The main focus of this work was not labeling, though there was mention of acylating primary amino groups with N-acetoxysuccinimide. The “heavy” tag introduced in this case resulted in only a 3 amu mass shift. In the description of this approach, it appeared that this internal standard labeling was applied only to a small subset of peptides; however, how these were selected was not clear. In this approach, peptides that contain only a C-terminal lysine lost all positive charge and, consequently, had to be analyzed by negative ion mode mass spectrometry (in this case by MALDI-TOFMS). Lack of positive charge would also have some effect on the chromatographic separation characteristics.