This invention relates generally to proteome analysis and more specifically to methods for transferring functional groups to molecules in a sample for analysis and quantitation of the molecules.
The classical biochemical approach to studying biological processes has been based on the purification to homogeneity, by sequential fractionation and assay cycles, of the specific activities that constitute a process; the detailed structural, functional and regulatory analysis of each isolated component; and the reconstitution of the process from the isolated components. The Human Genome Project and other genome sequencing programs are turning out in rapid succession the complete genome sequences of specific species and thus, in principle, the amino acid sequence of every protein potentially encoded by that species. It is to be expected that this information resource, unprecedented in the history of biology, will enhance traditional research methods and catalyze progress in fundamentally different research paradigms, one of which is proteomics.
Efforts to sequence the entire human genome along with the genomes of a number of other species have been extraordinarily successful. The genomes of 46 microbial species (TIGR Microbial Database; www.tigr.org) have been completed and the genomes of over one hundred twenty other microbial species are in the process of being sequenced. Additionally, the more complex genomes of eukaryotes, in particular those of the genetically well characterized unicellular organism Saccharomyces cerevisiae and the multicellular species Caenorhabditis elegans and Drosophila melanogaster, have been sequenced completely. Furthermore, “draft sequences” of the rice, human and Arabidopsis genomes have been published. Even in the absence of complete genomic sequences, rich DNA sequence databases have been made publicly available, including those containing over 2.1 million human and over 1.2 million murine expressed sequence tags (ESTs).
ESTs are stretches of approximately 300 to 500 contiguous nucleotides representing partial gene sequences that are generated by systematic single-pass sequencing of the clones in cDNA libraries. On the timescale of most biological processes, with the notable exception of evolution, the genomic DNA sequence can be viewed as static, and a genomic sequence database therefore represents an information resource akin to a library. Intensive efforts are underway to assign “function” to individual sequences in sequence databases. This is attempted by the computational analysis of linear sequence motifs or higher order structural motifs that indicate a statistically significant similarity of a sequence to a family of sequences with known function, or by other means such as comparison of homologous protein functions across species. Experimental methods have also been used to determine the function of individual sequences, including gene knockouts and suppression of gene expression using antisense nucleotide technology, but these can be time-consuming and in some cases are still insufficient to allow assignment of a biological function to a polypeptide encoded by the sequence.
The proteome has been defined as the protein complement expressed by a genome. This somewhat restrictive definition implies a static nature of the proteome. In reality the proteome is highly dynamic, since the types of expressed proteins, their abundance, state of modification, subcellular locations, and interactions with other biomolecules such as polypeptides and nucleic acids are dependent on the physiological state of the cell or tissue. Therefore, the proteome can reflect a cellular state or the external conditions encountered by a cell, and proteome analysis can be viewed as a genome-wide assay to differentiate and study cellular states and to determine the molecular mechanisms that control them. Considering that the proteome of a differentiated cell is estimated to consist of thousands to tens of thousands of different types of proteins, with an estimated dynamic range of expression of at least 5 orders of magnitude, the prospects for proteome analysis appear daunting. However, the availability of DNA databases listing the sequence of every potentially expressed protein, combined with rapid advances in technologies capable of identifying the proteins that are actually expressed, now makes proteomics a realistic proposition. Mass spectrometry is one of the essential legs on which current proteomics technology stands.
Quantitative proteomics is the systematic analysis of all proteins expressed by a cell or tissue with respect to their quantity and identity. The proteins expressed in a cell, tissue, biological fluid or protein complex at a given time precisely define the state of the cell or tissue at that time. The quantitative and qualitative differences between protein profiles of the same cell type in different states can be used to understand the transitions between the respective states. Traditionally, proteome analysis was performed using a combination of high resolution gel electrophoresis, in particular two-dimensional gel electrophoresis, to separate proteins and mass spectrometry to identify proteins. This approach is sequential and tedious but, more importantly, is fundamentally limited in that biologically important classes of proteins are essentially undetectable (Gygi et al., Proc. Natl. Acad. Sci. USA 97:9390–9395 (2000)).
The completion of the genomic sequences of a number of species has catalyzed a new approach to biology typically referred to as discovery science. The essence of discovery science is the systematic and quantitative analysis of all the members of a particular class of molecules expressed by a cell or tissue. Exemplary implementations of discovery science include the systematic analysis of mRNA molecules expressed by a cell or tissue by gene expression arrays, and quantitative proteomics, the systematic analysis of the proteins contained in a biological sample. A main objective of discovery science is the description of the state of a cell or tissue (activity, pathology, stress) based on the data obtained from the systematic measurement of biomolecules, and the identification of the molecular mechanisms that control the transition of a cell from one state to another by the comparative analysis of the molecular composition of cells representing the two states. For the molecular description of a cellular state and the mechanisms controlling it, it is desirable to measure as many parameters as possible. Current expression array methods allow the systematic analysis of the mRNA molecules in a cell.
Recently, a method based on mass spectrometry and a class of reagents termed isotope coded affinity tags has been described that is suitable for systematically identifying and quantifying the proteins present in biological samples. Other properties relevant to the state of a cell, such as protein phosphorylation and other post-translational modifications, and the quantitative profiles of biomolecules other than proteins or nucleic acids, for example, lipids, second messengers and metabolites, are difficult to measure systematically and quantitatively with current technology.
Thus, there exists a need for rapid, efficient, and cost-effective methods for the analysis of molecules in a cell. The present invention satisfies this need and provides related advantages as well.