1. Field of the Invention
The present invention relates to a method of identifying proteins which are differentially expressed in a plurality of biological samples. An important component of the inventive method is digesting the proteins contained in the samples to produce peptides, followed by labeling of the peptides. The labeled peptides are then separated, and those peptides that are differentially present in the samples are identified. The identity of the peptides that are differentially present in the samples may be used to determine which protein(s) in the original biological samples were differentially expressed.
2. Background of the Invention
Of the many genes in the genome, some are expressed in virtually all cells, whereas others are expressed in cell- and tissue-specific patterns. Cancer cells express different genes as compared to normal cells. These genes encode proteins that, in turn, regulate all aspects of cell function, including those that give rise to neoplastic characteristics. In principle, analyses of expressed genes in tumor cells versus normal cells should indicate which genes and gene products are characteristic of the neoplastic phenotype. DNA microarrays and related techniques have revolutionized the approach to this problem. Indeed, this approach is expected to yield profound insights not only into the induction of neoplasia, but also into the responses of neoplastic cells to therapeutic agents. Nevertheless, the correspondence between what genes are expressed and what proteins are produced is uncertain. Recent studies indicate that the correlation between mRNA induction and elevation in protein content for several enzymes is relatively poor. Variations in protein expression may be due not only to variations in gene expression, but also to variability in mRNA stability, translational efficiency, protein stability and turnover. Thus, analyses of genomic expression patterns will not necessarily provide an accurate picture of the status of the truly functional cellular machinery: the proteins.
The term proteome was introduced by Wilkins and colleagues to describe the protein complement of the genome. It is estimated that human cells contain between 50,000 and 100,000 expressed proteins. Proteomics has emerged as a buzzword complement to genomics. Proteomics describes the study of the proteome and changes in its status. In its simplest form, proteomics is simply an exercise in "mining" samples to identify the proteins present. However, the main attraction of applied proteomics in cancer research is that it can reveal key differences between the proteomes of normal and neoplastic cells. In addition, applied proteomics will reveal unique proteins or protein expression patterns of neoplastic cells, both of which can serve the task of molecular diagnosis of cancer.
Previous work describing changes in the expression of single genes or changes in the status of single proteins provided very specific information that could be interpreted in a limited biochemical context. Although the impact of many specific changes in gene and protein regulation is understood, we now realize that the factors regulating cell growth and differentiation act in complex, interlocking pathways. Accordingly, changes in biochemical signaling pathways, networks, and regulatory cascades, rather than in single enzymes, describe how cells grow, differentiate and die. By collectively describing complex, multicomponent systems, both genomics and proteomics promise a quantum leap in our level of understanding of the biology of cancer. A crucial task of the new biology is to make mechanistic sense of these changes. First, however, it is necessary to reliably detect and describe them. DNA microarray techniques now make this possible in the context of gene expression. However, no equivalent methodology yet exists to reliably compare, characterize and define patterns of protein expression between cells and tissues.
Investigators studying proteomics are at an enormous disadvantage compared to their genome scientist colleagues. First, unlike nucleic acids, proteins do not hybridize to complementary sequences. Second, there is no protein equivalent of the polymerase chain reaction (PCR). Proteomics thus requires other means of separating proteins in complex mixtures and identifying both low- and high-abundance species. The most powerful method currently available to resolve complex protein mixtures is 2D gel electrophoresis. In this technique, proteins are resolved on the basis of some physical property (e.g., isoelectric point) in a first dimension separation, and then by molecular weight in the second dimension. Many individual proteins from complex cell extracts can be resolved on 2D gels. Although 2D gels are currently the most widely used separation tool in proteomics, it is worth noting that reverse phase HPLC, capillary electrophoresis, isoelectric focusing and related hybrid techniques also provide powerful means of resolving complex protein mixtures.
Regardless of the means by which they are resolved, proteins must next be identified, primarily on the basis of sequence information. N-terminal Edman sequencing provides useful information in many cases, except where N-terminal modifications block analysis. The state-of-the-art approach to protein identification is mass spectrometry (MS). Spots containing the proteins of interest typically are excised from gels and subjected to proteolytic digestion. The resulting peptides may be analyzed by electrospray (ESI) or matrix-assisted laser desorption ionization (MALDI) MS. Sequence information is obtained with triple quadrupole or ion trap mass analyzers by collision-induced dissociation (CID) or on time-of-flight (TOF) mass analyzers by post-source decay. In either case, the ability of MS instruments to perform MS-MS experiments allows unambiguous assignment of peptide sequence. This information then may be used with sophisticated database search programs, such as SEQUEST, to identify proteins in World Wide Web protein and nucleic acid databases from the MS-MS spectra of their peptides. This combination of separation technology, MS analysis, and database searching makes the high-throughput identification of proteins in complex mixtures possible and has been the driving force behind the recent explosive growth of the proteomics field. With the continued growth of databases, it will be possible to identify virtually all proteins from any 2D gel with these approaches. Indeed, investigators studying proteins of S. cerevisiae, in which the entire genome has been sequenced, have made excellent progress in characterizing the yeast proteome.
With few exceptions, 2D gel approaches dominate the proteomics field today. Not surprisingly, a great deal of effort has been directed at overcoming the major technical limitations of 2D gel electrophoresis. Briefly, these limitations are:
(1) difficulties in solubilizing and achieving isoelectric focusing (i.e., first-dimension) separations with proteins of a wide range of isoelectric points and solubilities,
(2) difficulties in achieving run-to-run and laboratory-to-laboratory reproducibility in 2D gel profiles,
(3) problems in resolving the many proteins typically present in the 30-100 kDa MW range, and
(4) difficulty in detecting low abundance proteins.
While progress has been made in addressing all of these problems, 2D gel technology is ultimately limited by the diversity of proteins to be analyzed, both in physical properties and abundance. Continued improvements certainly can be expected, but the 2D gel approach for proteins will ultimately prove inadequate to the demands of proteomics.
Accordingly, there remains a need for improved methods of assaying for differential expression of proteins in biological samples.
It is an object of the present invention to provide a method to identify proteins that are differentially expressed in different biological samples, e.g., cell or tissue samples.
It is another object of the invention to provide a method to identify peptides derived from proteins which are differentially expressed in different biological samples.
The present invention exploits the fact that relatively short peptide sequences (e.g., 6-mer or larger; equivalent to an 18-mer oligo) are largely unique in proteins. This means that sequence identification of one or more such peptides in a protein digest is sufficient to establish the presence of the precursor protein in a sample with a high degree of confidence. Thus, if one identifies, for example, an 8-mer peptide, a search of a protein or nucleotide sequence database will permit the sequence to be localized to a specific protein.
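As a rough illustration of why short peptides tend to be unique, the following sketch counts the possible sequences of a given length and estimates how often one particular k-mer would be expected to occur by chance. The uniform, independent-residue model, the function names, and the proteome of 100,000 proteins averaging 400 residues are assumptions made only for illustration; real proteomes have biased composition and related protein families, so the estimate is indicative at best.

```python
AMINO_ACIDS = 20   # standard amino acid alphabet
NUCLEOTIDES = 4    # A, C, G, T


def possible_peptides(k):
    """Number of distinct peptide sequences of length k."""
    return AMINO_ACIDS ** k


def possible_oligos(n):
    """Number of distinct oligonucleotide sequences of length n."""
    return NUCLEOTIDES ** n


def expected_chance_matches(k, n_proteins, mean_length):
    """Expected occurrences of one specific k-mer in a proteome.

    Assumes residues are uniform and independent (a simplification).
    """
    windows = n_proteins * max(mean_length - k + 1, 0)
    return windows / possible_peptides(k)


if __name__ == "__main__":
    # A 6-mer peptide is encoded by 18 nucleotides (3 per residue).
    print("6-mer peptides :", possible_peptides(6))   # 20**6 = 64,000,000
    print("18-mer oligos  :", possible_oligos(18))
    for k in (6, 8, 10):
        # Hypothetical proteome: 100,000 proteins, 400 residues each.
        print(k, expected_chance_matches(k, 100_000, 400))
```

Under these assumptions, the expected number of chance occurrences falls sharply as the peptide grows from a 6-mer to an 8-mer, which is consistent with using slightly longer peptides when a confident protein assignment is required.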
The present invention is based, in part, on detecting the differential expression of the same protein in two samples, or the presence of protein(s) in some, but not all, samples by analysis of peptide fragments from each sample. To that end, the method of the present invention includes digesting the proteins in two samples to a mixture of peptides and then comparing the abundances of specific peptides. A protein that is abundantly expressed in one sample will give rise to greater amounts of product peptides upon digestion than the same protein expressed in another sample at trace amounts. Thus, the task of identifying differentially expressed proteins between two samples involves 1) digestion of two samples, 2) detection and selection of peptides that are present in different amounts in the two samples, and 3) sequence analysis of the selected peptides and identification of the protein precursors.
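The selection step (step 2) can be sketched as a comparison of per-peptide signal between two samples. This is a minimal sketch, not the method of the invention: the function name, the two-fold threshold, the signal floor used for peptides absent from one sample, and the example peptide sequences and values are all hypothetical.

```python
import math


def differential_peptides(sample_a, sample_b, min_fold=2.0, floor=1.0):
    """Return peptides whose abundance differs by at least min_fold.

    sample_a, sample_b: dicts mapping peptide identifier -> abundance signal.
    floor: assumed minimum signal, so that peptides absent from one
           sample still yield a finite ratio.
    Returns a dict mapping peptide -> log2(abundance_a / abundance_b).
    """
    threshold = math.log2(min_fold)
    selected = {}
    for peptide in set(sample_a) | set(sample_b):
        a = max(sample_a.get(peptide, 0.0), floor)
        b = max(sample_b.get(peptide, 0.0), floor)
        log2_ratio = math.log2(a / b)
        if abs(log2_ratio) >= threshold:
            selected[peptide] = log2_ratio
    return selected


if __name__ == "__main__":
    # Hypothetical signals for illustration only.
    normal = {"LVNELTEFAK": 800.0, "AEFVEVTK": 100.0}
    tumor = {"LVNELTEFAK": 100.0, "AEFVEVTK": 110.0, "YLYEIAR": 50.0}
    for peptide, ratio in sorted(differential_peptides(normal, tumor).items()):
        print(peptide, round(ratio, 2))
```

A positive value marks a peptide more abundant in the first sample and a negative value one more abundant in the second; peptides found in only one sample are selected as well, matching the case of protein(s) present in some, but not all, samples.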
The objects of the present invention, and others, may be accomplished with a method of detecting peptide fragments of protein(s) that are differentially present in biological samples, by
digesting the proteins in a plurality of biological samples to produce peptides in each sample;
separating the peptides in the samples; and
identifying the peptides that are differentially present in the samples.
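The digestion step can be simulated on a known sequence to predict which peptides a protein would yield. The sketch below assumes trypsin as the protease (cleavage after lysine or arginine, except when the next residue is proline); the choice of protease, the function name, and the example sequence are illustrative assumptions rather than requirements of the method.

```python
def tryptic_digest(protein):
    """Split a protein sequence into peptides using a simple trypsin rule.

    Cleaves on the C-terminal side of K or R unless the following
    residue is P. Missed cleavages are not modeled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Remaining C-terminal peptide with no trailing K/R.
        peptides.append(protein[start:])
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence; the K before P is not cleaved.
    print(tryptic_digest("AAKPGRCDEKFG"))
```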
The objects of the present invention, and others, may also be accomplished with a method of identifying protein(s) that are differentially present in biological samples, by
detecting peptide fragments of protein(s) that are differentially present in biological samples as described above;
determining the amino acid sequence of at least a portion of the peptide fragments; and
correlating the amino acid sequences of the peptide fragments with the identity of the protein(s) that are differentially present in the samples.
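The correlating step can be pictured as a lookup of each sequenced peptide against a collection of known protein sequences. The following is a minimal sketch using exact substring matching on an in-memory dictionary; the function names, the protein names, and the sequences are hypothetical, and a practical search would run against a full protein or translated nucleotide database.

```python
def proteins_containing(peptide, database):
    """Return sorted names of database proteins that contain the peptide."""
    return sorted(name for name, seq in database.items() if peptide in seq)


def assign_proteins(peptides, database):
    """Map each peptide to the proteins it could have come from.

    A peptide matching exactly one protein localizes to that protein;
    an empty or multi-entry list signals no match or an ambiguous match.
    """
    return {p: proteins_containing(p, database) for p in peptides}


if __name__ == "__main__":
    # Toy database with made-up names and sequences.
    database = {
        "protein_A": "MKTAYIAKQRQISFVKSHFSRQ",
        "protein_B": "MSHFSRQLEERLGLIEVQAPILSR",
    }
    for peptide, hits in assign_proteins(["QISFVK", "SHFSRQ", "WWWWWW"], database).items():
        print(peptide, hits)
```

In this toy case one peptide maps to a single protein, one is shared by both proteins, and one has no match, which illustrates why sequencing more than one peptide per protein strengthens the assignment.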