The human genome contains approximately 100,000 genes; however, in any given cell type, only a fraction of these genes is expressed at any one time. Each gene is expressed at a precise time and at a precise level.
Automated DNA sequencers have made it easier to determine the sequence of the genome of an organism. The genomic sequences of Haemophilus influenzae, Mycoplasma genitalium, and Caenorhabditis elegans have been published, leading to the possibility that the genomic sequences of other higher organisms, such as humans, may be obtained (Fleischmann, R. D., et al., Science, 269:496 (1995); Fraser, C. M., et al., Science, 270:397 (1995); Hodgkin, J., et al., Science, 270:410 (1995)). However, the information derived from this technology still does not answer the question of which of these genes are expressed at any one time in any given cell. This information is crucial to determining how cells are differentiated from each other, how cells age, and the causes and effects of many diseases.
A typical mammalian cell of a given lineage expresses approximately 20,000-30,000 of the 100,000-odd germ line genes carried in its genome. Almost all cells constitutively express many of the same genes, which are called "housekeeping" genes. Examples of housekeeping genes include genes encoding enzymes involved in glycolysis or proteins involved in cell structure. However, it is the non-constitutively expressed genes that differentiate cells from each other. As cells mature into differentiated cells, certain non-constitutively expressed genes are turned on and off at different stages. Thus, the differences in gene expression patterns between cells make, for example, a nerve cell different from a blood cell.
Furthermore, the intracellular concentration of a non-constitutively expressed gene product can be modulated by the induction or repression of gene expression in response to environmental signals. Thus, the relative concentration of gene products within a given cell type can be indicative of the state of the cell.
Even within a single cell, the level of expression can vary a great deal from one gene to the next. In a typical cell, the cytoplasm contains perhaps 200,000 mRNA molecules, which represent 20,000-30,000 different transcribed sequences. A few of these transcript sequences may be present in high abundance, with thousands of copies or more per cell. For example, up to 70% of the total mRNA in an antibody-secreting plasma cell is immunoglobulin mRNA. Other transcripts, typically those of housekeeping genes such as actin or glucose-6-phosphate dehydrogenase, are present at medium abundance, with approximately 100-1,000 copies per cell. However, more than 90% of gene transcripts are present in low abundance, at a level of less than 10-15 copies per cell.
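The figures in the preceding paragraph can be checked with a short calculation. The following is a minimal sketch only: the molecule and transcript counts are taken from the text, whereas the function names and the exact class cutoffs (15 and 1,000 copies) are assumptions chosen for illustration. The text leaves the range between 15 and 100 copies unassigned; the sketch places it in the medium class for simplicity.

```python
# Figures from the text; the class cutoffs below are illustrative assumptions.
TOTAL_MRNA_PER_CELL = 200_000
DISTINCT_TRANSCRIPTS = (20_000, 30_000)


def mean_copies_per_transcript(total=TOTAL_MRNA_PER_CELL,
                               distinct=DISTINCT_TRANSCRIPTS):
    """Return the (lowest, highest) average copy number per distinct transcript."""
    fewest, most = distinct
    return total / most, total / fewest


def abundance_class(copies, low_cutoff=15, high_cutoff=1_000):
    """Assign a copy number to a 'low', 'medium', or 'high' abundance class."""
    if copies < low_cutoff:
        return "low"
    if copies <= high_cutoff:
        return "medium"
    return "high"


if __name__ == "__main__":
    low_mean, high_mean = mean_copies_per_transcript()
    print(f"mean copies per transcript: {low_mean:.1f} to {high_mean:.1f}")
    # Hypothetical transcripts, named only for this example.
    for name, copies in [("rare", 5), ("housekeeping", 300), ("abundant", 5_000)]:
        print(name, abundance_class(copies))
```

The average of roughly 7 to 10 copies per transcript is consistent with the statement that most transcripts fall in the low abundance class.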
Under abnormal cellular conditions such as those in individuals with diseases or disorders, the pattern of gene expression within individual cells may be changed compared to the expression pattern seen under normal non-disease conditions. A change in gene expression may be an effect or the cause of a disease or other abnormality, such as in, for example, a tumor cell. Whereas some diseases may be understood as caused by mutations in particular genes and thus could potentially be detected by examining the genomic sequence, many diseases and disorders involve a malfunction in the level of expression of genes which cannot be detected by sequencing the genome but can only be detected by identifying the gene expression patterns of the cells. Therefore, in order to understand the function of specific cell types in an organism or to understand the progression of disease, it is necessary to understand the expression status of individual genes within these specific cell types at different stages of the organism's development.
One way researchers have attempted to answer these questions is to compare the abundance of proteins in various cells. In one approach, proteins are purified from the cells and their abundance is compared. However, this approach is limited by difficulties in devising equally efficient methods of purifying different proteins. This approach is also limited to known proteins. In another approach, two-dimensional gel electrophoresis is used to compare protein expression, but it is difficult with this technique to resolve all of the proteins in the cell and to detect proteins that are produced at a very low level (see Kahn, P., Science, 270:369 (1995)).
Other methods of determining peptide expression in an mRNA population involve the use of antibodies to probe populations of peptides produced from mRNA pools. Thus, "libraries" of synthetic polypeptides corresponding to the polypeptides coded for by mRNA molecules are produced and then probed by individual antibodies. This method does not detect all of the polypeptides produced from the mRNA at one time, as it may miss polypeptides expressed at low levels. Moreover, the method is limited to available antibodies. This method is described in, for example, U.S. Pat. No. 5,242,798, issued Sep. 7, 1993, and in U.S. Pat. No. 4,900,811, issued Feb. 13, 1990.
Furthermore, in all of these protein detection methods, once a particular protein difference has been determined, the protein must still be partially sequenced and cloned in order to determine the gene that is responsible for expression of the protein. Alternatively, the protein must be sequenced and compared to a "proteome" database (Kahn, P. Science, 270:369, (1995)). Moreover, determining gene expression patterns by looking at purified proteins from the cell is a method of looking at secondary and tertiary effects of gene expression--translation of mRNA into protein, and post-translational modification--and not the primary effect--transcription of DNA sequences into mRNA. Detecting protein expression levels, furthermore, does not take into account the possibility that proteins may be degraded after translation and that the difference in protein expression is not actually due to a difference in gene expression.
Researchers have also focused on detecting changes in expression of individual mRNAs. One method involves subtractive hybridization, but this method does not have sufficient resolution to detect RNAs that are expressed at very low levels (Lee, S. W., et al., Proc. Natl. Acad. Sci. USA 88:2825 (1991)). Another method involves a microarray hybridization assay in which cDNA is prepared from two mRNA populations, labeled with two different colors, and hybridized to microscope slides to which a cDNA library has been fixed; differential hybridization is then identified by determining whether the sample fluoresces (see Nowak, R., Science, 270:368 (1995); Schena et al., Science 270:467 (1995)). Because much of each mRNA sequence may not be particular to that mRNA, but may be common to many of the mRNA sequences in a particular cell, researchers have focused on short sequences called "tags," each of which is specific for a particular mRNA in the cell and is sufficient to identify the expression of that mRNA. In one such method, randomly chosen cDNA clones are made from mRNAs of a particular tissue. This bulk method of producing cDNAs results in a database of "expressed sequence tags" (Adams, M. D., et al., Science, 255:1651, 1991; Adams, M. D. Nature 355:632-634, 1992). This method of compiling a database of expressed sequence tags provides no information about differential gene expression, nor does it determine the frequency of expression of a gene within a cell.
Other methods have focused on using the polymerase chain reaction (PCR) to define tags and to attempt to detect differentially expressed genes. Many groups have used the PCR method to establish databases of mRNA sequence tags which could conceivably be used to compare gene expression among different tissues (Williams, J. G. K., Nucl. Acids Res. 18:6531, 1990; Welsh, J., et al. Nucl. Acids Res., 18:7213, 1990; Woodward, S. R., Mamm. Genome, 3:73, 1992; Nadeau, J. H., Mamm. Genome 3:55, 1992). This method has also been adapted to compare mRNA populations in a process called mRNA differential display. In this method, the products of PCR synthesis are subjected to gel electrophoresis, and the bands produced by two or more mRNA populations are compared. Bands present on an autoradiograph of a gel from one mRNA population, and not present on another, correspond to the presence of a particular mRNA in one population and not in the other, and thus indicate a gene that is likely to be differentially expressed. Messenger RNA derived from two different types of cells is compared by using arbitrary oligonucleotide sequences of ten nucleotides (random 10-mers) as a 5' primer and one of a set of 12 oligonucleotides complementary to the poly A tail as a 3' "anchor primer." These primers are then used to amplify partial sequences of mRNAs in the presence of radioactive deoxyribonucleotides. The amplified sequences are then resolved on a sequencing gel such that each sequencing gel resolves the amplified sequences of 50-100 mRNAs. The sequencing gels are then compared to each other to determine which amplified segments are expressed differentially (Liang, P. et al. Science 257:967, 1992; see also Welsh, J. et al., Nucl. Acids Res. 20:4965, 1992; Liang, P., et al., Nucl. Acids Res., 3269, 1993).
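The logic of differential display described above may be sketched as follows. This is an illustration under stated assumptions, not the published protocol: it assumes that the 12 anchor primers consist of an oligo(dT) stretch followed by two anchoring bases, the first of which is never T (3 x 4 = 12 combinations), and it represents each gel lane simply as a set of band sizes in nucleotides. The function names and the band sizes in the usage note are hypothetical.

```python
from itertools import product


def anchor_primers(dt_length=11):
    """Enumerate oligo(dT) primers carrying a two-base 3' anchor.

    Assumption: the first anchor base is A, C, or G (never T), and the
    second is any of the four bases, which gives 3 * 4 = 12 primers.
    """
    return ["T" * dt_length + first + second
            for first, second in product("ACG", "ACGT")]


def differential_bands(lane_a, lane_b):
    """Return the band sizes found only in lane A and only in lane B."""
    a, b = set(lane_a), set(lane_b)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    print(len(anchor_primers()), "anchor primers")
    only_a, only_b = differential_bands([150, 210, 320], [150, 320, 410])
    print("only in population A:", only_a)
    print("only in population B:", only_b)
```

The comparison is purely qualitative: a band is scored as present or absent, so the sketch says nothing about how many copies of the corresponding mRNA each population contains.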
Another method based on using PCR to detect the expression of mRNAs relies on the use of 12 anchor primers which hybridize to the poly A tract and two restriction endonucleases, one that cleaves at a 4 nucleotide sequence within the cDNA sequence that corresponds to the mRNA, and another restriction endonuclease which recognizes a single site within each anchor primer. The cDNA derived from the mRNA in each of the 12 pools is then inserted into a vector, downstream from a promoter, and used to transform host cells in order to amplify the vector containing the cDNA insert. "cRNA" antisense transcripts are then made, driven by the promoter, which are then amplified using PCR. The PCR reaction is carried out with 16 or more different primers, in 16 different subpools. Thus, with 12 different anchor primers, 192 subpools are required per mRNA sample. The results of the PCR are then resolved on a sequencing gel (WO 95/13369, Published May 18, 1995).
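The subpool arithmetic in the preceding paragraph can be made explicit. The sketch below assumes the minimum of 16 PCR primers stated in the text; the function names and the pool numbering are hypothetical and serve only to show how the workload grows with the number of mRNA samples compared.

```python
from itertools import product


def subpool_count(anchor_pools=12, primers_per_pool=16, samples=1):
    """Number of separate PCR subpools: one per (anchor pool, primer) pair per sample."""
    return anchor_pools * primers_per_pool * samples


def subpool_labels(anchor_pools=12, primers_per_pool=16):
    """Hypothetical (anchor pool, primer) index pairs, numbered from 1."""
    return list(product(range(1, anchor_pools + 1),
                        range(1, primers_per_pool + 1)))


if __name__ == "__main__":
    print(subpool_count())           # a single mRNA sample
    print(subpool_count(samples=2))  # two mRNA samples to be compared
```

Because the text specifies "16 or more" primers, 192 is a lower bound on the number of reactions per sample.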
Yet another method to analyze gene expression in cells also relies on PCR. In this method, called SAGE, a cDNA copy of mRNA is made using a poly dT primer and is then biotinylated. The cDNA copy is then made double-stranded and cut with an "anchoring enzyme" which recognizes a four base pair sequence present in each cDNA. The biotinylated cDNA is then bound to streptavidin beads to remove the rest of the sequence. This results in a cDNA copy of a portion of the 3' end of the messenger RNA linked to a streptavidin bead. The population of cDNAs linked to streptavidin beads is divided in half. Each half is then ligated to one of two oligonucleotide linkers containing a Type IIs restriction endonuclease recognition site. Type IIs restriction endonucleases cleave DNA at a site distinct from the recognition site. The sequences are cut with the Type IIs restriction endonuclease (the "tagging enzyme"), resulting in cleavage at a site within the cDNA copy of the mRNA sequence. The ends of the DNA sequences are made blunt and the sequences are ligated together in pairs, such that each pair of tag sequences is flanked by one oligonucleotide linker at the 5' end and the other at the 3' end. These "di-tags" are then amplified with PCR using primers specific to the linkers. The PCR-amplified regions are cleaved with the anchoring enzyme and concatenated together into a series of di-tags punctuated by the sequence of the anchoring enzyme recognition site. This series of linked di-tags is then cloned into a sequencing vector and sequenced (Velculescu, V. E. et al., Science 270:484 (1995)).
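The tag extraction and di-tag parsing steps of SAGE described above can be sketched in a few lines. The sketch rests on assumptions that are not stated in the text: the anchoring enzyme site is taken to be CATG, each tag is taken to be exactly 9 bases long, and the sequences in the usage note are invented for illustration. All function names are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed four-base anchoring enzyme site (not named in the text)
TAG_LENGTH = 9        # assumed tag length in bases (not given in the text)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_tag(cdna, anchor=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the bases immediately 3' of the 3'-most anchoring site, or None."""
    position = cdna.rfind(anchor)
    if position == -1:
        return None  # no anchoring site, so no tag can be produced
    start = position + len(anchor)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def make_ditag(tag_a, tag_b):
    """Join two tags blunt end to blunt end; the second ends up reverse-complemented."""
    return tag_a + reverse_complement(tag_b)


def parse_concatemer(concatemer, anchor=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Split a sequenced concatemer at anchoring sites and recover both tags per di-tag."""
    tags = []
    for piece in concatemer.split(anchor):
        if len(piece) != 2 * tag_length:
            continue  # skip flanking sequence and di-tags of unexpected length
        tags.append(piece[:tag_length])
        tags.append(reverse_complement(piece[tag_length:]))
    return tags


if __name__ == "__main__":
    cdna = "TTCATGAAAAACCCCCGGCATGACGTTGCAATTTAAAA"  # invented example sequence
    tag = extract_tag(cdna)
    concatemer = ANCHOR_SITE.join(
        ["", make_ditag(tag, "GGGTTTAAA"), make_ditag(tag, "TTGGCCAAT"), ""])
    print(Counter(parse_concatemer(concatemer)).most_common())
```

Counting the recovered tags gives an estimate of relative transcript frequency, subject to any distortion introduced by the PCR step. The sketch ignores the case in which the junction of two tags happens to recreate the anchoring site.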
The use of PCR results in problems of reproducibility and adds other complicated steps, including the preparation and annealing of PCR primers, to a method of detecting gene expression patterns. Moreover, these PCR-based methods do not necessarily detect differences in the frequency of gene expression.
The abundance of a PCR product after amplification is influenced by many factors in addition to starting template abundance. Sequence-specific differences in "amplification efficiency" are well known to give rise to artifactual differences in the quantity of PCR product in the absence of real differences in starting template. Moreover, even repetitive amplification of the same template preparation has been reported to produce product yields that can vary by as much as 6-fold (Gilliland et al. in: PCR Protocols. Academic Press, pp 60-69 (1990)). Hence, any PCR-based method that attempts to infer starting template abundance from the quantity of product produced by amplification requires stringent co-amplification controls. In the above-cited "SAGE" technique, all cDNA "tags" that happen to have a highly amplifiable sequence will be over-represented, while those that have "difficult" sequences will be under-represented after the PCR step. The use of "ditags" fails to rectify all of the reliability problems involved in using SAGE to determine starting template abundance. Excluding any ditag that is repetitively isolated fails to eliminate all of the over-represented tag sequences. Artificially enhanced "amplifiability" may be the result of just one of the tags, in which case any ditag containing that tag would be over-represented. Moreover, this exclusion does nothing about sequences which are artificially under-represented.
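The effect of sequence-specific amplification efficiency can be illustrated numerically. The sketch below uses the idealized exponential model N = N0 * (1 + E)^n, where E is the per-cycle efficiency and n the number of cycles; this model, which ignores the plateau phase, is an assumption introduced here for illustration, as are the efficiency values and cycle number in the usage note.

```python
def pcr_product(start_copies, efficiency, cycles):
    """Idealized product yield: start_copies * (1 + efficiency) ** cycles."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return start_copies * (1.0 + efficiency) ** cycles


def apparent_fold_difference(eff_a, eff_b, cycles, start_a=1.0, start_b=1.0):
    """Ratio of product A to product B after amplification."""
    return pcr_product(start_a, eff_a, cycles) / pcr_product(start_b, eff_b, cycles)


if __name__ == "__main__":
    # Two templates present in equal amounts, amplified with hypothetical
    # per-cycle efficiencies of 0.90 and 0.80 for 25 cycles.
    ratio = apparent_fold_difference(0.90, 0.80, cycles=25)
    print(f"apparent fold difference: {ratio:.2f}")
```

Under these assumed values, two templates of identical starting abundance differ by roughly fourfold in product yield, which is why the quantity of product cannot be read directly as a measure of starting template abundance.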
Thus, there is a need for a simple and reproducible method for detecting gene expression and for identifying genes and gene expression patterns in individual cells or tissues, as well as a method for determining the frequency of gene expression in these cells or tissues.