1. Field of the Invention
The present invention relates generally to detection of gene expression and analysis of both known and unknown genes. More particularly, it provides a method that can be used for global monitoring of gene expression, as well as for the analysis and quantitation of changes in gene expression for a defined set of genes and in response to a wide variety of events. The method is highly sensitive, rapid and cost-effective.
2. Description of Related Art
The degree of differentiation or physiological state of a cell, a tissue or an organism is characterized by a specific expression status, i.e., the degree of transcriptional activation of all genes or particular groups of genes. The molecular basis for numerous biological processes that result in a change in this state is the coordinated transcriptional activation or inactivation of particular genes or groups of genes in a cell, an organ or an organism. Characterization of this expression status is of key importance for answering many biological questions. Changes in gene expression in response to a stimulus, a developmental stage, a pathological state or a physiological state are important in determining the nature and mechanism of the change and in finding cures that could reverse a pathological condition. Patterns of gene expression are also expected to be useful in the diagnosis of pathological conditions and, for example, may provide a basis for the sub-classification of functionally different subtypes of cancerous conditions.
Several methods that can analyze the expression status of genes are known in the art. Differential display RT-PCR™ (DDRT) is one method for analyzing differential gene expression in which subpopulations of complementary DNA (cDNA) are generated by reverse transcription of mRNA by using a cDNA primer with a 3′ extension (preferably two bases). Random 10-base primers are then used to generate PCR™ products of transcript-specific lengths. If the number of primer combinations used is large enough, it is statistically possible to detect almost all transcripts present in any given sample. PCR™ products obtained from two or more samples are then electrophoresed next to one another on a gel and differences in expression are directly compared. Differentially expressed bands can be cut out of the gel, reamplified and cloned for further analysis.
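The statistical argument above can be illustrated with a minimal sketch. It assumes, purely for illustration, that every primer combination independently amplifies a given transcript with the same probability; the function names and the hit rate of 0.02 are hypothetical and are not taken from the cited references.

```python
import math


def detection_probability(p_single: float, n_combinations: int) -> float:
    """Probability that a transcript is amplified by at least one of
    n_combinations primer pairs, assuming independent, equal hit rates."""
    return 1.0 - (1.0 - p_single) ** n_combinations


def combinations_needed(p_single: float, target: float) -> int:
    """Smallest number of primer combinations whose combined detection
    probability reaches the target coverage (same assumptions as above)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


if __name__ == "__main__":
    p = 0.02  # hypothetical per-combination hit rate, for illustration only
    for n in (10, 50, 100, 250):
        print(n, round(detection_probability(p, n), 3))
    print("combinations for 95% coverage:", combinations_needed(p, 0.95))
```

Under these assumptions coverage approaches, but never reaches, 100%, which is consistent with the statement that only "almost all" transcripts can be detected.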
It is possible to enrich the PCR™ amplification products for a particular subgroup of all mRNA molecules, e.g., members of a particular gene family, by using one primer which has a sequence specific for a gene family in combination with one of the 10-base random primers. The DDRT technique is described in Liang and Pardee, 1992; Liang et al., 1993; Bauer et al., 1993; Stone and Wharton, 1994; Wang and Feuerstein, 1995; WO 93/18176; and DE 43 17 414.
There are a number of disadvantages to the experimental design of DDRT. The differential banding patterns are often only poorly reproducible. Owing to the design of the primers, even the use of longer random primers, e.g., 20 bases in length, does not satisfactorily solve the problem of reproducibility (Ito et al., 1994). In order to evaluate a significant portion of differentially expressed genes, a large number of primer combinations must be used and multiple replicates of each study must be done. The method often results in a high proportion of false positive results, and rare transcripts cannot be detected in many DDRT studies (Bertioli et al., 1995).
Due to the non-stringent PCR™ conditions and the use of only one arbitrary primer, further analysis by sequencing is necessary to identify the gene. Sequencing of selected bands is problematic since the same primer often flanks DDRT products at both ends, so that direct sequencing is not possible and an additional cloning step is necessary. Due to the use of short primers, a further reamplification step with primer molecules extended on the 5′ side is necessary even if two different primers flank the product. Finally, due to the use of random primers, it is never possible to be certain that the primer combinations recognize all transcripts of a cell. This applies, even when using a high number of primers, to studies which are intended to detect the entirety of all transcripts as well as to studies which are directed towards the analysis of a subpopulation of transcripts such as a gene family (Bertioli et al., 1995).
A variant of DDRT, known as GeneCalling, has recently been described (Shimkets et al., 1999) which addresses some of these problems. In this method, multiple pairs of restriction endonucleases are used to prepare specific fragments of a cDNA population prior to amplification with pairs of universal primers. This improves the reproducibility of the measurements and reduces the false positive rate, but the patterns are very complex and identification of individual transcripts requires the synthesis of a unique oligonucleotide for each gene to be tested. In addition, the quantitative data obtained are apparently significant only for changes above 4-fold (Shimkets et al., 1999) and only a weak correlation with other techniques is obtained. The ability of the technique to distinguish the gene-specific band from the complex background for any arbitrarily chosen gene has not been documented (Shimkets et al., 1999).
AFLP-based mRNA fingerprinting further addresses some of the deficiencies of DDRT. AFLP allows for the systematic comparison of the differential expression of genes between RNA samples (Habu, 1997). The technique involves the endonuclease digestion of immobilized cDNA by a single restriction enzyme. The digested fragments are then ligated with a linker specific for the restriction cut site. The tailed fragments are subsequently amplified by PCR™ employing primers that are complementary to the ligated linkers and carry additional variable nucleotides at their 3′ ends. The products of the amplification are visualized by PAGE and the banding patterns are compared to reveal differences in RNA transcription patterns between samples. Although AFLP-based RNA fingerprinting provides an indication of the RNA messages present in a given sample, it fails to restrict the potential number of signals produced by each individual RNA strand. With this technique, each RNA strand may potentially produce multiple fragments and therefore multiple signals upon amplification. This failure to restrict the number of signals from each message complicates the results that must be evaluated.
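The point that a single message can give rise to several fragments can be illustrated with a minimal in-silico digestion sketch. The recognition site GATC, the two toy sequences, and the function name digest are hypothetical choices made for illustration only; the cut is placed at the start of the site as a simplification, and linker ligation and selective amplification are not modeled.

```python
def digest(cdna: str, site: str) -> list[str]:
    """Split a cDNA sequence at every occurrence of a recognition site.

    Simplification: the cut is placed at the start of the site, so every
    fragment after the first begins with the site itself.
    """
    fragments = []
    start = 0
    pos = cdna.find(site, 1)  # a site at position 0 yields no extra fragment
    while pos != -1:
        fragments.append(cdna[start:pos])
        start = pos
        pos = cdna.find(site, pos + len(site))
    fragments.append(cdna[start:])
    return fragments


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    transcripts = {
        "geneA": "ATGGATCCTTAGATCGGCATGATCAAA",
        "geneB": "ATGCCCTTTGGGAAA",
    }
    for name, seq in transcripts.items():
        pieces = digest(seq, "GATC")
        print(name, len(pieces), pieces)
```

In this toy case one transcript yields four fragments and the other remains uncut, so the number of potential signals per message is not fixed.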
Song and Osborn (1994) describe a method for examining the expression of homologous genes in plant polyploids in which the techniques of RT-PCR™ and RFLP (restriction fragment length polymorphism) analysis are combined with one another. In this method a cDNA is produced from RNA by reverse transcription, then amplified by using two gene-specific primers. The amplification products are transcript-specifically shortened by endonuclease cleavage, separated by electrophoresis according to their length, cloned, and then analyzed by sequencing. This method has the disadvantage of low sensitivity, as a cloning step is necessary to characterize the expression products. A further disadvantage of this method is that gene-specific sequence information must be available on at least two regions within the analyzed genes in order to design suitable primers.
In principle, gene expression data for a particular biological sample could be obtained by large-scale sequencing of a cDNA library. The value of sequencing cDNA, generated by reverse transcription from mRNA, has been debated in the context of the human genome project. Proponents of genomic sequencing have pointed to the difficulty of finding every mRNA expressed in all tissues, cell types, and developmental stages. It is also believed that cDNA libraries do not provide all sequences corresponding to structural and regulatory polypeptides (Putney et al., 1983). In addition, libraries of cDNA may be dominated by repetitive elements, mitochondrial genes, ribosomal RNA genes, and other nuclear genes comprising common or housekeeping sequences. While some mRNAs are abundant, others are rare, resulting in cellular quantities of mRNA from various genes that can vary by several orders of magnitude. Therefore, sequencing of transcribed regions of the genome using cDNA libraries has been considered unsatisfactory.
Techniques based on cDNA subtraction or differential display can be used to compare gene expression patterns between two cell types (Hedrick et al., 1984; Liang and Pardee, 1992), but provide only a partial analysis, with no quantitative information regarding the abundance of messenger RNA. Expressed sequence tags (ESTs) have been valuable for gene discovery (Adams et al., 1993; Okubo et al., 1992), but like Northern blotting, RNase protection, and reverse transcriptase-polymerase chain reaction (RT-PCR™) analysis (Alwine et al., 1977; Zinn et al., 1983; Veres et al., 1987), the approach only evaluates a limited number of genes at a time.
Two major strategies for global gene expression analysis have recently become available. Serial analysis of gene expression (SAGE) (U.S. Pat. No. 5,866,330, Kinzler, et al., 1995) is based on the use of short (i.e., 9-10 base pair) nucleotide sequence tags that identify a defined position in an mRNA and are used to ascertain the identity of the corresponding transcript and gene. The cDNA tags are generated from mRNA samples, randomly paired, concatenated, cloned, and sequenced. While this method allows the analysis of a large number of transcripts, the identification of individual genes requires sequencing of tens of thousands of tags for comparison of even a small number of samples. Although SAGE provides a comprehensive picture of gene expression, it cannot be specifically directed at a small subset of the transcriptome (Zhang et al., 1997; Velculescu et al., 1995). Data on the most abundant transcripts are the easiest and fastest to obtain, while about a megabase of sequencing data is needed for confident analysis of low abundance transcripts.
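The tag-counting principle described above can be sketched as follows. This is a simplified illustration, not the procedure of the cited patent: the anchoring sequence CATG, the choice of the 3′-most anchor as the "defined position", the tag length of 10, and the names extract_tag and count_tags are all assumptions made for the example, and the pairing, concatenation and cloning steps are not modeled.

```python
from collections import Counter
from typing import Iterable, Optional


def extract_tag(cdna: str, anchor: str = "CATG", tag_len: int = 10) -> Optional[str]:
    """Return the tag_len bases immediately 3' of the last anchor site,
    or None if the cDNA has no anchor or too few bases after it."""
    pos = cdna.rfind(anchor)
    if pos == -1:
        return None
    begin = pos + len(anchor)
    tag = cdna[begin:begin + tag_len]
    return tag if len(tag) == tag_len else None


def count_tags(cdnas: Iterable[str], anchor: str = "CATG", tag_len: int = 10) -> Counter:
    """Tally tags over a cDNA population; untaggable cDNAs are skipped."""
    tags = (extract_tag(c, anchor, tag_len) for c in cdnas)
    return Counter(t for t in tags if t is not None)


if __name__ == "__main__":
    # Toy cDNA population, invented for this example.
    population = ["AACATGACGTACGTACGG"] * 3 + ["GGCATGTTTTCCCCAAAA", "TTTT"]
    print(count_tags(population).most_common())
    print("possible 10-base tags:", 4 ** 10)
```

Because each sequenced tag is effectively a random draw from the transcript population, abundant transcripts accumulate counts quickly while rare ones require far more sequencing, which is the cost noted above.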
The second method utilizes hybridization of cDNAs or mRNAs to microarrays containing hundreds or thousands of individual cDNA fragments or oligonucleotides specific for particular genes or ESTs. The matrix for hybridization is either a DNA chip, a slide or a membrane. This method can be used to direct a search towards specific subsets of genes, but cannot be used to identify novel genes. In addition, arrays are expensive to produce (DeRisi et al., 1996; Schena et al., 1995). For those methods using cDNA arrays, a library of individually cloned DNA fragments must be maintained with at least one clone for each gene to be analyzed. Because much of the expense of utilizing microarrays lies in maintaining the fragment libraries and programming equipment to construct the microarray, it is only cost-efficient to produce large numbers of identical arrays. These two techniques lack the flexibility to easily change the subset of the transcriptome being analyzed or to focus on smaller subsets of genes for more detailed analyses.
As described above, current techniques for analysis of gene expression either monitor one gene at a time, are designed for the simultaneous and therefore more laborious analysis of thousands of genes, or do not adequately restrict the signal-to-message ratio, i.e., the number of signals produced by each message. There is a need for improved methods that allow rapid, detailed analysis both of global gene expression patterns and of the expression patterns of defined sets of genes, for the investigation of a variety of biological applications. This is particularly true for establishing changes in the pattern of gene expression in the same cell type, for example, in different developmental stages, under different physiologic or pathologic conditions, when treated with different pharmaceuticals, mutagens, carcinogens, etc. Identification of differential patterns of expression has several utilities, including the identification of appropriate therapeutic targets, candidate genes for gene therapy (including gene replacement), tissue typing, forensic identification, mapping locations of disease-associated genes, and the identification of diagnostic and prognostic indicator genes.
The object of the present invention is to provide a method for gene expression analysis which exceeds the capabilities of the state of the art. The optimal method should be rapid and cost-effective, yield easily reproducible and quantitative results, have adequate sensitivity to detect and quantify rare transcripts, and enable identification of amplification products by techniques that do not require an additional cloning or sequencing step. The technique should allow flexibility to analyze either a subset or the complete transcriptome, and should be useful both for gene discovery and for the analysis of previously identified genes.