The ability to understand and treat complex disorders like cancer, heart disease, and diabetes is limited by our knowledge of the manifold genetic and environmental factors that contribute to disease initiation and progression. This need has driven the development of genomics technologies that merge brute-force data generation with sophisticated computational analysis to provide a comprehensive picture of biological processes. Early work in genomics focused on genome sequencing technologies, which aim to provide a catalog of the genetic blueprint of an organism. This static catalog, however, does not provide direct information about the dynamic interplay between gene expression levels and environmental factors relevant to disease.
Gene expression analysis is a type of genomics technology that provides a snapshot of the expression levels of the various genes within a biological sample such as a cell line, a tissue, an organ, or a whole organism. The RNA transcripts are extracted from cells, and the abundances of the different transcript species are measured. By measuring gene expression levels under different experimental conditions, for example at time points during disease progression, at different developmental stages, in different disease models, or following different therapeutic treatments, genes can be associated with medically and scientifically relevant biological processes.
One method for gene expression profiling, known as expressed sequence tag (EST) sequencing, is to generate a cDNA library from a biological sample, then sequence a number of clones from the library. Sequencing typically starts from the 3′ end of a transcript, generating a unique and reproducible sequence tag for each transcript. The number of times each tag is sequenced is tabulated, yielding a measurement of transcript abundance. A limitation of this method is that it cannot provide reliable expression level measurements for genes expressed at low abundance. If 1000 tags are sequenced, for example, only genes expressed at a level of 1:1000 or greater will typically be detected. In a tissue of typical complexity, 10,000 or more transcripts are present, and a typical abundance is 1:10,000, which is below the sensitivity threshold of EST sequencing.
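The sensitivity argument above can be made concrete with a short calculation. The following is a minimal sketch, assuming that each sequenced tag is an independent random draw from the transcript pool with probability equal to the transcript's fractional abundance; this sampling model and the function names are illustrative assumptions, not part of the methods described here.

```python
def detection_probability(abundance: float, tags_sequenced: int) -> float:
    """Probability that a transcript is sampled at least once.

    Assumes independent draws, each hitting the transcript with
    probability `abundance` (e.g. 1e-4 for a 1:10,000 transcript).
    """
    return 1.0 - (1.0 - abundance) ** tags_sequenced


def expected_tag_count(abundance: float, tags_sequenced: int) -> float:
    """Expected number of tags observed for the transcript."""
    return abundance * tags_sequenced


if __name__ == "__main__":
    for abundance in (1e-3, 1e-4):
        p = detection_probability(abundance, 1000)
        n = expected_tag_count(abundance, 1000)
        print(f"abundance 1:{round(1 / abundance)}: "
              f"expected tags = {n:.1f}, P(detected) = {p:.2f}")
```

Under this assumed model, a transcript at 1:1000 is expected to yield about one tag in 1000 reads and is seen at least once in roughly 63% of experiments, whereas a transcript at 1:10,000 is seen in only about 10% of experiments, consistent with the threshold described above.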
Serial analysis of gene expression (SAGE) is an improved method of EST sequencing in which terminal regions of multiple transcripts are concatenated prior to sequencing, providing approximately a 10-fold improvement in sensitivity. Even with this improvement, however, SAGE is costly and time-consuming.
More recent methods of gene expression analysis employ hybridization of nucleic acid sequences generated from transcript pools to microarray or chip surfaces bearing attached complementary nucleic acid sequences. These methods are restricted to probing the expression levels of known genes. Preparing the nucleic-acid-derivatized surfaces can be costly and time-consuming. Background hybridization limits the sensitivity of these methods to low-abundance genes. Cross-hybridization between homologous genes with high sequence identity, for example 70% or greater, limits the selectivity of hybridization methods. Cross-hybridization can also limit the ability to distinguish between splice variants and allelic variants, including single nucleotide polymorphism (SNP) variants that are gaining importance as markers for association studies.
Differential display methods provide an alternative approach for measuring gene expression levels. These methods start with an RNA transcript pool, possibly in the form of cDNA, then use restriction enzyme (RE) pairs or primer pairs to selectively amplify fragments from certain transcripts within the pool. The fragments are analyzed experimentally, typically using gel electrophoresis, to generate a characteristic banding pattern in which the position of a band corresponds to the length of a fragment and the intensity of the band corresponds to the abundance of the fragment. Comparing banding patterns generated by different samples permits the identification of bands whose intensities vary, corresponding to differentially expressed genes. When this process is repeated using different RE pairs or primer pairs, the majority of transcripts in the original pool generate fragments that can be detected. This method can detect fragments from genes expressed at low levels, 1:50,000 or less, on par with or better than the levels achievable by competing technologies such as hybridization. Differential display also provides the capability to distinguish between homologs and variants by precise determination of size polymorphisms and by the presence or absence of restriction sites even in closely related sequences. A key advantage of differential display over hybridization approaches is that knowledge of transcript sequences is not a prerequisite to experimental analysis.
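The relationship between a transcript sequence, an RE pair, and the resulting band positions can be illustrated with a short sketch. It is a deliberately simplified model, assuming that a detectable fragment is any stretch bounded by a site for one enzyme at one end and a site for the other enzyme at the other end, with no intervening site for either enzyme, and that fragment length is measured between the start positions of the two recognition sites. Cut offsets within the sites, adapter or primer lengths, and the toy sequence are all ignored or hypothetical. The recognition sequences used are those of EcoRI (GAATTC) and BamHI (GGATCC).

```python
import re


def find_sites(seq: str, site: str) -> list[int]:
    """Start positions of every (possibly overlapping) occurrence of `site`."""
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq)]


def predicted_fragments(seq: str, site_a: str, site_b: str) -> list[int]:
    """Lengths of fragments flanked by one site_a and one site_b.

    Simplification: length is the distance between the start positions of
    two adjacent recognition sites belonging to different enzymes.
    """
    sites = sorted(
        [(pos, "A") for pos in find_sites(seq, site_a)]
        + [(pos, "B") for pos in find_sites(seq, site_b)]
    )
    lengths = []
    for (pos1, enz1), (pos2, enz2) in zip(sites, sites[1:]):
        if enz1 != enz2:  # fragments cut by the same enzyme at both ends are skipped
            lengths.append(pos2 - pos1)
    return lengths


if __name__ == "__main__":
    # Hypothetical toy cDNA: one EcoRI site followed by two BamHI sites.
    toy_cdna = "AAAA" + "GAATTC" + "ACGTACGTAC" + "GGATCC" + "TTTT" + "GGATCC" + "CC"
    print(predicted_fragments(toy_cdna, "GAATTC", "GGATCC"))
```

In this toy case only the stretch between the EcoRI site and the first BamHI site qualifies; the stretch between the two BamHI sites is excluded because both of its ends are cut by the same enzyme.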
In its original form, differential display had a significant drawback: it lacked a convenient method to identify the particular nucleic acid sequence, including a gene, responsible for a band in a differential display pattern. To determine the gene sequence responsible for a differentially expressed band, it was necessary to physically isolate the DNA sequences generating the band, which required cutting out a piece of electrophoresis gel, eluting the DNA, and sequencing several clones. After one or more distinct DNA sequences had been identified, other techniques were then needed to conclusively identify the particular sequence that was differentially expressed.
Rothberg et al. describe a method for alleviating the difficulty of this procedure by comparing experimentally detected bands to a database of bands predicted for known gene sequences (U.S. Pat. No. 5,871,697; Shimkets et al. 1999 Nature Biotechnology 17:798–803). Furthermore, a method for rapidly confirming a band prediction made by differential expression and database lookup involves conducting an amplification procedure for detecting the band in the presence of an inhibitory primer that typically is nonlabeled. Thus, even if amplification occurs, the amplicon will not be detected. This procedure is termed “poisoning” herein; use of terms such as “poisoned” and so forth in the description also relates to this procedure. Poisoning is described in full in U.S. Ser. No. 09/381,779 filed Aug. 7, 1998, incorporated herein in its entirety.
In practice, there are two primary limitations of this method. The first limitation is that the height of a particular band provides a relative, rather than absolute, measure of gene expression. Comparing the height of a particular band between different samples provides an estimate of the relative expression of the gene responsible for the band that is reliable for ±1.5-fold ratios. Comparing the heights of bands from different genes within a single sample does not, however, indicate the relative abundance of those genes. This limitation is common to many gene expression methods. Consequently, many profiles are expressed in terms of an n-fold ratio compared to an arbitrary reference state.
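The n-fold ratio described above can be sketched as follows. This is an illustrative example only: the signed-ratio convention (positive for increased expression, negative for decreased expression), the use of 1.5 as a cutoff, and the representation of a banding pattern as a mapping from fragment length to band height are assumptions made for the sketch.

```python
def fold_change(height_sample: float, height_reference: float) -> float:
    """Signed n-fold ratio of a band between a sample and a reference.

    +2.0 means twofold higher in the sample; -2.0 means twofold lower.
    """
    if height_sample <= 0 or height_reference <= 0:
        raise ValueError("band heights must be positive")
    ratio = height_sample / height_reference
    return ratio if ratio >= 1.0 else -1.0 / ratio


def differentially_expressed(sample: dict, reference: dict,
                             threshold: float = 1.5) -> dict:
    """Bands present in both patterns whose |fold change| meets the threshold.

    `sample` and `reference` map fragment length -> band height.
    """
    flagged = {}
    for length in sample.keys() & reference.keys():
        change = fold_change(sample[length], reference[length])
        if abs(change) >= threshold:
            flagged[length] = change
    return flagged


if __name__ == "__main__":
    treated = {150: 300.0, 200: 110.0}   # hypothetical band heights
    control = {150: 100.0, 200: 100.0}
    print(differentially_expressed(treated, control))
```

Note that the ratio compares the same band across samples; it carries no information about how the 150-length band and the 200-length band compare with each other within one sample.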
The second limitation lies in providing a reliable band-to-gene database look-up. Often, a database look-up provides multiple sequences that could correspond to a particular band, whether differentially expressed or not. Even with the physical confirmation method described above, it is inefficient to use trial-and-error to test each sequence that could have contributed a band. Furthermore, even if only one particular nucleic acid sequence, including a gene, is predicted to generate a band corresponding to an experimentally detected band, it is still not certain that the gene actually did so. A gene sequence not in the database, for example, could be responsible.
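The ambiguity of a band-to-gene look-up can be illustrated with a minimal sketch. The data layout, the gene identifiers, the enzyme-pair label, and the length tolerance are all hypothetical; the sketch is not the database format of the cited method.

```python
from collections import defaultdict


def build_band_index(predicted):
    """Index predicted bands by enzyme pair.

    `predicted` is an iterable of (gene_id, enzyme_pair, fragment_length).
    """
    index = defaultdict(list)
    for gene_id, enzyme_pair, length in predicted:
        index[enzyme_pair].append((length, gene_id))
    return index


def lookup_band(index, enzyme_pair, observed_length, tolerance=1.0):
    """Genes whose predicted fragment lies within `tolerance` of the observed band."""
    return sorted(
        gene_id
        for length, gene_id in index.get(enzyme_pair, [])
        if abs(length - observed_length) <= tolerance
    )


if __name__ == "__main__":
    # Hypothetical predictions for one enzyme pair.
    index = build_band_index([
        ("geneA", "EcoRI/BamHI", 150),
        ("geneB", "EcoRI/BamHI", 151),
        ("geneC", "EcoRI/BamHI", 180),
    ])
    print(lookup_band(index, "EcoRI/BamHI", 150.4))  # more than one candidate
    print(lookup_band(index, "EcoRI/BamHI", 300.0))  # no candidate in the database
```

The first look-up returns two candidates that cannot be separated by fragment length alone, and the second returns none, corresponding to the case in which the responsible sequence is absent from the database.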
Ranking the sequences in order of relative likelihood of generating the particular band would provide an important aid to an experimentalist in interpreting a differential display pattern or, more generally, a direct read-out of peak heights following fragment generation from a cDNA pool. Another useful method would provide a numeric score, preferably in the form of either a probability or a p-value, indicating whether a particular nucleic acid sequence, including a gene, contributed a particular band.