The identification of nucleic acids by their sequence is important to the study of gene expression and regulation, to epidemiology and public health, to diagnostics and prognostics, to heredity determination (such as paternity determination), and to forensics. The ability of one strand of a nucleic acid molecule to hybridize to a complementary strand of another nucleic acid molecule allows for the capture of nucleic acid molecules of interest from a population of nucleic acid molecules that may be large and complex. Such capture can lead to the identification and/or purification of nucleic acid molecules of interest in complex populations of nucleic acid molecules, such as the DNA making up the genome of a human being or the population of RNA molecules that are expressed by a cell under certain conditions, for example, a disease state.
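As a purely illustrative sketch of capture by complementarity, the following Python fragment models a probe that selects, from a small population, the molecules containing a perfectly complementary site. The function names `reverse_complement` and `capture`, the example sequences, and the requirement of a perfect match are assumptions made for illustration; actual hybridization depends on temperature, salt concentration, and tolerance of mismatches, none of which is modeled here.

```python
# Watson-Crick base pairing for DNA (illustrative model only).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs with `seq`, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def capture(probe: str, population: list[str]) -> list[str]:
    """Return the molecules that contain a site perfectly complementary to `probe`."""
    target_site = reverse_complement(probe)
    return [molecule for molecule in population if target_site in molecule]


# Hypothetical population of single-stranded molecules.
population = ["GGATCCGATTACAGGT", "TTTTTTTTTT", "ACGTGATTACAACGT"]

# A probe designed against the hypothetical site GATTACA.
probe = reverse_complement("GATTACA")
captured = capture(probe, population)
```

In this toy model, only the molecules carrying the site complementary to the probe are retained, which mirrors how a probe can pull molecules of interest out of a complex population.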
Analysis of the expression of RNA transcripts by electrophoresis, blotting to membranes, and hybridization of labeled probes (“Northern blots”) can provide quantitative data on the expression of genes. However, this method of analysis is labor-intensive and time-consuming. In addition, the sensitivity of this method is relatively low, and it is impractical for analyzing the expression of many different genes, as hybridization with each additional probe corresponding to a different gene requires a round of stripping the old probe from the membrane, hybridizing the new probe, washing the membrane, and autoradiography for signal detection.
RNase protection assays allow for increased sensitivity, more reliable quantitation, and the analysis of multiple RNA transcripts in a single hybridization reaction. However, the number of genes that can be analyzed in one reaction is still relatively low, and gel electrophoresis and autoradiography are required, both of which are labor-intensive and time-consuming.
Nucleic acid chips or arrays allow for the identification of a large set of nucleic acid molecules simultaneously (see, for example, Debouck and Goodfellow (1999) Nature Genetics Suppl., 21: 48-50; Duggan et al. (1999) Nature Genetics Suppl., 21: 10-14; Gerhold et al. (1999) Trends Biochem Sci. 24: 168-173; Alizadeh et al. (2000) Nature 403: 503-511). When applied to the study of gene expression, the use of gene chips or arrays can rapidly identify a set of genes expressed under given conditions. Such methods typically involve hybridizing cDNA synthesized from RNA by reverse transcription to a DNA array that has sequences from many genes attached to it in an ordered pattern. The cDNA is labeled by incorporation of labeled nucleotides during synthesis (see, for example, Schena et al. (1995) Science 270: 467-470), or in some cases by the incorporation of labeled primers (U.S. Pat. No. 6,004,755 issued Dec. 21, 1999 to Wang). However, the efficiency of reverse transcription can vary among different RNA transcripts, such that the incorporation of label may be quite variable. Variable rates of reverse transcription can also lead to under- or over-representation of particular cDNAs with respect to the original RNA transcript population. Another difficulty is that cDNAs synthesized by reverse transcription of RNA transcripts will hybridize with different efficiencies to nucleic acids on solid supports, due to the variability of their lengths. Thus it is difficult to obtain accurate data on the levels of expression of genes in a population. This is particularly problematic when comparing two populations of RNA, in which the two populations may be standardized with respect to levels of expression of a particular message.
Mutations are alterations in the genome with respect to the standard wild-type sequence. Mutations can be deletions, insertions, or rearrangements of nucleic acid sequences at a position in the genome, or they can be single base changes at a position in the genome, referred to as “point mutations”. Mutations can be inherited, or they can occur in certain cells during the lifespan of an individual. Particular mutations can be correlated with certain cancers, or with the degree of malignancy of certain cancers.
Single nucleotide polymorphisms (SNPs) are positions of variability in the genome due to a single base change with respect to the wild-type sequence. In some cases, SNPs are point mutations that are diagnostic of genetic defects, for example sickle cell anemia. SNPs can also be positions in the genome where some degree of variability is expected among a population, such as a human population. SNPs can correlate with the ability of a patient to respond positively or negatively to one or more drugs or medications, and thus their identification can be useful in pharmacogenetics. Identifying the nucleotides at particular SNP sites can also be used to identify an individual with a high degree of reliability, and thus can have value in heredity determinations, criminology, and forensics.
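The notion of a single base change relative to a reference sequence can be illustrated with a minimal Python sketch. The function name `single_base_differences` and both sequences are hypothetical, and the sketch assumes the two sequences are already aligned and of equal length; it does not represent any particular detection method.

```python
def single_base_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (0-based position, reference base, sample base) wherever two
    pre-aligned, equal-length sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical toy sequences differing at a single position.
reference = "ACTCCTGAGGAG"
sample = "ACTCCTGTGGAG"
diffs = single_base_differences(reference, sample)
```

Here `diffs` holds one entry, reporting that the base at position 7 (counting from zero) is A in the reference and T in the sample.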
While point mutations and SNPs can have profound consequences on the health of an individual and provide a highly reliable tool for identifying an individual, they are somewhat difficult to detect. There are currently several variations on methods of detecting mutations and SNPs on DNA arrays. These methods rely on amplifying a subject's DNA prior to hybridization and identification on the chip. Amplification methods can result in misincorporated bases that can provide inaccurate information on the identity of bases at known or suspected mutation or SNP sites. Moreover, in many cases it is important to identify mutations or SNPs in genes that are expressed, and many genes may not be expressed in a given tissue at a particular time. It is also desirable to identify genes or regions of genes that can be amplified or deleted in genetic disorders or cancers. In many cases, tumor classification can be aided by identifying characteristic patterns of gene amplification or deletion (Pollack et al. (1999) Nature Genetics 23: 41-46; Arribas et al. (1999) Clin. Cancer Res. 5: 3454-9; Tanner et al. (1995) Clin. Cancer Res. 1: 1455-61). Methods of mutation analysis that rely on PCR are difficult to quantitate, and those that rely on gel electrophoresis are time-consuming and can only analyze a limited number of genes in a single test. SNPs can also be detected by mass spectrometry-based methods that detect molecular weight differences of DNA fragments that contain SNP sites. This method is limited by the resolution of mass spectrometry and by the requirement for expensive equipment.
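To illustrate why resolution matters for mass spectrometry-based detection, the following Python sketch estimates the mass shift produced by a single base substitution in a DNA fragment. The residue masses and end correction are commonly quoted approximate average values for single-stranded DNA without terminal phosphates and are used here as assumptions; the names `approx_mass` and `substitution_shift` and the example fragment are hypothetical.

```python
from itertools import combinations

# Approximate average masses (Da) of deoxynucleotide residues within a strand.
# These are assumed, commonly quoted values, not measurements.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
END_CORRECTION = -61.96  # approximate adjustment for a strand lacking terminal phosphates


def approx_mass(seq: str) -> float:
    """Approximate average molecular mass (Da) of a single-stranded DNA fragment."""
    return sum(RESIDUE_MASS[base] for base in seq.upper()) + END_CORRECTION


def substitution_shift(ref_base: str, alt_base: str) -> float:
    """Mass change (Da) when `ref_base` is replaced by `alt_base`."""
    return RESIDUE_MASS[alt_base] - RESIDUE_MASS[ref_base]


# Smallest absolute mass difference between any two different bases.
closest_pair = min(
    combinations("ACGT", 2),
    key=lambda pair: abs(substitution_shift(*pair)),
)

# Hypothetical 12-base fragment and its single-base variant.
wild_type = approx_mass("ACTCCTGAGGAG")
variant = approx_mass("ACTCCTGTGGAG")
shift = variant - wild_type
```

Under these assumed values, an A/T substitution changes the fragment mass by only about 9 Da out of several thousand, so the two alleles can be distinguished only if the instrument resolves differences of that size at the mass of the fragment.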
The present invention recognizes that it is difficult to obtain reliable quantitative data on the expression of genes using solid supports, and that it is difficult, labor-intensive, and time-consuming to obtain information on the expression of genes using current RNase protection methods. The present invention also recognizes that there is a need to efficiently characterize particular mutations or sequence variations, such as SNPs or gene amplifications, that may characterize certain disease states or genotypes and that can provide information on the sequence of genes that are expressed by a subject.