The genomes of all organisms undergo spontaneous mutation in the course of their continuing evolution, generating variant forms of progenitor sequences (Gusella, Ann. Rev. Biochem. 55, 831-854 (1986)). The variant form may confer an evolutionary advantage or disadvantage relative to a progenitor form or may be neutral. In some instances, a variant form confers a lethal disadvantage and is not transmitted to subsequent generations of the organism. In other instances, a variant form confers an evolutionary advantage to the species and is eventually incorporated into the DNA of many or most members of the species and effectively becomes the progenitor form. In many instances, both progenitor and variant form(s) survive and co-exist in a species population. The coexistence of multiple forms of a sequence gives rise to polymorphisms.
Several different types of polymorphism have been reported. A restriction fragment length polymorphism (RFLP) means a variation in DNA sequence that alters the length of a restriction fragment as described in Botstein et al., Am. J. Hum. Genet. 32, 314-331 (1980). Other polymorphisms take the form of short tandem repeats (STRs) that include tandem di-, tri- and tetra-nucleotide repeated motifs. Some polymorphisms take the form of single nucleotide variations between individuals of the same species. Such polymorphisms are far more frequent than RFLPs, STRs and variable number tandem repeats (VNTRs). Single nucleotide polymorphisms can occur anywhere in protein-coding sequences, intronic sequences, regulatory sequences, or intergenic regions.
Many polymorphisms probably have little or no phenotypic effect. Some polymorphisms, principally those occurring within coding sequences, are known to be the direct cause of serious genetic diseases, such as sickle cell anemia. Polymorphisms occurring within a coding sequence typically exert their phenotypic effect by leading to a truncated or altered expression product. Still other polymorphisms, particularly those in promoter regions and other regulatory sequences, may influence a range of disease-susceptibility, behavioral and other phenotypic traits through their effect on gene expression levels. That is, such polymorphisms may lead to increased or decreased levels of gene expression without necessarily affecting the nature of the expression product.
The invention provides methods of monitoring expression levels of different polymorphic forms of a gene. Such methods entail analyzing genomic DNA from an individual to determine the presence of heterozygous polymorphic forms at a polymorphic site within a transcribed sequence of a gene of interest. RNA from a tissue of the individual in which the gene is expressed is then analyzed to determine the relative proportions of the polymorphic forms in transcripts of the gene.
In some methods, genomic DNA is analyzed by amplifying a segment of genomic DNA from a sample and hybridizing the amplified genomic DNA to an array of immobilized probes. In some methods, the array used for analyzing genomic DNA comprises a first probe group comprising one or more probes exactly complementary to a first polymorphic form of the gene and a second probe group comprising one or more probes exactly complementary to a second polymorphic form of the gene. In some methods, RNA is analyzed by reverse transcribing and amplifying mRNA expressed from the gene to produce an amplified nucleic acid and hybridizing the amplified nucleic acid to an array of immobilized probes. In some such methods, the amplified nucleic acid is cDNA. In some methods, the array of immobilized probes for analyzing RNA comprises a first probe group comprising one or more probes exactly complementary to a first polymorphic form of the gene, and a second probe group comprising one or more probes exactly complementary to a second polymorphic form of the gene.
In some methods, the genomic DNA and the RNA are analyzed by hybridizing the genomic DNA or an amplification product thereof, and the RNA or an amplification product thereof, to the same array of immobilized probes comprising a first probe group comprising one or more probes exactly complementary to a first polymorphic form of the gene, and a second probe group comprising one or more probes exactly complementary to a second polymorphic form of the gene. In some methods, the genomic DNA, or amplification product, and the RNA, or amplification product, bear different labels and are hybridized simultaneously to the array.
Some methods further comprise comparing a genomic DNA hybridization intensity of the first probe group to the second group to determine a genomic hybridization ratio, and comparing an RNA hybridization intensity of the first group to the second group to determine an RNA hybridization ratio, whereby a difference in the genomic DNA and RNA ratios indicates that the polymorphic forms of the gene are expressed at different levels in the individual.
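By way of illustration only, the comparison of the genomic DNA and RNA hybridization ratios can be sketched as follows. The function names, the use of a mean intensity per probe group, and the two-fold (log2 = 1.0) threshold are assumptions made for this example and are not specified by the methods described above. Intensities are assumed to be background-corrected and positive.

```python
import math


def hybridization_ratio(first_group, second_group):
    """Ratio of the mean intensity of the first probe group to that of the second."""
    mean_first = sum(first_group) / len(first_group)
    mean_second = sum(second_group) / len(second_group)
    return mean_first / mean_second


def differs_in_expression(genomic_ratio, rna_ratio, log2_threshold=1.0):
    """True when the RNA ratio departs from the genomic ratio by more than
    the (hypothetical) log2 fold-change threshold."""
    return abs(math.log2(rna_ratio / genomic_ratio)) > log2_threshold


# Hypothetical intensities for a heterozygous individual.
genomic = hybridization_ratio([1000.0, 1100.0], [1050.0, 950.0])  # about 1.05
rna = hybridization_ratio([3000.0, 3300.0], [1000.0, 1100.0])     # 3.0
print(differs_in_expression(genomic, rna))
```

Normalizing the RNA ratio by the genomic ratio, rather than assuming a genomic ratio of exactly 1, is one way to correct for probe-specific differences in hybridization efficiency between the two groups.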
Some methods further comprise sequencing a nontranscribed region of the gene to identify a second polymorphic site in a promoter or enhancer region of the gene.
The invention further provides methods of monitoring expression levels of different polymorphic forms of a collection of genes. In such methods, genomic DNA, or an amplification product thereof, from an individual is hybridized to an array of immobilized probes comprising a subarray of probes for each gene in the collection, wherein each subarray comprises a first group of one or more probes exactly complementary to a first polymorphic form of the gene and a second group of one or more probes exactly complementary to a second polymorphic form of the gene. The relative hybridization of the first and second groups of probes to the genomic DNA or amplification product thereof is analyzed for each subarray to identify heterozygous genes in the individual. RNA or an amplification product thereof from the individual is hybridized to the array of immobilized probes. The hybridization intensities of the first and second groups of probes to the RNA or amplification product are compared to identify a subset of the heterozygous genes for which different polymorphic forms are expressed at different levels. Such methods can be performed to screen large collections of genes, e.g., 100, 1000, or 100,000. Some such methods further comprise sequencing a nontranscribed region of a gene in the subset to identify a further polymorphism in a promoter, enhancer or intronic sequence of the gene.
A nucleic acid is a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form, including known analogs of natural nucleotides unless otherwise indicated.
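The two-stage screen of a collection of genes can be sketched, purely as an example, in the following way. The data layout (a dictionary keyed by gene name holding genomic and RNA intensities for the two probe groups), the heterozygosity window of 0.5 to 2.0, and the log2 threshold of 1.0 are all hypothetical choices for this sketch. Intensities are assumed to be background-corrected and positive.

```python
import math


def mean(values):
    return sum(values) / len(values)


def screen_genes(subarrays, het_window=(0.5, 2.0), log2_threshold=1.0):
    """Return (heterozygous genes, subset with allele-specific expression).

    subarrays maps a gene name to {"genomic": (group1, group2),
    "rna": (group1, group2)}, each group being a list of intensities.
    """
    heterozygous, differential = [], []
    for gene, data in subarrays.items():
        g_ratio = mean(data["genomic"][0]) / mean(data["genomic"][1])
        # A genomic ratio far from 1 suggests only one form is present.
        if not het_window[0] <= g_ratio <= het_window[1]:
            continue
        heterozygous.append(gene)
        r_ratio = mean(data["rna"][0]) / mean(data["rna"][1])
        if abs(math.log2(r_ratio / g_ratio)) > log2_threshold:
            differential.append(gene)
    return heterozygous, differential


# Hypothetical intensities for three genes.
example = {
    "geneA": {"genomic": ([1000.0, 1000.0], [1000.0, 1000.0]),
              "rna": ([4000.0], [1000.0])},
    "geneB": {"genomic": ([2000.0], [50.0]), "rna": ([3000.0], [60.0])},
    "geneC": {"genomic": ([900.0], [1000.0]), "rna": ([1800.0], [2000.0])},
}
het, diff = screen_genes(example)
print(het, diff)  # ['geneA', 'geneC'] ['geneA']
```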
An oligonucleotide is a single-stranded nucleic acid ranging in length from 2 to about 500 bases. Oligonucleotides are often synthetic but can also be produced from naturally occurring polynucleotides.
A probe is an oligonucleotide capable of binding to a target nucleic acid of complementary sequence through one or more types of chemical bonds, usually through complementary base pairing, usually through hydrogen bond formation. Oligonucleotide probes are often 10-50 or 15-30 bases long. An oligonucleotide probe may include natural (i.e., A, G, C, or T) or modified bases (7-deazaguanosine, inosine, etc.). In addition, the bases in an oligonucleotide probe may be joined by a linkage other than a phosphodiester bond, so long as it does not interfere with hybridization. Thus, oligonucleotide probes may be peptide nucleic acids in which the constituent bases are joined by peptide bonds rather than phosphodiester linkages.
Specific hybridization refers to the binding, duplexing, or hybridizing of a molecule only to a particular nucleotide sequence under stringent conditions when that sequence is present in a complex mixture (e.g., total cellular DNA or RNA). Stringent conditions are conditions under which a probe will hybridize to its target subsequence, but to no other sequences. Stringent conditions are sequence-dependent and are different in different circumstances. Longer sequences hybridize specifically at higher temperatures. Generally, stringent conditions are selected to be about 5° C. lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. The Tm is the temperature (under defined ionic strength, pH, and nucleic acid concentration) at which 50% of the probes complementary to the target sequence hybridize to the target sequence at equilibrium. (As the target sequences are generally present in excess, at Tm, 50% of the probes are occupied at equilibrium). Typically, stringent conditions include a salt concentration of at least about 0.01 to 1.0 M Na ion concentration (or other salts) at pH 7.0 to 8.3 and the temperature is at least about 30° C. for short probes (e.g., 10 to 50 nucleotides). Stringent conditions can also be achieved with the addition of destabilizing agents such as formamide. For example, conditions of 5×SSPE (750 mM NaCl, 50 mM NaPhosphate, 5 mM EDTA, pH 7.4) and a temperature of 25-30° C. are suitable for allele-specific probe hybridizations.
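The arithmetic behind these conditions can be illustrated with a short sketch. The helper names are hypothetical. The component concentrations are simply scaled from the 5× SSPE composition stated above, and the temperature helper applies the general guideline of about 5° C. below Tm. It does not compute Tm itself, which must be supplied by the user.

```python
# Component concentrations of 5x SSPE in mM, as stated in the text above.
SSPE_5X_MM = {"NaCl": 750.0, "NaPhosphate": 50.0, "EDTA": 5.0}


def sspe_concentrations(fold):
    """Concentrations (mM) of each component at the given fold strength of SSPE."""
    return {name: conc * fold / 5.0 for name, conc in SSPE_5X_MM.items()}


def stringent_temperature(tm_celsius, offset_celsius=5.0):
    """Hybridization temperature chosen about 5 degrees C below a supplied Tm."""
    return tm_celsius - offset_celsius


print(sspe_concentrations(1.0))
print(stringent_temperature(60.0))  # 55.0
```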
A perfectly matched probe has a sequence perfectly complementary to a particular target sequence. The test probe is typically perfectly complementary to a portion (subsequence) of the target sequence. The term “mismatch probe” refers to a probe whose sequence is deliberately selected not to be perfectly complementary to a particular target sequence. Although the mismatch(es) may be located anywhere in the mismatch probe, terminal mismatches are less desirable as a terminal mismatch is less likely to prevent hybridization of the target sequence. Thus, probes are often designed to have the mismatch located at or near the center of the probe such that the mismatch is most likely to destabilize the duplex with the target sequence under the test hybridization conditions.
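As a minimal sketch of this design, a perfectly matched probe can be derived as the reverse complement of a target subsequence, and a mismatch probe obtained by altering the central base. Replacing the central base with its complement is only one possible substitution rule and is an assumption of this example, as are the function names and the example sequence.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def perfect_match_probe(target):
    """Reverse complement of a target subsequence (both written 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(target))


def central_mismatch_probe(pm_probe):
    """Copy of the probe with the central base replaced by its complement."""
    center = len(pm_probe) // 2
    return pm_probe[:center] + COMPLEMENT[pm_probe[center]] + pm_probe[center + 1:]


target = "ACGTTGCAAGGCTTA"            # hypothetical 15-base target subsequence
pm = perfect_match_probe(target)      # TAAGCCTTGCAACGT
mm = central_mismatch_probe(pm)       # TAAGCCTAGCAACGT
print(pm, mm)
```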
Transcription levels can be quantified absolutely or relatively. Absolute quantification can be accomplished by inclusion of known concentration(s) of one or more target nucleic acids (e.g., control nucleic acids such as Bio B, or known amounts of the target nucleic acids themselves) and referencing the hybridization intensity of unknowns with the known target nucleic acids (e.g., through generation of a standard curve). Alternatively, relative quantification can be accomplished by comparison of hybridization signals between two or more polymorphic forms of a transcript.
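Absolute quantification against a standard curve can be sketched as below. The sketch assumes, for simplicity, a linear relationship between concentration and hybridization intensity over the range of the controls. That assumption, the function names, and the example values are illustrative only; real hybridization signals may saturate and call for a different curve model.

```python
def fit_standard_curve(concentrations, intensities):
    """Ordinary least-squares line: intensity = slope * concentration + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, intensities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(intensity, slope, intercept):
    """Invert the fitted line to estimate an unknown concentration."""
    return (intensity - intercept) / slope


# Hypothetical spiked-in controls: concentration (pM) and measured intensity.
slope, intercept = fit_standard_curve([1.0, 2.0, 4.0, 8.0],
                                      [150.0, 250.0, 450.0, 850.0])
unknown = estimate_concentration(650.0, slope, intercept)
print(slope, intercept, unknown)  # 100.0 50.0 6.0
```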
A polymorphic marker or site is the locus at which divergence occurs. Preferred markers have at least two alleles, each occurring at a frequency of greater than 1%, and more preferably greater than 10% or 20% of a selected population. A polymorphic locus may be as small as one base pair. Polymorphic markers include restriction fragment length polymorphisms, variable number of tandem repeats (VNTRs), hypervariable regions, minisatellites, dinucleotide repeats, trinucleotide repeats, tetranucleotide repeats, simple sequence repeats, and insertion elements such as Alu. The first identified allelic form is arbitrarily designated as the reference form and other allelic forms are designated as alternative or variant alleles. The allelic form occurring most frequently in a selected population is sometimes referred to as the wildtype form. Diploid organisms may be homozygous or heterozygous for allelic forms. A diallelic polymorphism has two forms. A triallelic polymorphism has three forms.
A single nucleotide polymorphism (SNP) occurs at a polymorphic site occupied by a single nucleotide, which is the site of variation between allelic sequences. The site is usually preceded by and followed by highly conserved sequences of the allele (e.g., sequences that vary in less than 1/100 or 1/1000 members of the population).
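The frequency criterion for a preferred marker can be checked, for example, as in the sketch below. Representing each diploid genotype as a pair of allele symbols, the function names, and the sample data are assumptions made for this illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies from diploid genotypes given as (allele, allele) pairs."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def is_preferred_marker(genotypes, min_frequency=0.01):
    """True when at least two alleles each exceed the minimum frequency."""
    frequencies = allele_frequencies(genotypes)
    common = [a for a, f in frequencies.items() if f > min_frequency]
    return len(common) >= 2


# Hypothetical sample of ten individuals at a diallelic site.
sample = [("A", "A")] * 6 + [("A", "G")] * 3 + [("G", "G")]
print(allele_frequencies(sample))        # {'A': 0.75, 'G': 0.25}
print(is_preferred_marker(sample, 0.2))  # True
```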
A single nucleotide polymorphism usually arises due to substitution of one nucleotide for another at the polymorphic site. A transition is the replacement of one purine by another purine or one pyrimidine by another pyrimidine. A transversion is the replacement of a purine by a pyrimidine or vice versa. Single nucleotide polymorphisms can also arise from a deletion of a nucleotide or an insertion of a nucleotide relative to a reference allele.
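The distinction between a transition and a transversion can be expressed as a short classification routine. The function name and error handling are choices made for this sketch. Only the four DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(reference, variant):
    """Classify a single-base substitution as a transition or a transversion."""
    reference, variant = reference.upper(), variant.upper()
    valid = PURINES | PYRIMIDINES
    if reference not in valid or variant not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if reference == variant:
        raise ValueError("reference and variant bases are identical")
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) is a transition.
    same_class = (reference in PURINES) == (variant in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```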