The disclosed invention is generally in the field of nucleic acid characterization and analysis, and specifically in the area of analysis and comparison of gene expression patterns and genomes.
The study of differences in gene-expression patterns is one of the most promising approaches for understanding mechanisms of differentiation and development. In addition, the identification of disease-related target molecules opens new avenues for rational pharmaceutical intervention. Currently, there are two main approaches to the analysis of molecular expression patterns: (1) the generation of mRNA-expression maps and (2) examination of the ‘proteome’, in which the expression profile of proteins is analyzed by techniques such as two-dimensional gel electrophoresis, mass spectrometry [matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) or electrospray] and by the ability to sequence sub-picomole amounts of protein. Classical approaches to transcript imaging, such as northern blotting or plaque hybridization, are time-consuming and material-intensive ways to analyze mRNA-expression patterns. For these reasons, other methods for high-throughput screening in industrial and clinical research have been developed.
A breakthrough in the analysis of gene expression was the development of the northern-blot technique in 1977 (Alwine et al., Proc. Natl. Acad. Sci. U.S.A. 74:5350-5354 (1977)). With this technique, labeled cDNA or RNA probes are hybridized to RNA blots to study the expression patterns of mRNA transcripts. Alternatively, RNase-protection assays can detect the expression of specific RNAs. These assays allow the expression of mRNA subsets to be determined in a parallel manner. For RNase-protection assays, the sequence of the analyzed mRNA has to be known in order to synthesize a labeled cDNA that forms a hybrid with the selected mRNA; such hybrids resist RNA degradation by a single-strand-specific nuclease and can be detected by gel electrophoresis. As a third approach, differential plaque-filter hybridization allows the identification of specific differences in the expression of cloned cDNAs (Maniatis et al., Cell 15:687-701 (1978)). Although all of these techniques are excellent tools for studying differences in gene expression, the limiting factor of these classical methods is that expression patterns can be analyzed only for known genes.
The analysis of gene-expression patterns made a significant advance with the development of subtractive cDNA libraries, which are generated by hybridizing an mRNA pool of one origin to an mRNA pool of a different origin. Transcripts that do not find a complementary strand in the hybridization step are then used for the construction of a cDNA library (Hedrick et al., Nature 308:149-153 (1984)). A variety of refinements to this method have been developed to identify specific mRNAs (Swaroop et al., Nucleic Acids Res. 25:1954 (1991); Diatchenko et al., Proc. Natl. Acad. Sci. U.S.A. 93:6025-6030 (1996)). One of these is the selective amplification of differentially expressed mRNAs via biotin- and restriction-mediated enrichment (SABRE; Lavery et al., Proc. Natl. Acad. Sci. U.S.A. 94:6831-6836 (1997)), in which cDNAs derived from a tester population are hybridized against the cDNAs of a driver (control) population. After a purification step specific for tester-cDNA-containing hybrids, tester-tester homohybrids are specifically amplified using an added linker, thus allowing the isolation of previously unknown genes.
The technique of differential display of eukaryotic mRNA was the first one-tube method to analyze and compare transcribed genes systematically in a bi-directional fashion; subtractive and differential hybridization techniques have only been adapted for the unidirectional identification of differentially expressed genes (Liang and Pardee, Science 257:967-971 (1992)). Refinements have been proposed to strengthen reproducibility, efficiency, and performance of differential display (Bauer et al., Nucleic Acids Res. 11:4272-4280 (1993); Liang and Pardee, Curr. Opin. Immunol. 7:274-280 (1995); Ito and Sakaki, Methods Mol. Biol. 85:37-44 (1997); Praschar and Weissman, Proc. Natl. Acad. Sci. U.S.A. 93:659-663 (1996)). Although these approaches are more reproducible and precise than traditional PCR-based differential display, they still require the use of gel electrophoresis, which often implies the exclusion of certain DNA fragments from analysis.
Originally developed to identify differences between two complex genomes, representational difference analysis (RDA) was adapted to analyze differential gene expression by taking advantage of both subtractive hybridization and PCR (Lisitsyn et al., Science 259:946-951 (1993); Hubank and Schatz, Nucleic Acids Res. 22:5640-5648 (1994)). In the first step, mRNA derived from two different populations, the tester and the driver (control), is reverse transcribed; the tester cDNA represents the cDNA population in which differential gene expression is expected to occur. Following digestion with a frequently cutting restriction endonuclease, linkers are ligated to both ends of the cDNA. A PCR step then generates the initial representation of the different gene pools. The linkers of the tester and driver cDNA are digested and a new linker is ligated to the ends of the tester cDNA. The tester and driver cDNAs are then mixed in a 1:100 ratio with an excess of driver cDNA in order to promote hybridization between single-stranded cDNAs common in both tester and driver cDNA pools. Following hybridization of the cDNAs, a PCR exponentially amplifies only those homoduplexes generated by the tester cDNA, via the priming sites on both ends of the double-stranded cDNA (O'Neill and Sinclair, Nucleic Acids Res. 25:2681-2682 (1997); Wada et al., Kidney Int. 51:1629-1638 (1997); Edman et al., J. 323:113-118 (1997)).
The gene-expression pattern of a cell or organism determines its basic biological characteristics. In order to accelerate the discovery and characterization of mRNA-encoding sequences, the idea emerged to sequence fragments of cDNA randomly, directly from a variety of tissues (Adams et al., Science 252:1651-1656 (1991); Adams et al., Nature 377:3-16 (1995)). These expressed sequence tags (ESTs) allow the identification of coding regions in genome-derived sequences. Publicly available EST databases allow the comparative analysis of gene expression by computer. Differentially expressed genes can be identified by comparing the databases of expressed sequence tags of a given organ or cell type with sequence information from a different origin (Lee et al., Proc. Natl. Acad. Sci. U.S.A. 92:8303-8307 (1995); Vasmatzis et al., Proc. Natl. Acad. Sci. U.S.A. 95:300-304 (1998)). A drawback to sequencing of ESTs is the requirement for large-scale sequencing facilities.
Serial analysis of gene expression (SAGE) is a sequence-based approach to the identification of differentially expressed genes through comparative analyses (Velculescu et al., Science 270:484-487 (1995)). It allows the simultaneous analysis of sequences that derive from different cell populations or tissues. Three steps form the molecular basis for SAGE: (1) generation of a sequence tag (10-14 bp) to identify expressed transcripts; (2) ligation of sequence tags to obtain concatemers that can be cloned and sequenced; and (3) comparison of the sequence data to determine differences in expression of genes that have been identified by the tags. This procedure is performed for every mRNA population to be analyzed. A major drawback of SAGE is the fact that corresponding genes can be identified only for those tags that are deposited in gene banks, thus making the efficiency of SAGE dependent on the extent of available databases. Alternatively, a major sequencing effort is required to complete a SAGE data set capable of providing 95% coverage of any given mRNA population, simply because most of the sequencing work yields repetitive reads on those tags that are present in high frequency in cellular mRNA. In other words, SAGE sequencing experiments yield diminishing returns for rare mRNAs, whose unique tags will begin to accumulate in the database only after many weeks of sequencing effort.
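The tag-counting and comparison steps of SAGE described above can be illustrated with a minimal Python sketch. It is a simplification for illustration only: it assumes a fixed tag length of 10 bp (the text gives a range of 10-14 bp), assumes that a concatemer read consists of tags joined end to end with no intervening enzyme sites, and uses hypothetical function names (`count_tags`, `compare_libraries`) that do not come from any SAGE software.

```python
from collections import Counter

TAG_LEN = 10  # assumed fixed tag length; the text gives 10-14 bp


def count_tags(concatemer, tag_len=TAG_LEN):
    """Split a concatemer read into consecutive fixed-length tags and count them.

    Simplification: real concatemers contain enzyme sites between tags;
    here tags are assumed to be joined directly end to end.
    """
    usable = len(concatemer) - len(concatemer) % tag_len  # drop a trailing partial tag
    return Counter(concatemer[i:i + tag_len] for i in range(0, usable, tag_len))


def compare_libraries(counts_a, counts_b):
    """Return {tag: (count_a, count_b)} for every tag whose counts differ."""
    tags = set(counts_a) | set(counts_b)
    # Counter returns 0 for a tag that was never observed in a library
    return {t: (counts_a[t], counts_b[t]) for t in tags if counts_a[t] != counts_b[t]}


if __name__ == "__main__":
    library_a = count_tags("AAAAAAAAAA" + "CCCCCCCCCC" + "AAAAAAAAAA")
    library_b = count_tags("AAAAAAAAAA" + "GGGGGGGGGG")
    for tag, (a, b) in sorted(compare_libraries(library_a, library_b).items()):
        print(tag, a, b)
```

Because abundant tags dominate such counts, most additional reads in this scheme only increment counters that are already large, which is the diminishing-returns effect noted above for rare mRNAs.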
A different approach to the study of gene-expression profiles and genome composition is the use of DNA microarrays. Current DNA microarrays are systematically gridded at high density. Such microarrays are generated by using cDNAs (for example, ESTs), PCR products or cloned DNA, which are linked to the surface of nylon filters, glass slides or silicon chips (Schena et al., Science 270:467-470 (1995)). DNA arrays can also be assembled from synthetic oligonucleotides, either by directly applying the synthesized oligonucleotides to the matrix or by a more sophisticated method that combines photolithography and solid-phase chemical synthesis (Fodor et al., Nature 364:555-556 (1993)). To determine differences in gene expression, labeled cDNAs or oligonucleotides are hybridized to the DNA- or oligomer-carrying arrays. When using different fluorophores for labeling cDNAs or oligonucleotides, two probes can be applied simultaneously to the array and compared at different wavelengths. The expression of 10,000 or more genes can be analyzed on a single chip (Chee et al., Science 274:610-614 (1996)). However, depending on the sensitivity of both cDNA and oligonucleotide arrays, the intensity of hybridization signals can leave the linear range when either weakly or abundantly expressed genes are analyzed. Thus, individual optimization steps are required to ensure the accurate detection of differentially expressed genes. While such microarray methods may be used to address a number of interesting biological questions, they are not suitable for the discovery of new genes.
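The two-fluorophore comparison and the linear-range caveat can be sketched as follows. This is an illustrative Python example only: the function name `log2_ratio` and the intensity bounds `low` and `high` are hypothetical placeholders, since the usable linear range depends on the particular array and scanner and is not specified here.

```python
import math


def log2_ratio(signal_a, signal_b, low=200.0, high=60000.0):
    """Return log2(signal_a / signal_b) for one array spot.

    signal_a and signal_b are the intensities read at the two wavelengths.
    low and high are assumed (hypothetical) bounds of the linear range;
    if either signal falls outside them, the ratio is not trusted and
    None is returned instead.
    """
    in_range = low <= signal_a <= high and low <= signal_b <= high
    if not in_range:
        return None
    return math.log2(signal_a / signal_b)


if __name__ == "__main__":
    spots = {"spot_1": (4000.0, 1000.0), "spot_2": (65000.0, 1000.0), "spot_3": (100.0, 1000.0)}
    for name, (a, b) in spots.items():
        print(name, log2_ratio(a, b))
```

Spots that return `None` correspond to the weakly or abundantly expressed genes whose signals leave the linear range and would call for the individual optimization steps mentioned above.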
There is a need for a method that combines the power and convenience of array hybridization technology with the capability for gene discovery inherent in differential display or SAGE. Such a method would be most attractive if it could enable comprehensive gene expression analysis without the use of gel electrophoresis, and without the need for a redundant DNA sequencing effort.
Therefore, it is an object of the present invention to provide a method for the comprehensive analysis of nucleic acid sequence tags.
It is another object of the present invention to provide a detector composition that allows indexing of nucleic acid sequence tags.
Disclosed is a method for the comprehensive analysis of nucleic acid samples and a detector composition for use in the method. The method, referred to as Fixed Address Analysis of Sequence Tags (FAAST), involves generation of a set of nucleic acid fragments having a variety of sticky end sequences; indexing of the fragments into sets based on the sequence of sticky ends; associating a detector sequence with the fragments; sequence-based capture of the indexed fragments on a detector array; and detection of the fragment labels. Generation of the multiple sticky end sequences is accomplished by incubating the nucleic acid sample with one or more nucleic acid cleaving reagents. Preferably this is accomplished by subjecting the nucleic acid sample to digestion by a restriction endonuclease that cleaves at a site different from the recognition sequence, or by multiple restriction endonucleases. The indexed fragments are captured by hybridization and coupling, preferably by ligation, to a probe. The probe is preferably immobilized in an array or on sortable beads.
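The fragment-generation and indexing steps can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: it assumes FokI as an example of a restriction endonuclease that cleaves outside its recognition sequence (recognition sequence GGATG, cutting 9 nucleotides downstream on the top strand and 13 on the bottom strand, which leaves a 4-nucleotide 5' overhang), it scans only the top strand, and the function names `sticky_ends` and `index_by_sticky_end` are hypothetical.

```python
from collections import defaultdict

RECOGNITION = "GGATG"  # FokI recognition sequence (example enzyme, an assumption)
TOP_OFFSET = 9         # top-strand cut lies 9 nt past the recognition sequence
OVERHANG_LEN = 4       # bottom-strand cut at 13 nt gives a 4-nt 5' overhang


def sticky_ends(seq):
    """Yield (site_position, overhang) for each top-strand recognition site.

    Sites too close to the end of the sequence to give a full overhang
    are skipped. Sites on the reverse strand are ignored in this sketch.
    """
    seq = seq.upper()
    start = seq.find(RECOGNITION)
    while start != -1:
        cut = start + len(RECOGNITION) + TOP_OFFSET
        overhang = seq[cut:cut + OVERHANG_LEN]
        if len(overhang) == OVERHANG_LEN:
            yield start, overhang
        start = seq.find(RECOGNITION, start + 1)


def index_by_sticky_end(seqs):
    """Group cleavage events into index sets keyed by sticky-end sequence."""
    bins = defaultdict(list)
    for name, seq in seqs.items():
        for pos, overhang in sticky_ends(seq):
            bins[overhang].append((name, pos))
    return dict(bins)


if __name__ == "__main__":
    sample = {
        "frag1": "AAGGATGACGTACGTATTGCAAAA",
        "frag2": "GGATGTTTTTTTTTTTGCCCC",
    }
    for end, members in sorted(index_by_sticky_end(sample).items()):
        print(end, members)
```

With a 4-nucleotide overhang there are at most 4^4 = 256 possible sticky-end sequences, so each fragment falls into one of 256 index sets in this sketch; capture on a detector array by hybridization and ligation is not modeled.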
The method allows detection of the indexed fragments where detection provides some sequence information for the fragments including the sequence of the original sticky end of each fragment, the recognition sequence of the restriction endonuclease (if different from the sticky end sequence), and the sequence corresponding to the probe. The method allows a complex sample of nucleic acid to be cataloged quickly and easily in a reproducible and sequence-specific manner.
One form of the FAAST method, referred to as variable address analysis of sequence tags (VAAST), allows determination of associations, in a nucleic acid molecule, of different combinations of known or potential sequences. For example, particular combinations of joining and variable regions in immunoglobulins or T cell receptors can be determined. Another form of the FAAST method, referred to as modification assisted analysis of sequence tags (MAAST), assesses modification of sequences in nucleic acid molecules by basing cleavage of the molecules on the presence or absence of modification. For example, a site that is methylated in a nucleic acid molecule will not be cut by a restriction enzyme that is sensitive to methylation at that site. A restriction enzyme that is insensitive to methylation will cleave at that site, thus producing a different pattern of sequence tags.
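The principle behind the methylation example can be sketched in Python as follows. This is a simplified illustration, not the disclosed procedure: it assumes the enzyme pair HpaII (blocked by methylation of the internal cytosine of CCGG) and MspI (which cuts CCGG regardless of that methylation) as an example of a sensitive and an insensitive enzyme, represents methylation as a set of site positions supplied by the caller, and uses hypothetical function names.

```python
SITE = "CCGG"  # recognized by both HpaII and MspI (example enzyme pair, an assumption)


def find_sites(seq):
    """Return the start positions of every CCGG site in the sequence."""
    seq = seq.upper()
    return [i for i in range(len(seq) - len(SITE) + 1) if seq.startswith(SITE, i)]


def cut_sites(seq, methylated, sensitive):
    """Return the sites that are cleaved.

    methylated is a set of site start positions that carry methylation.
    A methylation-sensitive enzyme skips those sites; an insensitive
    enzyme cuts every site.
    """
    sites = find_sites(seq)
    if sensitive:
        return [s for s in sites if s not in methylated]
    return sites


def methylation_differences(seq, methylated):
    """Return sites cut by the insensitive enzyme but not by the sensitive one."""
    insensitive = set(cut_sites(seq, methylated, sensitive=False))
    sensitive = set(cut_sites(seq, methylated, sensitive=True))
    return sorted(insensitive - sensitive)


if __name__ == "__main__":
    sequence = "AACCGGTTTTCCGGAA"
    print(methylation_differences(sequence, methylated={10}))
```

Sites reported by `methylation_differences` are those at which the two digests diverge, and therefore those that would give rise to differing sequence tags between the two enzymes.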