The disclosed invention is generally in the field of nucleic acid characterization and analysis, and specifically in the area of analysis and comparison of gene expression patterns, nucleic acid samples, and genomes.
The study of differences in gene-expression patterns is one of the most promising approaches for understanding mechanisms of differentiation and development. In addition, the identification of disease-related target molecules opens new avenues for rational pharmaceutical intervention. Currently, there are two main approaches to the analysis of molecular expression patterns: (1) the generation of mRNA-expression maps and (2) examination of the ‘proteome’, in which the expression profile of proteins is analyzed by techniques such as two-dimensional gel electrophoresis, mass spectrometry [matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) or electrospray] and by the ability to sequence sub-picomole amounts of protein. Classical approaches to transcript imaging, such as northern blotting or plaque hybridization, are time-consuming and material-intensive methods to analyze mRNA-expression patterns. For these reasons, other methods for high-throughput screening in industrial and clinical research have been developed.
A breakthrough in the analysis of gene expression was the development of the northern-blot technique in 1977 (Alwine et al., Proc. Natl. Acad. Sci. U.S.A. 74:5350-5354 (1977)). With this technique, labeled cDNA or RNA probes are hybridized to RNA blots to study the expression patterns of mRNA transcripts. Alternatively, RNase-protection assays can detect the expression of specific RNAs. These assays allow the expression of mRNA subsets to be determined in a parallel manner. For RNase-protection assays, the sequence of the analyzed mRNA has to be known in order to synthesize a labeled cDNA that forms a hybrid with the selected mRNA; such hybrids resist RNA degradation by a single-strand-specific nuclease and can be detected by gel electrophoresis. As a third approach, differential plaque-filter hybridization allows the identification of specific differences in the expression of cloned cDNAs (Maniatis et al., Cell 15:687-701 (1978)). Although all of these techniques are excellent tools for studying differences in gene expression, the limiting factor of these classical methods is that expression patterns can be analyzed only for known genes.
The analysis of gene-expression patterns made a significant advance with the development of subtractive cDNA libraries, which are generated by hybridizing an mRNA pool of one origin to an mRNA pool of a different origin. Transcripts that do not find a complementary strand in the hybridization step are then used for the construction of a cDNA library (Hedrick et al., Nature 308:149-153 (1984)). A variety of refinements to this method have been developed to identify specific mRNAs (Swaroop et al., Nucleic Acids Res. 25:1954 (1991); Diatchenko et al., Proc. Natl. Acad. Sci. U.S.A. 93:6025-6030 (1996)). One of these is the selective amplification of differentially expressed mRNAs via biotin- and restriction-mediated enrichment (SABRE; Lavery et al., Proc. Natl. Acad. Sci. U.S.A. 94:6831-6836 (1997)). In this method, cDNAs derived from a tester population are hybridized against the cDNAs of a driver (control) population. After a purification step specific for tester-cDNA-containing hybrids, tester-tester homohybrids are specifically amplified using an added linker, thus allowing the isolation of previously unknown genes.
The technique of differential display of eukaryotic mRNA was the first one-tube method to analyze and compare transcribed genes systematically in a bi-directional fashion; subtractive and differential hybridization techniques have only been adapted for the unidirectional identification of differentially expressed genes (Liang and Pardee, Science 257:967-971 (1992)). Refinements have been proposed to strengthen reproducibility, efficiency, and performance of differential display (Bauer et al., Nucleic Acids Res. 11:4272-4280 (1993); Liang and Pardee, Curr. Opin. Immunol. 7:274-280 (1995); Ito and Sakaki, Methods Mol. Biol. 85:37-44 (1997); Praschar and Weissman, Proc. Natl. Acad. Sci. U.S.A. 93:659-663 (1996); Shimkets et al., Nat. Biotechnol. 17:798-803 (1999)). Although these approaches are more reproducible and precise than traditional PCR-based differential display, they still require the use of gel electrophoresis. This requirement often excludes certain DNA fragments from analysis.
Originally developed to identify differences between two complex genomes, representational difference analysis (RDA) was adapted to analyze differential gene expression by taking advantage of both subtractive hybridization and PCR (Lisitsyn et al., Science 259:946-951 (1993); Hubank and Schatz, Nucleic Acids Res. 22:5640-5648 (1994)). In the first step, mRNA derived from two different populations, the tester and the driver (control), is reverse transcribed; the tester cDNA represents the cDNA population in which differential gene expression is expected to occur. Following digestion with a frequently cutting restriction endonuclease, linkers are ligated to both ends of the cDNA. A PCR step then generates the initial representation of the different gene pools. The linkers of the tester and driver cDNA are digested and a new linker is ligated to the ends of the tester cDNA. The tester and driver cDNAs are then mixed in a 1:100 ratio with an excess of driver cDNA in order to promote hybridization between single-stranded cDNAs common in both tester and driver cDNA pools. Following hybridization of the cDNAs, a PCR exponentially amplifies only those homoduplexes generated by the tester cDNA, via the priming sites on both ends of the double-stranded cDNA (O'Neill and Sinclair, Nucleic Acids Res. 25:2681-2682 (1997); Wada et al., Kidney Int. 51:1620-1628 (1997); Edman et al., J. 323:112-118 (1997)).
[...] biological characteristics. In order to accelerate the discovery and characterization of mRNA-encoding sequences, the idea emerged to sequence fragments of cDNA randomly, directly from a variety of tissues (Adams et al., Science 252:1651-1656 (1991); Adams et al., Nature 377:3-16 (1995)). These expressed sequence tags (ESTs) allow the identification of coding regions in genome-derived sequences. Publicly available EST databases allow the comparative analysis of gene expression by computer.
Differentially expressed genes can be identified by comparing the databases of expressed sequence tags of a given organ or cell type with sequence information from a different origin (Lee et al., Proc. Natl. Acad. Sci. U.S.A. 92:8303-8307 (1995); Vasmatzis et al., Proc. Natl. Acad. Sci. U.S.A. 95:300-304 (1998)). A drawback to sequencing of ESTs is the requirement for large-scale sequencing facilities.
Serial analysis of gene expression (SAGE) is a sequence-based approach to the identification of differentially expressed genes through comparative analyses (Velculescu et al., Science 270:484-487 (1995)). It allows the simultaneous analysis of sequences that derive from different cell populations or tissues. Three steps form the molecular basis for SAGE: (1) generation of a sequence tag (10-14 bp) to identify expressed transcripts; (2) ligation of sequence tags to obtain concatemers that can be cloned and sequenced; and (3) comparison of the sequence data to determine differences in expression of genes that have been identified by the tags. This procedure is performed for every mRNA population to be analyzed. A major drawback of SAGE is that corresponding genes can be identified only for those tags that are deposited in gene banks, which makes the efficiency of SAGE dependent on the extent of available databases. Alternatively, a major sequencing effort is required to complete a SAGE data set capable of providing 95% coverage of any given mRNA population, simply because most of the sequencing work yields repetitive reads on those tags that are present at high frequency in cellular mRNA. In other words, SAGE sequencing experiments yield diminishing returns for rare mRNAs, whose unique tags will begin to accumulate in the database only after many weeks of sequencing effort.
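The tag generation and comparison steps of SAGE can be illustrated with a minimal Python sketch. It is not the published protocol: it assumes a hypothetical anchoring site (CATG), takes the 10 bases that follow the 3'-most occurrence of that site as the tag, and skips the concatemer cloning step entirely. The names ANCHOR, TAG_LEN, sage_tag, tag_counts and compare_populations are illustrative only.

```python
from collections import Counter

ANCHOR = "CATG"  # assumed anchoring-enzyme site, chosen for illustration
TAG_LEN = 10     # lower end of the 10-14 bp tag range mentioned in the text


def sage_tag(transcript, anchor=ANCHOR, tag_len=TAG_LEN):
    """Return the tag that follows the 3'-most anchor site, or None."""
    pos = transcript.rfind(anchor)
    if pos == -1:
        return None  # no anchor site, so no tag can be generated
    start = pos + len(anchor)
    tag = transcript[start:start + tag_len]
    # Discard transcripts too short to yield a full-length tag.
    return tag if len(tag) == tag_len else None


def tag_counts(transcripts):
    """Count how often each tag occurs in one mRNA population."""
    tags = (sage_tag(t) for t in transcripts)
    return Counter(t for t in tags if t is not None)


def compare_populations(counts_a, counts_b):
    """Map each tag to its (count in A, count in B) pair."""
    all_tags = sorted(set(counts_a) | set(counts_b))
    return {tag: (counts_a.get(tag, 0), counts_b.get(tag, 0)) for tag in all_tags}


if __name__ == "__main__":
    population_a = ["AAACATGTTTTTCATGACGTACGTACGG"] * 3
    population_b = ["AAACATGTTTTTCATGACGTACGTACGG", "GGCATGTTTTTTTTTTAA"]
    table = compare_populations(tag_counts(population_a), tag_counts(population_b))
    for tag, (a, b) in table.items():
        print(tag, a, b)
```

Because abundant transcripts contribute the same tag repeatedly, most entries added to such a count table are repeats, which is the source of the diminishing returns for rare mRNAs noted above.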
A different approach to the study of gene-expression profiles and genome composition is the use of DNA microarrays. Current DNA microarrays are systematically gridded at high density. Such microarrays are generated by using cDNAs (for example, ESTs), PCR products or cloned DNA, which are linked to the surface of nylon filters, glass slides or silicon chips (Schena et al., Science 270:467-470 (1995)). DNA arrays can also be assembled from synthetic oligonucleotides, either by directly applying the synthesized oligonucleotides to the matrix or by a more sophisticated method that combines photolithography and solid-phase chemical synthesis (Fodor et al., Nature 364:555-556 (1993)). To determine differences in gene expression, labeled cDNAs or oligonucleotides are hybridized to the DNA- or oligomer-carrying arrays. When using different fluorophores for labeling cDNAs or oligonucleotides, two probes can be applied simultaneously to the array and compared at different wavelengths. The expression of 10,000 genes and more can be analyzed on a single chip (Chee et al., Science 274:610-614 (1996)). However, depending on the sensitivity of both cDNA and oligonucleotide arrays, the intensity of hybridization signals can leave the linear range when either weakly or abundantly expressed genes are analyzed. Thus, individual optimization steps are required to ensure the accurate detection of differentially expressed genes. While such microarray methods may be used to address a number of interesting biological questions, they are not suitable for the discovery of new genes.
Techniques of tagging DNA fragments using sticky end-specific adaptors have been described (Burger and Schinzel, Mol. Gen. Genet. 189:269-274 (1983); Mandecki and Bolling, Gene 68:101-107 (1988); Posfai and Szybalski, Gene 74:179-181 (1988); Urlaub et al., Proc. Natl. Acad. Sci. 82:1189-1193 (1985); Vermesch and Bennett, Gene 54:229-238 (1987); Unrau and Deugau, Gene 145(2):163-9 (1994)). These techniques all involve the use of existing restriction sites and produce tagged fragments of various lengths.
There is a need for a method that combines the power and convenience of array hybridization technology with the capability for gene discovery inherent in differential display or SAGE. Such a method would be most attractive if it could enable comprehensive gene expression analysis without the use of gel electrophoresis, and without the need for a redundant DNA sequencing effort.
Therefore, it is an object of the present invention to provide a method for the comprehensive analysis of nucleic acid sequence tags.
It is another object of the present invention to provide a detector composition that allows indexing of nucleic acid sequence tags.
It is another object of the present invention to provide catalogs of sequence tags from nucleic acid samples.
Disclosed is a method for the comprehensive analysis of nucleic acid samples and a detector composition for use in the method. The method, referred to as Binary Encoded Sequence Tags (BEST), involves generating a set of nucleic acid fragments; adding to the ends of the fragments an adaptor containing a recognition site for cleavage at a site offset from the recognition site; cleaving the fragments to generate fragments having a plurality of sticky ends; and indexing the fragments into sets based on the sequence of the sticky ends. Multiple sticky end sequences are generated by virtue of offset cleavage using the recognition site added as part of the adaptor. Preferably this is accomplished by subjecting the nucleic acid sample to digestion by a restriction endonuclease that cleaves at a site different from the site of the recognition sequence. The fragments are indexed by adding an offset adaptor to newly generated ends. A different adaptor will be coupled to each different sticky end. The resulting fragments, which have defined ends, are of equal lengths (in a preferred embodiment), and have a central sequence derived from the source nucleic acid molecule, are binary sequence tags. The binary sequence tags can be used and further analyzed in numerous ways. For example, the binary sequence tags can be captured by hybridization and coupling, preferably by ligation, to a probe. The probe is preferably immobilized in an array or on sortable beads. The disclosed method differs from prior methods at least in that it introduces an offset cleavage site into the target nucleic acid fragments. This has the advantage that sets of sequence tags are generated that have defined lengths.
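The offset cleavage and indexing steps can be illustrated with a minimal Python sketch. It is not the disclosed implementation: it assumes a hypothetical adaptor that ends in a FokI-like recognition site (GGATG) with cuts 9 and 13 nucleotides past the site, so that a four-nucleotide sticky end results, and it models only one strand of each fragment. The names SITE, TOP_CUT, BOTTOM_CUT, ADAPTOR, offset_cleave and index_fragments are illustrative only.

```python
from collections import defaultdict

SITE = "GGATG"        # assumed offset-cutter recognition site (FokI-like)
TOP_CUT = 9           # assumed cut position on the top strand, past the site
BOTTOM_CUT = 13       # assumed cut position on the bottom strand, past the site
ADAPTOR = "ACGTGGATG" # hypothetical adaptor ending in the recognition site


def offset_cleave(construct, site=SITE, top_cut=TOP_CUT, bottom_cut=BOTTOM_CUT):
    """Cleave at a fixed offset from the recognition site.

    Returns (central, sticky): the source-derived bases between the site and
    the top-strand cut, and the single-stranded bases between the two cuts.
    Returns None if the site is absent or the construct is too short.
    """
    pos = construct.find(site)
    if pos == -1:
        return None
    end = pos + len(site)
    if len(construct) < end + bottom_cut:
        return None
    central = construct[end:end + top_cut]
    sticky = construct[end + top_cut:end + bottom_cut]
    return central, sticky


def index_fragments(fragments, adaptor=ADAPTOR):
    """Group fragments into sets keyed by their generated sticky end."""
    bins = defaultdict(list)
    for fragment in fragments:
        result = offset_cleave(adaptor + fragment)
        if result is not None:
            central, sticky = result
            bins[sticky].append(central)
    return dict(bins)


if __name__ == "__main__":
    fragments = ["AAAAAAAAACCGGTTTT", "CCCCCCCCCACGTAAAA", "GGGGGGGGGCCGGAAAA"]
    for sticky, centrals in index_fragments(fragments).items():
        print(sticky, centrals)
```

Under these assumptions every central sequence has the same length, because the cut position is fixed relative to the adaptor rather than to a site in the source sequence, and a four-nucleotide sticky end allows up to 4**4 = 256 index sets.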
The method allows detection of the binary sequence tags where detection provides some sequence information for the tags, including the sequence of the generated sticky end of each fragment, the recognition sequence of the nucleic acid cleaving reagent (preferably a restriction endonuclease) used to initially cleave nucleic acid molecules, and the central sequence of the tag. A nucleic acid sample processed with particular nucleic acid cleaving reagents and adaptors will produce a characteristic set of binary sequence tags. The method allows a complex sample of nucleic acid to be cataloged quickly and easily in a reproducible and sequence-specific manner. The disclosed method also should produce two binary sequence tags for each cleavage site in the nucleic acid sample. This can allow comparisons and validation of a set of binary sequence tags.
One form of the BEST method, referred to as modification assisted analysis of binary sequence tags (MAABST), assesses modification of sequences in nucleic acid molecules by detecting differential cleavage based on the presence or absence of modification in the molecules. For example, a site that is methylated in a nucleic acid molecule will not be cut by a restriction enzyme that is sensitive to methylation at that site. A restriction enzyme that is insensitive to methylation will cleave at that site, thus producing a different pattern of binary sequence tags.
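The differential cleavage underlying MAABST can be illustrated with a minimal Python sketch. It is an assumption-laden model, not the disclosed procedure: methylation is represented as a set of zero-based sequence positions, both enzymes are assumed to recognize the same hypothetical site (CCGG here), and the sensitive enzyme is assumed to be blocked by any methylated position inside the site. The names cut_sites and methylation_dependent_sites are illustrative only.

```python
def cut_sites(seq, site, methylated=frozenset(), sensitive=False):
    """Return start positions of recognition sites that the enzyme cuts.

    A methylation-sensitive enzyme skips any site that overlaps a
    methylated position; an insensitive enzyme cuts every site.
    """
    sites = []
    start = seq.find(site)
    while start != -1:
        covered = range(start, start + len(site))
        blocked = sensitive and any(i in methylated for i in covered)
        if not blocked:
            sites.append(start)
        start = seq.find(site, start + 1)
    return sites


def methylation_dependent_sites(seq, site, methylated):
    """Sites cut by the insensitive enzyme but not by the sensitive one."""
    insensitive = set(cut_sites(seq, site))
    sensitive = set(cut_sites(seq, site, methylated, sensitive=True))
    return sorted(insensitive - sensitive)


if __name__ == "__main__":
    sequence = "AACCGGTTTTCCGGAA"
    methylated_positions = {11}  # a base inside the second CCGG site
    print(cut_sites(sequence, "CCGG"))
    print(cut_sites(sequence, "CCGG", methylated_positions, sensitive=True))
    print(methylation_dependent_sites(sequence, "CCGG", methylated_positions))
```

Sites reported by methylation_dependent_sites are those where the two digests would give rise to different sets of binary sequence tags.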