The desire to decode the human genome and to understand the genetic basis of disease and a host of other physiological states associated with differential gene expression has been a key driving force in the development of improved methods for analyzing and sequencing DNA, e.g. Adams et al, Editors, Automated DNA Sequencing and Analysis (Academic Press, New York, 1994). The human genome is estimated to contain about 10^5 genes, about 15-30% of which (or about 4-8 megabases) are active in any given tissue. Such large numbers of expressed genes make it difficult to track changes in expression patterns by available techniques, such as hybridization of gene products to microarrays, direct sequence analysis, or the like. More commonly, expression patterns are initially analyzed by lower resolution techniques, such as differential display, indexing, subtraction hybridization, or one of the numerous DNA fingerprinting techniques, e.g. Vos et al, Nucleic Acids Research, 23: 4407-4414 (1995); Hubank et al, Nucleic Acids Research, 22: 5640-5648 (1994); Liang et al, Science, 257: 967-971 (1992); Erlander et al, International patent application PCT/US94/13041; McClelland et al, U.S. Pat. No. 5,437,975; Unrau et al, Gene, 145: 163-169 (1994); Geng et al, BioTechniques, 25: 434-438 (1998); and the like. Higher resolution analysis is then frequently carried out on subsets of cDNA clones identified by the application of such techniques, e.g. Linskens et al, Nucleic Acids Research, 23: 3244-3251 (1995).
Recently, two techniques have been implemented that attempt to provide direct sequence information for analyzing patterns of gene expression. One involves the use of microarrays of oligonucleotides or polynucleotides for capturing complementary polynucleotides from expressed genes, e.g. Schena et al, Science, 270: 467-469 (1995); DeRisi et al, Science, 278: 680-686 (1997); Chee et al, Science, 274: 610-614 (1996). The other involves the excision and concatenation of short sequence tags from cDNAs, followed by conventional sequencing of the concatenated tags, i.e. serial analysis of gene expression (SAGE), e.g. Velculescu et al, Science, 270: 484-486 (1995); Zhang et al, Science, 276: 1268-1272 (1997); Velculescu et al, Cell, 88: 243-251 (1997). Both techniques have shown promise as potentially robust systems for analyzing gene expression; however, technical issues remain to be addressed for both approaches. For example, in microarray systems, genes to be monitored must be known and isolated beforehand. Moreover, current generation microarrays lack the complexity to provide a comprehensive analysis of mammalian gene expression, they are not readily re-usable, and they require expensive specialized data collection and analysis systems, although these of course may be used repeatedly. In sequence tag systems, although no special instrumentation is necessary and an extensive installed base of DNA sequencers may be used, the selection of type IIs tag-generating enzymes is limited, and the length (nine nucleotides) of the sequence tag in current protocols severely limits the number of cDNAs that can be uniquely labeled. It can be shown that for organisms expressing large sets of genes, such as mammalian cells, the likelihood of nine-nucleotide tags being distinct for all expressed genes is extremely low, e.g. Feller, An Introduction to Probability Theory and Its Applications, Second Edition, Vol. I (John Wiley & Sons, New York, 1971).
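The tag-uniqueness limitation noted above can be illustrated with a short calculation. The following is only a sketch under a simplifying assumption that is not part of the cited protocols: each expressed gene is treated as receiving a tag drawn independently and uniformly from the 4^9 = 262,144 possible nine-nucleotide sequences. The function names are hypothetical, and the gene counts used are taken from the estimate above that about 15-30% of about 10^5 genes are active in a given tissue.

```python
import math


def log_prob_all_distinct(num_genes: int, tag_length: int) -> float:
    """Natural log of the probability that every gene receives a distinct tag.

    Assumption (illustrative only): tags are independent and uniformly
    distributed over all 4**tag_length nucleotide sequences.
    """
    num_tags = 4 ** tag_length
    if num_genes > num_tags:
        # More genes than possible tags: a repeated tag is unavoidable.
        return float("-inf")
    # P(all distinct) = product over i = 0..n-1 of (1 - i/N); sum the logs
    # so that very small probabilities do not underflow to zero.
    return sum(math.log1p(-i / num_tags) for i in range(num_genes))


def expected_shared_tag_pairs(num_genes: int, tag_length: int) -> float:
    """Expected number of gene pairs sharing a tag under the same assumption."""
    num_tags = 4 ** tag_length
    return num_genes * (num_genes - 1) / 2 / num_tags


if __name__ == "__main__":
    for genes in (15_000, 30_000):
        log_p = log_prob_all_distinct(genes, 9)
        pairs = expected_shared_tag_pairs(genes, 9)
        print(f"{genes} genes: ln P(all distinct) = {log_p:.1f}, "
              f"expected shared-tag pairs = {pairs:.0f}")
```

Under this assumption, the logarithm of the probability is a large negative number for both gene counts, consistent with the statement that distinct nine-nucleotide tags for all expressed genes are extremely unlikely. Real tag sequences are not uniformly distributed, so the figures are indicative only.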
It is clear from the above that there is a need for a technique to analyze gene expression that allows both the analysis of unknown genes and the unequivocal assignment of a sequence tag to an expressed gene. Such a technique would find immediate application in medical and scientific research, drug discovery, and genetic analysis in a host of applied fields, such as pest management and crop and livestock development.