Many scientific and commercial endeavors, particularly in genetics and drug discovery, require rapid and efficient analysis of large sets of molecules, such as libraries of organic compounds, complementary DNAs (cDNAs), genomic fragments, and the like. For example, in genetics, unraveling the genetic basis of complex traits remains an unsolved problem of immense medical and economic importance. One approach to this problem is to carry out trait-association studies in which a large set of genetic markers from populations of affected and unaffected individuals is compared. Such studies depend on the non-random segregation, or linkage disequilibrium, between the genetic markers and genes involved in the trait or disease being studied. Unfortunately, the extent and distribution of linkage disequilibrium between regions of the human genome is not well understood, but it is currently believed that successful trait-association studies in humans would require the measurement of 30,000-50,000 markers per individual in populations of at least 300-400 affected individuals and an equal number of controls, e.g. Kruglyak and Nickerson, Nature Genetics, 27: 234-236 (2001); Lai, Genome Research, 11: 927-929 (2001); Risch and Merikangas, Science, 273: 1516-1517 (1996); Cardon and Bell, Nature Reviews Genetics, 2: 91-99 (2001). The cost of such studies using current technology is staggering, e.g. Weaver, Trends in Genetics, pgs. 36-41 (December, 2000).
In the area of drug discovery, business imperatives and advances in biotechnology, such as the availability of genomic sequences, high throughput gene expression analysis, proteomics, and bioinformatics, have created a need for efficient large-scale methods for identifying potential drug targets, validating targets, and identifying lead compounds, e.g. Myers and Baker, Nature Biotechnology, 19: 727-730 (2001). Such methods should have the capability to analyze simultaneously tens of thousands of compounds, or more, with minimal handling. For example, it is estimated that there are between 30,000 and 35,000 genes in the human genome and that as many as thirty-five percent of expressed genes appear in multiple forms due to alternative transcript splicing or other post-transcriptional processing events, e.g. Mironov et al, Genome Research, 9: 1288-1293 (1999)(alternative splicing); Beaudoing et al, Genome Research, 10: 1001-1010 (2000)(variant polyadenylation). Moreover, proteins expressed from such gene products are subject to a wide variety of post-translational modifications, e.g. Han et al, Int. J. Biochem., 24: 19-28 (1992). Even if only a few dozen of these gene products are eventually identified as validated targets for a particular disease, lead compounds must still be selected from many hundreds of thousands of candidate molecules, followed by lead optimization.
In the pharmaceutical, chemical and biotechnical fields, molecular tagging strategies have been proposed as a means for efficiently analyzing large numbers of analytes in a single assay reaction, e.g. Brenner, U.S. Pat. No. 5,763,175 (DNA sequencing); Lerner et al, U.S. Pat. No. 6,060,596 (combinatorial libraries); Giese, U.S. Pat. No. 5,360,819 (chemical analysis); Church et al, U.S. Pat. No. 4,942,124 (DNA sequencing); Sill et al, U.S. Pat. No. 5,565,324 (combinatorial libraries); Southern et al, U.S. Pat. No. 6,218,111 (mass tag labels for oligonucleotides); Van Ness et al, U.S. Pat. No. 6,312,893 (mass tags for genotyping); Schoemaker et al, European Pat. Publ. EP 0799897A1 (tracking yeast mutants); Fan et al, PCT publ. WO 00/58516 (genotyping); Wolber et al, U.S. Pat. No. 6,235,483 (labeling cDNAs); Taylor et al, Biotechniques, 30: 661-669 (2000)(“fluid” arrays); and the like. In most approaches, an analytical reaction is followed by a readout that involves spatial separation of the molecular tags, for example, by mass spectrometry, electrophoresis, hybridization to solid phase supports, or the like. A common difficulty of large-scale tagging approaches is associating a particular tag with a particular analyte or reaction. The only exception is the method of Brenner (U.S. Pat. No. 5,763,175), which attaches tags to polynucleotide analytes by a sampling procedure and does not require that the identity of the tags be known for a readout. The usual approach is to prepare each tag and its corresponding analyte-interacting moiety, e.g. a locus-specific primer, or the like, in a separate batch reaction and then to mix the conjugates prior to a multiplexed assay, e.g. Fan et al, Genome Research, 10: 853-860 (2000); Chen et al, Genome Research, 10: 549-557 (2000); and the like. This need for separate batch preparation of each conjugate is a serious impediment to the efficient large-scale use of tags in multiplexed analyses.
In some systems, tags have been synthesized by combinatorial methods in order to efficiently generate large sets, e.g. Lerner et al (cited above); Dower et al, U.S. Pat. No. 5,770,358; and Brenner et al, U.S. Pat. No. 5,763,175. However, such systems require that selected subsets of tags be individually decoded so that the analytes of interest can be identified. In Brenner's system, the decoding is accomplished by hybridizing copies of the tags to an array of tag complements. Even though individual “words” making up the tags are minimally cross-hybridizing, the tags as a whole are capable of forming spurious duplexes with unintended complements when an N-word tag forms a perfectly matched duplex with N-1 consecutive words of a complement. Such spurious duplexes could be avoided by using tags consisting of “words” that make up a so-called “comma-less” code, e.g. Crick et al, Proc. Natl. Acad. Sci., 43: 416-421 (1957).
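As an illustration only, the “comma-less” property in the classical sense of Crick et al can be sketched as a short check: a set of equal-length words is comma-less if, when any two words are written side by side, no out-of-register window spanning the junction is itself a word of the set. The function name `is_comma_free` and the example words below are hypothetical and are not taken from any of the cited systems; the sketch tests only the word-set property and does not model hybridization.

```python
from itertools import product


def is_comma_free(words):
    """Return True if `words` is a comma-less code in the classical sense.

    Assumption: all words have the same length n. For every ordered pair
    (u, v), including u == v, none of the n-1 out-of-register windows of
    the concatenation u + v may itself be a word of the set.
    """
    words = set(words)
    lengths = {len(w) for w in words}
    if len(lengths) != 1:
        raise ValueError("all words must have the same length")
    n = lengths.pop()
    for u, v in product(words, repeat=2):
        joined = u + v
        # Shifts 1..n-1 are the readings that straddle the word boundary.
        for shift in range(1, n):
            if joined[shift:shift + n] in words:
                return False
    return True


# Hypothetical three-letter words, chosen only to show both outcomes.
print(is_comma_free({"AAC", "AAG"}))  # True
print(is_comma_free({"ACG", "CGA"}))  # False: "ACGACG" contains "CGA" at shift 1
```

In the second set, the out-of-register reading of two adjacent words reproduces a legitimate word, which is the kind of ambiguity a comma-less word set rules out.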
In view of the above, many fields, such as medical and industrial genetics, drug discovery, and the like, would benefit from the availability of a versatile high throughput platform for carrying out a multitude of different analytical assays. In particular, many advantages would accrue from a tag-based analytical platform that (i) provided analytical reagents using existing microarray technology, (ii) employed a common microarray-based readout, (iii) permitted the simultaneous synthesis of large numbers of tag-analyte interaction moieties in the same reaction, and (iv) used combinatorial tags made of words having the “comma-less” property. Such advantages include the economies of high volume production, use of the widespread expertise in microarray technology, and use of the installed base of microarray analyzers.