Recent research has revealed that the number of genes present in the human genome is between 30,000 and 50,000, and that the number of genes present in the genome of Arabidopsis, the first higher plant genome to be fully sequenced, is around 40,000 (Mayer, K., et al. Nature. Dec. 16, 1999;402(6763): 731–2). The genomes of simpler organisms are smaller, ranging down to a few thousand genes in bacteria, such as Escherichia coli. 
Multicellular organisms contain many different cell types, each with different functional characteristics. All of these different cell types contain the same databank of genetic information and roughly the same number of genes. Thus, for example, if one cell derived from an organism has a genetic complement of 40,000 genes, then nearly all cells derived from the same organism are likely to have the same set of 40,000 genes. However, not all of these genes are expressed in every cell type of the organism. Rather, different combinations of genes are expressed in different cell types. For example, in animals, one combination of genes is expressed in liver cells and a different combination in brain cells. It is the profile of genes expressed in any given cell type that determines the functional characteristics of that cell type. Likewise, cells adapt their functioning in response to inputs from the environment and from other parts of the organism's physiology. In most cases this adaptation involves turning some genes on and other genes off. For instance, a normal liver cell will express a very different set of genes from that expressed by a liver tumor cell, although some genes may be expressed by most or all cells of a given multicellular organism. Furthermore, environmental effects, such as extracellular signals, can induce a change in the number and types of genes expressed in a given cell type, or in their relative expression levels.
Understanding which genes are expressed in which cell types and under which conditions is key to understanding living systems. More practically, a knowledge of which genes are expressed at higher or lower levels in disease states provides data that can be extremely valuable in identifying new drug targets, and ultimately new pharmaceuticals or other therapeutics. Information about gene expression is also useful in tailoring therapeutic approaches for individual patients based on their own genetic expression response to different pharmaceuticals, diet, or environmental stimuli.
Current approaches to measuring gene expression (e.g. Northern blot hybridization, dot blot hybridization, in vitro translation and immunoprecipitation, and hybridization with microarrays constructed with probes designed to detect specific mRNA) usually require prior knowledge of which genes are of interest in order to design sequence-specific probes and primers (Sambrook, et al., Molecular Cloning: A Laboratory Manual, Third Edition, Cold Spring Harbor, N.Y. (2000)). These targeted approaches may miss important information about altered gene expression because they never examine the expression of unknown genes, or of genes not thought to be of great importance. For example, in humans there are currently believed to be about 50,000 genes that can be expressed in any given cell type. Thus, to determine the active complement of genes in a given cellular genome, one needs to efficiently scan a given cell type quantitatively, or at least semi-quantitatively, for all of the active, i.e. expressed, genes.
A need thus exists in the field of gene expression analysis for methods and assay configurations to screen all active genes in different cells under different conditions or states for possible changes in gene expression. Systems for monitoring and analysis of gene expression should preferably screen gene expression universally, without regard to cell type or species.
The use of very high density oligonucleotide arrays is one technology area that may be exploited for just such a universal system. U.S. Pat. No. 6,344,316 reports the use of such high density arrays for “generic difference screening” of gene expression. The screening method of this patent employs an array of oligonucleotide probes, wherein the location and sequence of each different probe is unique and known and wherein the probes are not chosen to hybridize to nucleic acid derived from particular pre-selected genes. Probe oligonucleotides are described as chosen by random selection, haphazard selection, nucleotide composition biased selection, or as all possible oligonucleotide combinations of a chosen oligonucleotide length. For example, 4^8, or 65,536, distinct, locatable array spots are required to encompass all possible permutations of 4 bases (A, G, C and T) in an 8 base oligonucleotide probe. While containing an enormous amount of information about the sequence of oligonucleotides bound to these probes, arrays comprising such a large number of different probe spots containing unique probes are very costly.
However, an exhaustive array that would be effective in stringent discrimination of single base pair mismatches would comprise a very large plurality of features. Specifically, oligonucleotides of between 11 and 20 base pairs in length would be required to achieve stringent discrimination of single base pair mismatches. An exhaustive microarray employing 11 base oligonucleotides would have 4^11 = 4.2×10^6 spots. An exhaustive microarray employing 20 base oligonucleotides would comprise 4^20 = 1.1×10^12 spots. Both of these designs are far too large for practical use. Moreover, such microarrays would be prohibitively expensive to manufacture.
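The feature counts quoted above follow from simple counting: each position of a probe can hold any of the four bases, so an exhaustive array of n base probes requires 4^n features. The following is a minimal Python sketch of that arithmetic, offered for illustration only; the names exhaustive_feature_count and enumerate_probes are hypothetical and do not form part of the disclosure.

```python
from itertools import product

BASES = "AGCT"  # the four DNA bases named in the text


def exhaustive_feature_count(probe_length: int) -> int:
    """Return the number of distinct probes of the given length over four bases."""
    if probe_length < 0:
        raise ValueError("probe_length must be non-negative")
    return len(BASES) ** probe_length


def enumerate_probes(probe_length: int):
    """Yield every possible probe sequence of the given length.

    Practical only for short probes; the count grows as 4**probe_length.
    """
    for combo in product(BASES, repeat=probe_length):
        yield "".join(combo)


if __name__ == "__main__":
    # Probe lengths discussed in the text: 8, 11 and 20 bases.
    for n in (8, 11, 20):
        print(f"{n} base probes: {exhaustive_feature_count(n):,} features")
```

Run as a script, this prints 65,536 features for 8 base probes, 4,194,304 for 11 base probes, and 1,099,511,627,776 for 20 base probes, consistent with the rounded figures given above.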
The present invention provides an integrated system of design for probes, primers, and microarrays, and strategies for labeling nucleic acids. The system generates microarrays that address the problems posed above. The microarrays of this invention (1) are capable of exhaustive screening of the gene expression profile of an unknown cell type or organism, and (2) are economical to manufacture.