Many investigators are presently working on the problem of identifying specific genes or polynucleotide sequences within a larger population of sequences and quantifying their concentration. As an example, consider the problem of measuring the presence and concentration of actin gene mRNA transcripts in a cell's RNA population, or the concentration of the actin cDNA gene in a population of cDNA molecules, or the identity and concentration of the actin gene in a genome. The problem becomes more complex when one wants not only to measure the actin gene but to identify and quantify every possible gene or genetic sequence in the population. The population could be immensely complex and contain on the order of 10^5 different (i.e., distinct) DNA sequences. RNA populations can be dealt with by converting them to a cDNA population using oligo dT primers or random primers. The RNA problem then becomes effectively the same as the DNA problem.
Typically, each polynucleotide sequence is from 10^2 to 10^4 bases in length, probably with a mean of about 10^3 bases. The number of different sequences in a cell's mRNA population is likely to be less than 10^5 (the estimated number of genes), and is probably in the range of about 10^4. The typical case would therefore be a population of about 10^4 different DNA sequences, with each sequence being about 10^3 bases long. However, the complexity of a particular polynucleotide population may actually be much higher.
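The scale implied by these estimates can be made concrete with a short calculation. The following is only an illustrative sketch: the function name `population_complexity` is hypothetical, and the inputs are the order-of-magnitude figures given above rather than measured values.

```python
def population_complexity(num_sequences: int, mean_length: int) -> int:
    """Total number of bases summed over all distinct sequences."""
    return num_sequences * mean_length


# Typical case from the text: about 10^4 distinct sequences of about 10^3 bases.
typical = population_complexity(10**4, 10**3)

# Upper estimate: about 10^5 distinct sequences (the estimated number of genes).
upper = population_complexity(10**5, 10**3)

print(f"typical case:   {typical:,} bases")
print(f"upper estimate: {upper:,} bases")
```

Under these assumptions, a typical population represents on the order of 10^7 bases of distinct sequence, which is why sampling only part of each polynucleotide is attractive.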
Strategies have been suggested to provide less cumbersome ways to analyze these large populations of polynucleotides. For example, Velculescu et al., Science 270, 484, 1995, propose a regimen for serial analysis of gene expression (or "SAGE"). SAGE allows the construction of a more uniform population of unique polynucleotides from the larger population which is to be analyzed. SAGE employs a combination of a "sampling" restriction enzyme (having a 4 bp recognition site) and a "tagging" restriction enzyme (a type IIs restriction enzyme) to produce unique 13-base tags (a 4-base common sequence combined with a 9-base variable unique sequence) from each polynucleotide in the population under analysis. These tags are then concatamerized and sequenced to determine the identity of the tags. The sequenced tags can then be compared with known sequences to determine which genes or sequences were in the analyzed population. However, the SAGE technology is still a very cumbersome process. Sequence analysis of the thousands of concatamerized tags can be a labor- and resource-intensive process, thus limiting the applicability of the process where speed and ease of analysis are desirable or required.
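The tag structure described above (a 4-base common sequence followed by a 9-base variable sequence) can be illustrated in silico. The following is a minimal sketch and not the SAGE protocol itself. It assumes that the sampling enzyme recognizes CATG (the text specifies only a 4 bp site), that the tag is taken at the 3'-most site having nine bases downstream, and that the names `extract_tag` and `count_tags` are hypothetical helpers introduced for illustration.

```python
from collections import Counter

SAMPLING_SITE = "CATG"  # assumption: one possible 4 bp recognition site
VARIABLE_LEN = 9        # 9-base variable portion of the 13-base tag

# Number of distinct 9-base variable sequences that are possible.
TAG_SPACE = 4 ** VARIABLE_LEN


def extract_tag(seq, site=SAMPLING_SITE, variable_len=VARIABLE_LEN):
    """Return the site plus the following variable bases, or None.

    The 3'-most site is tried first; if fewer than `variable_len`
    bases follow it, the next site upstream is tried instead.
    """
    seq = seq.upper()
    pos = seq.rfind(site)
    while pos != -1:
        start = pos + len(site)
        if start + variable_len <= len(seq):
            return seq[pos:start + variable_len]
        # Restrict the search to sites that begin before the current one.
        pos = seq.rfind(site, 0, pos + len(site) - 1)
    return None


def count_tags(sequences):
    """Tally how often each tag occurs across a population of sequences."""
    tags = (extract_tag(s) for s in sequences)
    return Counter(t for t in tags if t is not None)


if __name__ == "__main__":
    # Toy sequences made up for illustration; not real gene sequences.
    population = [
        "GGA" + "CATG" + "ACGTACGTA" + "TTT",
        "CCC" + "CATG" + "ACGTACGTA" + "AAAA",
        "TTTT" + "CATG" + "GGGGCCCCA" + "AA",
    ]
    for tag, n in count_tags(population).items():
        print(tag, n)
```

With nine variable bases there are 4^9 = 262,144 possible tags, which exceeds the roughly 10^4 to 10^5 distinct sequences expected in the population; this is the sense in which a short tag can serve to identify its source sequence, although distinct sequences can still share a tag.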
Therefore, it would be desirable to provide improved, more efficient methods for decreasing the complexity of a polynucleotide population which is to be analyzed. It would also be desirable to provide methods for characterizing a population of polynucleotides by sampling only part of each polynucleotide. It would further be desirable to provide methods for producing unique sequence tags from a large population under analysis and to provide efficient methods for analysis of such tags.