The invention relates generally to methods for identifying, sorting, and/or tracking molecules, especially polynucleotides, with oligonucleotide tags, and more particularly, to a method of sorting and analyzing such tagged polynucleotides by specific hybridization of the tags to their complements.
Specific hybridization of oligonucleotides and their analogs is a fundamental process that is employed in a wide variety of research, medical, and industrial applications, including the identification of disease-related polynucleotides in diagnostic assays, screening for clones of novel target polynucleotides, identification of specific polynucleotides in blots of mixtures of polynucleotides, amplification of specific target polynucleotides, therapeutic blocking of inappropriately expressed genes, DNA sequencing, and the like, e.g. Sambrook et al, Molecular Cloning: A Laboratory Manual, 2nd Edition (Cold Spring Harbor Laboratory, New York, 1989); Keller and Manak, DNA Probes, 2nd Edition (Stockton Press, New York, 1993); Milligan et al, J. Med. Chem., 36: 1923-1937 (1993); Drmanac et al, Science, 260: 1649-1652 (1993); Bains, J. DNA Sequencing and Mapping, 4: 143-150 (1993).
Specific hybridization has also been proposed as a method of tracking, retrieving, and identifying compounds labeled with oligonucleotide tags. For example, in multiplex DNA sequencing, oligonucleotide tags are used to identify electrophoretically separated bands on a gel that consist of DNA fragments generated in the same sequencing reaction. In this way, DNA fragments from many sequencing reactions are separated on the same lane of a gel which is then blotted with separate solid phase materials on which the fragment bands from the separate sequencing reactions are visualized with oligonucleotide probes that specifically hybridize to complementary tags, Church et al, Science, 240: 185-188 (1988). Similar uses of oligonucleotide tags have also been proposed for identifying explosives, potential pollutants, such as crude oil, and currency for prevention and detection of counterfeiting, e.g. reviewed by Dollinger, pages 265-274 in Mullis et al, editors, The Polymerase Chain Reaction (Birkhauser, Boston, 1994). More recently, systems employing oligonucleotide tags have also been proposed as a means of manipulating and identifying individual molecules in complex combinatorial chemical libraries, for example, as an aid to screening such libraries for drug candidates, Brenner and Lerner, Proc. Natl. Acad. Sci., 89: 5381-5383 (1992); Alper, Science, 264: 1399-1401 (1994); and Needels et al, Proc. Natl. Acad. Sci., 90: 10700-10704 (1993).
The successful implementation of such tagging schemes depends in large part on the success in achieving specific hybridization between a tag and its complementary probe. That is, for an oligonucleotide tag to successfully identify a substance, the number of false positive and false negative signals must be minimized. Unfortunately, such spurious signals are not uncommon because base pairing and base stacking free energies vary widely among nucleotides in a duplex or triplex structure. For example, a duplex consisting of a repeated sequence of deoxyadenosine (A) and thymidine (T) bound to its complement may have less stability than an equal-length duplex consisting of a repeated sequence of deoxyguanosine (G) and deoxycytidine (C) bound to a partially complementary target containing a mismatch. Thus, if a desired compound from a large combinatorial chemical library were tagged with the former oligonucleotide, a significant possibility would exist that, under hybridization conditions designed to detect perfectly matched AT-rich duplexes, undesired compounds labeled with the GC-rich oligonucleotide, even in a mismatched duplex, would be detected along with the perfectly matched duplexes consisting of the AT-rich tag. In the molecular tagging system proposed by Brenner et al (cited above), the related problem of mis-hybridizations of closely related tags was addressed by employing a so-called "comma-less" code, which ensures that a probe out of register (or frame shifted) with respect to its complementary tag would result in a duplex with one or more mismatches for each of its five or more three-base words, or "codons."
Even though reagents, such as tetramethylammonium chloride, are available to negate base-specific stability differences of oligonucleotide duplexes, the effect of such reagents is often limited and their presence can be incompatible with, or render more difficult, further manipulations of the selected compounds, e.g. amplification by polymerase chain reaction (PCR), or the like.
Such problems have made the simultaneous use of multiple hybridization probes in the analysis of multiple or complex genetic loci, e.g. via multiplex PCR, reverse dot blotting, or the like, very difficult. As a result, direct sequencing of certain loci, e.g. HLA genes, has been promoted as a reliable alternative to indirect methods employing specific hybridization for the identification of genotypes, e.g. Gyllensten et al, Proc. Natl. Acad. Sci., 85: 7652-7656 (1988).
The ability to sort cloned and identically tagged DNA fragments onto distinct solid phase supports would facilitate such sequencing, particularly when coupled with a non-gel-based sequencing methodology simultaneously applicable to many samples in parallel.
In view of the above, it would be useful if there were available an oligonucleotide-based tagging system which provided a large repertoire of tags, but which also minimized the occurrence of false positive and false negative signals without the need to employ special reagents for altering natural base pairing and base stacking free energy differences. Such a tagging system would find applications in many areas, including construction and use of combinatorial chemical libraries, large-scale mapping and sequencing of DNA, genetic identification, medical diagnostics, and the like.
An object of my invention is to provide a molecular tagging system for tracking, retrieving, and identifying compounds.
Another object of my invention is to provide a method for sorting identical molecules, or subclasses of molecules, especially polynucleotides, onto surfaces of solid phase materials by the specific hybridization of oligonucleotide tags and their complements.
A further object of my invention is to provide a method for analyzing gene expression patterns in diseased and normal tissues.
A still further object of my invention is to provide a system for tagging and sorting many thousands of fragments, especially randomly overlapping fragments, of a target polynucleotide for simultaneous analysis and/or sequencing.
Another object of my invention is to provide a rapid and reliable method for sequencing target polynucleotides having a length in the range of a few hundred basepairs to several tens of thousands of basepairs.
A further object of my invention is to provide a method for reducing the number of separate template preparation steps required in large scale sequencing projects employing conventional Sanger-based sequencing techniques.
My invention achieves these and other objects by providing a method and materials for tracking, identifying, and/or sorting classes or subpopulations of molecules by the use of oligonucleotide tags. An important feature of the invention is that the oligonucleotide tags are members of a minimally cross-hybridizing set of oligonucleotides. The sequence of each oligonucleotide of such a set differs from the sequence of every other member of the same set by at least two nucleotides. Thus, each member of such a set cannot form a duplex (or triplex) with the complement of any other member with fewer than two mismatches. Complements of oligonucleotide tags of the invention, referred to herein as "tag complements," may comprise natural nucleotides or non-natural nucleotide analogs. Preferably, tag complements are attached to solid phase supports. Such oligonucleotide tags when used with their corresponding tag complements provide a means of enhancing specificity of hybridization for sorting, tracking, or labeling molecules, especially polynucleotides.
Minimally cross-hybridizing sets of oligonucleotide tags and tag complements may be synthesized either combinatorially or individually depending on the size of the set desired and the degree to which cross-hybridization is sought to be minimized (or stated another way, the degree to which specificity is sought to be enhanced). For example, a minimally cross-hybridizing set may consist of a set of individually synthesized 10-mer sequences that differ from each other by at least 4 nucleotides, such set having a maximum size of 332 (when composed of 3 kinds of nucleotides and counted using a computer program such as disclosed in Appendix Ic). Alternatively, a minimally cross-hybridizing set of oligonucleotide tags may also be assembled combinatorially from subunits which themselves are selected from a minimally cross-hybridizing set. For example, a set of minimally cross-hybridizing 12-mers differing from one another by at least three nucleotides may be synthesized by assembling 3 subunits selected from a set of minimally cross-hybridizing 4-mers that each differ from one another by three nucleotides. Such an embodiment gives a maximally sized set of 9^3, or 729, 12-mers. The number 9 is the number of oligonucleotides listed by the computer program of Appendix Ia, which assumes, as with the 10-mers, that only 3 of the 4 different types of nucleotides are used. The set is described as "maximal" because the computer programs of Appendices Ia-c provide the largest set for a given input (e.g. length, composition, difference in number of nucleotides between members). Additional minimally cross-hybridizing sets may be formed from subsets of such calculated sets.
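The 4-mer and 12-mer example above can be illustrated with a short Python sketch. It is not the program of Appendices Ia-c, which is not reproduced here; it substitutes a simple greedy search, and the function names and the three-letter alphabet "ACT" are assumptions chosen only for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_set(alphabet, length, min_diff):
    """Greedily collect words that differ pairwise in at least min_diff positions.

    Words are visited in lexicographic order; a word is kept only if it is
    far enough from every word already kept.
    """
    chosen = []
    for letters in product(alphabet, repeat=length):
        word = "".join(letters)
        if all(hamming(word, other) >= min_diff for other in chosen):
            chosen.append(word)
    return chosen


# Assumed alphabet: any 3 of the 4 nucleotides; "ACT" is an arbitrary choice.
subunits = greedy_set("ACT", length=4, min_diff=3)

# Combinatorial assembly: every ordered choice of 3 subunits gives one 12-mer.
tags = ["".join(parts) for parts in product(subunits, repeat=3)]

print(len(subunits), len(tags))
```

For these particular inputs the greedy pass finds 9 four-mers and therefore 9^3 = 729 twelve-mers, matching the figures in the text. Two distinct 12-mers built this way differ in at least one subunit, hence in at least three nucleotides. A greedy pass is not guaranteed to reach the maximum set size for other lengths or difference thresholds.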
Oligonucleotide tags may be single stranded and be designed for specific hybridization to single stranded tag complements by duplex formation or for specific hybridization to double stranded tag complements by triplex formation. Oligonucleotide tags may also be double stranded and be designed for specific hybridization to single stranded tag complements by triplex formation.
When synthesized combinatorially, an oligonucleotide tag of the invention preferably consists of a plurality of subunits, each subunit consisting of an oligonucleotide of 3 to 9 nucleotides in length wherein each subunit is selected from the same minimally cross-hybridizing set. In such embodiments, the number of oligonucleotide tags available depends on the number of subunits per tag and on the length of the subunits. The number is generally much less than the number of all possible sequences of the same length as the tag, which for a tag n nucleotides long would be 4^n.
In one aspect of my invention, complements of oligonucleotide tags attached to a solid phase support are used to sort polynucleotides from a mixture of polynucleotides each containing a tag. In this embodiment, complements of the oligonucleotide tags are synthesized on the surface of a solid phase support, such as a microscopic bead or a specific location on an array of synthesis locations on a single support, such that populations of identical sequences are produced in specific regions. That is, the surface of each support, in the case of a bead, or of each region, in the case of an array, is derivatized by only one type of complement which has a particular sequence. The population of such beads or regions contains a repertoire of complements with distinct sequences. As used herein in reference to oligonucleotide tags and tag complements, the term "repertoire" means the minimally cross-hybridizing set of oligonucleotides that make up the tags in a particular embodiment or the corresponding set of tag complements.
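As a small worked illustration of this comparison, the following Python sketch counts the tags available from a given subunit set against all sequences of the same total length. The function name and its parameters are hypothetical; the sample values (9 four-nucleotide subunits, 3 subunits per tag) are taken from the 12-mer example given earlier.

```python
def tag_counts(num_subunit_words, subunits_per_tag, subunit_length):
    """Return (tags available, all possible sequences of the same total length)."""
    tag_length = subunits_per_tag * subunit_length
    available = num_subunit_words ** subunits_per_tag  # one word chosen per position
    all_sequences = 4 ** tag_length                    # 4^n for a tag n nucleotides long
    return available, all_sequences


available, all_sequences = tag_counts(9, 3, 4)
print(available, all_sequences)  # 729 16777216
```

Here 729 tags are available out of 4^12 = 16,777,216 possible 12-mers.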
The polynucleotides to be sorted each have an oligonucleotide tag attached, such that different polynucleotides have different tags. As explained more fully below, this condition is achieved by employing a repertoire of tags substantially greater than the population of polynucleotides and by taking a sufficiently small sample of tagged polynucleotides from the full ensemble of tagged polynucleotides. After such sampling, when the populations of supports and polynucleotides are mixed under conditions which permit specific hybridization of the oligonucleotide tags with their respective complements, identical polynucleotides sort onto particular beads or regions. The sorted populations of polynucleotides can then be manipulated on the solid phase support by micro-biochemical techniques.
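The effect of taking a small sample relative to the repertoire can be estimated with a rough model. The Python sketch below assumes, purely for illustration, that each sampled polynucleotide carries a tag drawn independently and uniformly from the repertoire; the function name and the numerical values are hypothetical and are not taken from the specification.

```python
def fraction_uniquely_tagged(repertoire_size, sample_size):
    """Expected fraction of sampled molecules whose tag is shared with no other
    molecule in the sample, assuming tags are independent and uniformly random."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    # A given molecule is unique if each of the other (sample_size - 1)
    # molecules carries a different tag.
    return (1.0 - 1.0 / repertoire_size) ** (sample_size - 1)


# Hypothetical repertoire of one million tags and three sample sizes.
for sample_size in (100, 1_000, 10_000):
    print(sample_size, round(fraction_uniquely_tagged(1_000_000, sample_size), 4))
```

Under this model the fraction of uniquely tagged molecules approaches 1 as the sample becomes small relative to the repertoire, which is the condition the paragraph above relies on.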
Generally, the method of my invention comprises the following steps: (a) attaching an oligonucleotide tag from a repertoire of tags to each molecule in a population of molecules (i) such that substantially all different molecules or different subpopulations of molecules in the population have different oligonucleotide tags attached and (ii) such that each oligonucleotide tag from the repertoire is selected from the same minimally cross-hybridizing set; and (b) sorting the molecules of the population onto one or more solid phase supports by specifically hybridizing the oligonucleotide tags with their respective complements attached to such supports.
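Steps (a) and (b) can be pictured with a toy Python model in which hybridization is idealized as exact base complementarity between a tag and the tag complement on a support. This is a deliberate simplification that ignores hybridization conditions and mismatched duplexes; the tag sequences, bead names, and function names are hypothetical.

```python
from collections import defaultdict

_PAIR = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Watson-Crick reverse complement of a DNA sequence."""
    return seq.translate(_PAIR)[::-1]


def sort_by_tag(tagged_molecules, supports):
    """Assign each (tag, payload) pair to the support carrying the tag's complement.

    supports maps a tag-complement sequence to a support identifier.
    Molecules whose tag has no complement on any support are left unsorted.
    """
    sorted_onto = defaultdict(list)
    for tag, payload in tagged_molecules:
        support = supports.get(reverse_complement(tag))
        if support is not None:
            sorted_onto[support].append(payload)
    return dict(sorted_onto)


# Hypothetical two-tag repertoire; each bead carries exactly one complement.
example_tags = ["AAAACTTT", "CTTTAAAA"]
supports = {reverse_complement(t): f"bead-{i}" for i, t in enumerate(example_tags)}

molecules = [
    ("AAAACTTT", "fragment-1"),
    ("CTTTAAAA", "fragment-2"),
    ("AAAACTTT", "fragment-1"),
]
result = sort_by_tag(molecules, supports)
print(result)
```

Identical molecules carrying the same tag end up on the same bead, while differently tagged molecules are separated.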
An important aspect of my invention is the use of the oligonucleotide tags to sort polynucleotides for parallel sequence determination. Preferably, such sequencing is carried out by the following steps: (a) generating from the target polynucleotide a plurality of fragments that cover the target polynucleotide; (b) attaching an oligonucleotide tag from a repertoire of tags to each fragment of the plurality (i) such that substantially all different fragments have different oligonucleotide tags attached and (ii) such that each oligonucleotide tag from the repertoire is selected from the same minimally cross-hybridizing set; (c) sorting the fragments onto one or more solid phase supports by specifically hybridizing the oligonucleotide tags with their respective complements attached to the solid phase supports; (d) determining the nucleotide sequence of a portion of each of the fragments of the plurality, preferably by a single-base sequencing methodology as described below; and (e) determining the nucleotide sequence of the target polynucleotide by collating the sequences of the fragments.
Another important aspect of my invention is the determination of a profile, or a frequency distribution, of genes being expressed in a given tissue or cell type, wherein each such gene is identified by a portion of its sequence. Preferably, such frequency distribution is determined by the following steps: (a) forming a cDNA library from a population of mRNA molecules, each cDNA molecule in the cDNA library having an oligonucleotide tag attached, (i) such that substantially all different cDNA molecules have different oligonucleotide tags attached and (ii) such that each oligonucleotide tag from the repertoire is selected from the same minimally cross-hybridizing set; (b) sorting the cDNA molecules by specifically hybridizing the oligonucleotide tags with their respective complements attached to one or more solid phase supports; (c) determining the nucleotide sequence of a portion of each of the sorted cDNA molecules; and (d) forming a frequency distribution of mRNA molecules from the nucleotide sequences of the portions of sorted cDNA molecules.
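Step (d) amounts to tallying the partial sequences read from the sorted cDNA molecules. A minimal Python sketch follows; the function name and the short sequences are hypothetical placeholders, not data from the specification.

```python
from collections import Counter


def expression_profile(signatures):
    """Relative frequency of each partial sequence among the sorted cDNA molecules."""
    counts = Counter(signatures)
    total = sum(counts.values())
    return {sig: n / total for sig, n in counts.items()}


# Hypothetical partial sequences, one per sorted cDNA molecule.
sorted_signatures = ["GATTACAGG", "CCTGAATGC", "GATTACAGG", "GATTACAGG"]
profile = expression_profile(sorted_signatures)
print(profile)
```

Each key identifies a gene by a portion of its sequence, and each value is the fraction of sorted molecules carrying that portion.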
My invention overcomes a key deficiency of current methods of tagging or labeling molecules with oligonucleotides: By coding the sequences of the tags in accordance with the invention, the stability of any mismatched duplex or triplex between a tag and a complement to another tag is far lower than that of any perfectly matched duplex between the tag and its own complement. Thus, the problem of incorrect sorting because of mismatch duplexes of GC-rich tags being more stable than perfectly matched AT-rich tags is eliminated.
When used in combination with solid phase supports, such as microscopic beads, my invention provides a readily automated system for manipulating and sorting polynucleotides, particularly useful in large-scale parallel operations, such as large-scale DNA sequencing, wherein many target polynucleotides or many segments of a single target polynucleotide are sequenced and/or analyzed simultaneously.