This invention is directed to methods and kits for creating and analyzing molecules using uniquely identifiable tags. The invention is also directed to methods and kits that use uniquely identifiable tags for sequencing DNA, for determining mutations, including substitutions, deletions, and additions, in sample genes, and monitoring mRNA populations.
Biologists and chemists have long sought methods to identify a given molecule in a collection of thousands or millions or more of different molecular species. In large mixtures of many different molecules, it is challenging to identify any one molecule or molecular species rapidly. It is often even more difficult to identify several hundred or thousand non-identical or dissimilar species within a collection of many thousands or millions or more of different molecular species. It would be beneficial to functionally tag or “bar code” large numbers of molecular species for rapid, simultaneous identification.
To this end, the idea of using molecules to identify other molecules has emerged. As one example, it is now possible to use combinatorial synthesis techniques to develop large or extremely large collections of different but similar molecular species.
Combinatorial chemistry methods permit the synthesis of large numbers of different molecules in a mixture. In standard “pool and split” combinatorial methods, each molecule in the mixture is associated with a tag or series of tags helpful in determining the identity of the molecule to which the tag is attached. See, for example, Ohlmeyer, M. H. J., et al., “Complex Synthetic Chemical Libraries Indexed With Molecular Tags” Proc. Natl. Acad. Sci. 90:10922-10926, 1993; Pinilla, C., et al., “Versatility of Positional Scanning Synthetic Combinatorial Libraries for the Identification of Individual Compounds” Drug Devel. Res. 33:133-145, 1994; Gallop, M. A., et al., “Applications of Combinatorial Technologies to Drug Discovery. 1. Background and Peptide Combinatorial Libraries.” J. Med. Chem. 37:1233-1251, 1994; Gordon, E. M., et al., “Applications of Combinatorial Technologies to Drug Discovery. 2. Combinatorial Organic Synthesis, Library Screening Strategies, and Future Directions.” J. Med. Chem. 37:1385-1401, 1994; Janda, K. D., “Tagged Versus Untagged Libraries: Methods for the Generation and Screening of Combinatorial Chemical Libraries.” Proc. Natl. Acad. Sci. 91:10779-10785, 1994; Dower, W. J., et al., PCT/US92/07815, WO 93/06121 “Method of Synthesizing Diverse Collections of Oligomers”; Matson, R. S. et al., U.S. Pat. No. 5,429,807, “Method and Apparatus for Creating Biopolymer Arrays on a Solid Support Surface”; Southern, E. M., et al., “Arrays of Complementary Oligonucleotides for Analyzing the Hybridization Behavior of Nucleic Acids.” Nucl. Acids. Res. 22:1368-1373, 1994; Southern, E. M., “DNA Fingerprinting by Hybridization to Oligonucleotide Arrays.” Electrophoresis 16:1539-1542, 1995; Drmanac, R. T. and Crkvenjakov, R. B., “Method of Determining an Ordered Sequence of Subfragments of a Nucleic Acid Fragment by Hybridization of Oligonucleotide Probes” U.S. Pat. No. 5,492,806; Drmanac, R. T. and Crkvenjakov, R. B., “Method of Sequencing by Hybridization of Oligonucleotide Probes” U.S. Pat. No. 5,525,464; McGall, G. H., et al., “Spatially-Addressable Immobilization of Oligonucleotides and Other Biological Polymers on Surfaces” U.S. Pat. No. 5,412,087; Dower, W. J. and Fodor, S. P. A., “Sequencing of Surface Immobilized Polymers Utilizing Microfluorescence Detection” U.S. Pat. No. 5,547,839; Fodor, S. P. A., et al., “Array of Oligonucleotides on a Solid Substrate” U.S. Pat. No. 5,445,934; and Fodor, S. P. A., “Synthesis and Screening of Immobilized Oligonucleotide Arrays” U.S. Pat. No. 5,510,270. Typically, a combinatorial synthesis will proceed in “stages” with two or more reaction vessels per stage. The purpose of each reaction vessel is to add a unique chemical moiety to a growing collection of chemical compounds.
Each moiety is also associated with a uniquely identifiable “tag.” The tag is typically attached to the same solid support to which the growing chemical compounds are attached. Thus, attachment of a tag to a solid support (typically a bead) records the particular reaction vessel through which the bead has passed during the synthesis. In pool and split strategies, after the tags are attached in a particular stage, all of the reaction vessel contents are pooled, mixed, and divided and dispersed into new reaction vessels in the next stage. Each moiety added in each new reaction vessel will also be associated with a unique tag added to the beads. Thus, the collection of tag molecules on each bead conveys the “synthetic pathway” through which the particular bead was placed.
In standard screening of combinatorial chemistry libraries, information regarding the order of addition of the tags and the linkage of tags to one another is not needed. Combinatorial chemical libraries are typically screened in the hopes of finding a few members giving the strongest positive signals in the screening assay. The screens are typically performed in separate reaction wells, where one or a few members of the combinatorial library (one or a few beads) are placed in each well. If a particular member scores positively, the composition of the compound can be determined by looking at the tags that are attached to the bead to which the compound is (or was) attached. If one is examining the tags attached to only a single bead, then the synthetic pathway can be identified.
For example, suppose that in the construction of a particular combinatorial chemical library there are four parallel chemical steps in each synthetic stage, and that there are four synthetic stages, each linked by a pool and split step. If there are 16 uniquely identifiable tag molecules available, then each bead will have four tag molecules associated with it (corresponding to the four stages of chemical synthesis). Each tag molecule becomes a marker for one of the 16 reaction vessels. Any particular bead will have traveled through four of the reaction vessels during the procedure, and the four tag molecules that become associated with the bead will reveal the “synthetic pathway” of the bead provided that each bead is examined separately.
There are instances, however, in which it would be desirable to examine 100 positive beads together. If each bead contains four types of tag molecules and all of the tags are released from the beads and examined together, it will not be possible to determine the 100 different pathways that were used. Since there are only 16 different tag types, many pathways will use the same tag types in some but not all of their synthetic steps.
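The ambiguity can be illustrated with a short sketch. The layout of four stages, four vessels per stage, and 16 tags follows the example above; the tag numbering, the function names, and the two example beads are assumptions made only for illustration.

```python
from itertools import product

STAGES = 4
VESSELS_PER_STAGE = 4

def tag_for(stage, vessel):
    """Hypothetical numbering: each of the 16 reaction vessels has its own tag, 0..15."""
    return stage * VESSELS_PER_STAGE + vessel

def tags_on_bead(pathway):
    """Set of tags carried by a bead; pathway lists the vessel visited at each stage."""
    return {tag_for(stage, vessel) for stage, vessel in enumerate(pathway)}

def consistent_pathways(observed_tags):
    """Every pathway whose four tags all appear in the observed tag set."""
    return [p for p in product(range(VESSELS_PER_STAGE), repeat=STAGES)
            if tags_on_bead(p) <= observed_tags]

# Number of distinct synthetic pathways in the library.
total_pathways = VESSELS_PER_STAGE ** STAGES

# Two example beads (assumed pathways).
bead_a = (0, 1, 2, 3)
bead_b = (3, 2, 1, 0)

# Tags read from one bead alone versus tags released from both beads together.
single = consistent_pathways(tags_on_bead(bead_a))
pooled = consistent_pathways(tags_on_bead(bead_a) | tags_on_bead(bead_b))

print(total_pathways, len(single), len(pooled))
```

In this sketch a bead examined alone is consistent with exactly one pathway, but pooling the tags of only two beads leaves 16 candidate pathways, because the released tags no longer record which tags shared a bead.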
Thus, a primary difficulty in using such techniques lies in screening all of the species for those containing the desired activities or properties and then analyzing the molecular makeup of such species. To this end it has been proposed to use unique combinations of nucleotides to identify protein sequences that are constructed with combinatorial synthesis techniques. Brenner, S. and R. A. Lerner, “Encoded Combinatorial Chemistry,” Proc. Natl. Acad. Sci. USA 89:5381-83 (June 1992). The Brenner method decodes the unique combinations of nucleotides by actually sequencing the nucleotide tags. Although this method may permit one to determine the identity of a large number of molecules in a combinatorial library, the method still requires the physical separation of the linked tags (oligonucleotides) themselves for individual analysis (by PCR and cloning followed by DNA sequencing). Thus, the method fails to identify a large subset of molecules simultaneously. It merely shifts the need from physical separation and isolation of the beads to physical separation and isolation (cloning) of amplicons. In addition, the Brenner method would not permit the use of tags as a substitute for traditional DNA sequencing methods, since the analysis of the tags relies on traditional DNA sequencing methods.
It has also been proposed that microelectronic devices can be used to identify particular species being built through combinatorial synthesis techniques. Nicolaou, K. C. et al., “Radiofrequency Encoded Combinatorial Chemistry,” Angew. Chem. Int. Ed. Eng. 34:2289-91 (1995). These techniques, however, require the physical separation of the linked tags from one another prior to the decoding of the information the tags have encoded about the target molecules. Thus, these methods are not very useful to identify simultaneously a large subset of target molecules. A method that allows the simultaneous identification or analysis of large subsets of target molecules contained within a very large collection of similar or dissimilar molecules would greatly enhance the power, usefulness, speed, and/or ease of such identification or analysis.
Nucleic acids represent a particularly interesting collection of target molecules with which to apply the invention. Nucleic acids typically are found in nature as collections or sequences of nucleotides. DNA and RNA exist as linear sequences of nucleotides, and such sequences are typically found with other such sequences to make populations of nucleic acid sequences. For example, total cellular RNA comprises many types of RNA, including ribosomal, messenger, nuclear, and transfer RNA. Each such type comprises a collection of sequences. There are many different transfer RNA (tRNA) molecules corresponding to the various amino acids. There are many different messenger RNA (mRNA) molecules corresponding to the various genes of a species. DNA is also found as mixtures of nucleotide sequences. DNA from plants and animals is typically found as mixtures of chromosomes, which are linear sequences of nucleotides.
It is often difficult to study large collections of nucleic acid sequences because it is usually not easy to distinguish one nucleic acid molecule from another. It would be advantageous to be able to identify hundreds or more of non-homologous nucleic acid molecules simultaneously within collections of thousands or millions of nucleic acid molecules.
Different nucleic acid sequences can be different in molecular weight, and they sometimes can be resolved by electrophoresis, chromatography, or mass spectroscopy. However, different nucleic acid sequences are not always different in length or molecular weight. Different nucleic acid sequences are, by definition, different in the linear order of their nucleotides.
Probes can be created to distinguish one nucleotide sequence from many others. Such probes are known to be of protein, nucleic acid, or other synthetic chemical composition. For example, DNA and RNA binding proteins can recognize and bind to a specific sequence in a nucleic acid molecule. However, the number of such binding proteins is somewhat limited. Restriction enzymes can cleave nucleic acid molecules into fragments; yet this usually involves destruction of the molecules themselves, and nucleic acid molecules will not always have different “restriction maps” for a given set of restriction enzymes. Moreover, restriction mapping the naturally occurring restriction sites in a large set of different nucleic acid molecules simultaneously can be very difficult, if not impossible, due to redundancies in the map patterns.
Nucleic acids can be tagged with hapten molecules that can be recognized by antibody molecules. However, the number of available hapten/antibody sets is limited. Nucleic acid molecules can be tagged with fluorescent dyes. The number of known fluorescent dyes with non-overlapping visible emission spectra, however, is fairly small. Nucleic acid molecules can be tagged with radioactive markers, but the number of known independently distinguishable radioisotopes that can be functionally incorporated into nucleic acids is also small. Nucleic acids can be tagged with enzymes, but the number of known independently distinguishable enzymes that can be functionally incorporated into nucleic acids is also small. Any one of these detection strategies, acting independently, can be limited. As discussed below, an aspect of the present invention is to combine strategies to encode more information about the target nucleic acid sequences.
Two different techniques have been developed to try to screen target DNA populations by using complementary nucleic acid probe hybridization to form a specific duplex under conditions where non-complementary sequences usually will not form a duplex. For any given target nucleic acid, a nucleic acid probe molecule complementary to all or some of the target DNA can usually be synthesized chemically. If the sequence of the target is unknown, a large number of different nucleic acid probes can be synthesized. However, one must have a method to identify the nucleic acid probes being used to identify the nucleic acid targets. One of the two approaches has been to “bin” the different probes into different wells (test tubes) and to determine if a particular member of the target population can bind specifically to the probe molecule. This tedious method requires dispensing thousands of different probes into thousands of different bins and then testing the target nucleic acid population in each of the thousands of bins.
The second approach is an extension of the bin method, and uses a two-dimensional grid in place of the bins. See, e.g., Southern, E. M., et al., “Arrays of Complementary Oligonucleotides for Analyzing the Hybridization Behavior of Nucleic Acids.” Nucl. Acids. Res. 22:1368-1373, 1994; Southern, E. M., “DNA Fingerprinting by Hybridization to Oligonucleotide Arrays.” Electrophoresis 16:1539-1542, 1995; Drmanac, R. T. and Crkvenjakov, R. B., “Method of Determining an Ordered Sequence of Subfragments of a Nucleic Acid Fragment by Hybridization of Oligonucleotide Probes” U.S. Pat. No. 5,492,806; Drmanac, R. T. and Crkvenjakov, R. B., “Method of Sequencing by Hybridization of Oligonucleotide Probes” U.S. Pat. No. 5,525,464; U.S. Pat. No. 5,412,087; and U.S. Pat. No. 5,445,934. In the gridding method, a relatively large number of nucleic acid probe molecules are synthesized on a two-dimensional solid support such that the coordinates or physical location (address) of each probe conveys its sequence identity. Since the probes are permanently attached to the solid support, they can be exposed to the target nucleic acids simultaneously without the need for physical separation. Such gridding methods make it possible to display hundreds of thousands of probes to a target sample simultaneously.
The gridding method suffers from several limitations, however. If the probes are chemically synthesized, they are typically 20 nucleotides or shorter in length, and it is not always trivial to find conditions where only the desired short probe duplex will form without undesired duplexes forming. For example, nucleic acids that are rich in adenines and thymines (A:T rich) do not form duplexes that are as stable as nucleic acids that are rich in guanines and cytosines (G:C rich) under the same reaction conditions. If the hybridization temperature is too high, certain A:T rich sequences will melt whereas G:C rich sequences will remain hybridized. However, if the temperature is lowered for A:T rich binding, certain G:C rich duplexes having some mismatched base pairs can form. Therefore, it is sometimes difficult to create a large collection of short, sequence-specific probes that will operate well together under a single set of conditions.
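The A:T versus G:C imbalance can be made concrete with the Wallace rule, a widely used rule of thumb for short oligonucleotides that estimates melting temperature as 2 °C for each A or T plus 4 °C for each G or C. The rule is only a rough approximation and is not part of the methods described here; the two 20-nucleotide sequences below are artificial examples chosen for illustration.

```python
def wallace_tm(seq):
    """Rough melting temperature (deg C) of a short oligonucleotide by the Wallace rule."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G, and T")
    return 2 * at + 4 * gc

# Artificial 20-mers: one entirely A:T, one entirely G:C.
at_rich = "AT" * 10
gc_rich = "GC" * 10

print(wallace_tm(at_rich), wallace_tm(gc_rich))
```

Under this approximation the two probes differ by tens of degrees, so no single hybridization temperature suits both, which is the difficulty described above.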
Longer probes can be created from biological sources or in vitro amplification strategies. These probes often do not suffer from the A:T/G:C content problem of some shorter probes, since the base content of sequences tends to average out over longer stretches. However, long probe grids are more expensive to make and, under their current configurations, often are not able to detect small changes (such as mutations) in the target nucleic acid sample. While short probes may detect such mutations by hybridization, they can only do so well if the particular mutations were anticipated, and the matrix was designed to detect them.
There are other limitations to two-dimensional grid analysis. The concentration of a probe available for interaction is limited by the amount of the probe that can be attached to the solid support. In addition, the target nucleic acids must diffuse to the probe since the bound probe cannot diffuse to the target nucleic acids. These factors diminish reaction rates and signal strength for such two-dimensional formats. These limitations may be obviated in a liquid phase hybridization system. In a liquid phase hybridization system, the concentration of the probe would not be limited by the solid support, both the target nucleic acids and the probes can diffuse toward each other, and signal amplification through cycling reactions could occur.
The present inventor is not aware of any current practical method to carry out and identify such multiple simultaneous hybridization reactions in liquid phase using a large collection of probes and targets. The lack of a rapid and effective way to specifically tag a large number of probes for subsequent identification hampers one from determining which probes successfully hybridize to target nucleic acid. The problems are compounded if a large collection of long probes is desired.
The present invention overcomes many of the limitations discussed above. Specifically, this invention permits the simultaneous identification of a large subset of target molecules out of a very large collection of similar or dissimilar molecular species. The present invention can be used to create tagged molecules that identify any collection of molecular species. For example, collections of peptides, antibodies, nucleic acids, or other chemical structures could be identified by tagged molecules using the methods described herein.
According to certain embodiments, the present invention provides an advantageous method to “bar code” collections of probes or analytes for use in a liquid phase hybridization reaction. In addition, certain embodiments of this invention provide tagged probes that are able to detect small changes or mutations in the target specimen. Certain embodiments of the present invention also permit such probes or analytes to detect the levels of a large number of different target species within a population of target species.
Specifically, in particular embodiments, this invention permits more rapid sequencing of large amounts of DNA than traditional DNA sequencing techniques. In other embodiments, this invention provides rapid identification of mutations, including substitutions, insertions, and/or deletions in target nucleic acid populations. The use of these embodiments to target genes, such as cancer or cystic fibrosis genes, would be useful in permitting a greater understanding of these disease states as well as identifying specific mutations present in any given individual. In other embodiments, this invention allows rapid monitoring of relative expression levels of a large population of mRNA molecules. This information would be valuable for assessing physiologic or disease states. For example, one can assess the dynamics of different cell types or cell states by analyzing relative mRNA concentrations. In yet other embodiments, the invention permits simultaneous and quick identification of many molecules produced in a combinatorial synthesis library without prior separation of the molecules or their tags.
In carrying out embodiments of the present invention, liquid phase detection can involve either short or long tagged nucleic acid probes or tags. According to embodiments that use the tags for identifying combinatorial synthesis molecules, the present invention employs unique molecular weights or unique lengths of the nucleic acid tags such that any number of molecules can be identified simultaneously and accurately without prior separation of each of the tags and/or molecules. Each weight or length will encode not only the identity of the building blocks used to make each molecule in the library, but also the order of synthesis used to make the molecule.
According to other embodiments, the invention provides methods of changing the genetic code of different nucleic acid sequences to another unique code for each unique sequence. The other unique code is designed such that it allows one to simultaneously and accurately determine the nucleic acid sequence without prior separation of the different nucleic acid sequences. The unique code is also called a tag.
In certain preferred embodiments, the unique code or tag can encode up to 4^20 different sequences, which allows one to determine simultaneously any possible combination of sequences for stretches of up to 20 nucleotides. Certain embodiments may also include encoding longer sequences.
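As a quick check on the arithmetic, the number of distinct sequences of a given length is 4 raised to that length, since each position holds one of four bases; for a 20-nucleotide stretch this is 4 to the power of 20. The function name below is illustrative only.

```python
def possible_sequences(length, alphabet_size=4):
    """Number of distinct sequences of the given length over a four-base alphabet."""
    return alphabet_size ** length

# Count for a 20-nucleotide stretch.
n20 = possible_sequences(20)
print(f"{n20:,}")
```

The result is roughly 1.1 trillion sequences, which indicates the scale of the code space that the tags must cover.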
According to certain embodiments, the unique tags are created using pool and split combinatorial synthesis methods. In contrast to traditional combinatorial synthesis, which creates random libraries of molecules, however, these embodiments use combinatorial synthesis to create specific tags. In other words, the combinatorial synthesis is used to translate the genetic code of different sequences in a sample into a different unique code that facilitates rapid identification of the nucleic acid sequences in the sample in a subsequent decoding step. That subsequent decoding step does not require separation of the different sequences before performing the decoding step, nor does it require one to separately determine each nucleotide of a sequence a single base at a time.
According to certain embodiments, the combinatorial split and pool tag synthesis employs nucleic acid amplification techniques, such as PCR. These techniques are used to selectively amplify particular tags being created based on a particular nucleic acid sequence in the sample. In other words, the amplification procedure allows one to create the new code on the tags in view of specific sequences being amplified in the sample.
In certain embodiments, the present invention employs a variety of different types of tags associated with a single probe or tagged nucleic acid. Thus, for example, a DNA probe can be used to encode the sequence of a target DNA fragment by a combination of tags including (but not limited to): differing base lengths of all or a portion of the probe; fluorescent dyes of different emission wavelengths; biotin (or other affinity molecules) attached to dideoxy nucleotides added to the probe by conventional primer extension reactions; and the pooling of probes with identical nucleotides at identical positions. Other different tags that can be used in combination with the tags above and/or each other include (but are not limited to): molecular weight of all or a portion of a nucleic acid tag or probe; specific order of bases of all or a portion of a tag or probe in general; specific sequences within a tag or probe recognized by binding proteins, restriction enzymes, or other proteins or chemical species; and specific sequences within a tag or probe that can be detected by mass spectroscopy or NMR. Other tagging molecules include (but are not limited to): hapten molecules; molecules identified by their size; fluorescent dyes; radioactive markers; enzymes; affinity reagents; radiofrequency microelectronic devices; atoms that create identifiable NMR spectra; binding energy or “melting temperature” when hybridized with other molecules; dissociation of duplexes formed with other molecules in response to an electric or magnetic field; and ionic residues that are charged or uncharged at various pH values. Another possible tag is segregation into discrete pools. In general, any property or item that is capable of being differentially detected can be used as a tag. In this manner, a variety of tags used in combination with a combinatorial labeling system can be used to exponentially expand the amount of information that can be encoded on a nucleic acid probe.
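A minimal sketch of why combining tag types expands capacity: if each tag type can be read independently of the others, the number of distinguishable probes is the product of the number of states available for each type. The tag categories are taken from the list above; the number of states assigned to each category is invented for illustration and does not come from the source.

```python
from math import prod

# Hypothetical numbers of distinguishable states for each tag type.
tag_states = {
    "probe_length": 50,     # assumed: 50 resolvable lengths
    "fluorescent_dye": 4,   # assumed: 4 non-overlapping dyes
    "affinity_label": 2,    # assumed: label present or absent
    "pool": 16,             # assumed: 16 discrete pools
}

def combined_capacity(states):
    """Distinct tag combinations when every tag type is read independently."""
    return prod(states.values())

capacity = combined_capacity(tag_states)
print(capacity)
```

With these assumed counts, four modest tag types yield 6,400 distinguishable combinations, far more than any single type provides alone.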
According to certain other embodiments, tags or probes that have already been prepared are provided in a kit that allows one to determine mutations, including substitutions, additions, deletions, or other changes in a known wild type nucleic acid sequence. These tags also employ an encoding scheme that changes the genetic code into another code that permits one to analyze short fragments of nucleic acid sequences without requiring one to sequence each nucleotide a single base at a time. The kits will permit the end-user to run one stage of primer extension in parallel with wild type nucleic acid and with test nucleic acid and, then, to compare the products from those reactions on gels. That comparison will show specific differences between the wild type nucleic acid sequence and the test nucleic acid sequence. Those specific differences will allow the end-user to determine not only the identity of the specific changes (the identity of the changed nucleotide if it is a substitution or the identity of an added or deleted nucleotide), but also the location of the base changes in the nucleic acid sequence. The tags or probes can be prepared using the techniques discussed above for the DNA sequencing procedures.
According to certain other embodiments, methods and kits are provided that allow the rapid analysis of mRNA or cDNA populations, which can reveal the relative concentrations of members of the populations. Again, the methods and kits utilize an encoding method that translates the genetic code into another unique code that permits simultaneous analysis of specific nucleic acid sequence fragments within a population of many different nucleic acid fragments. The tags or probes used in these embodiments can be prepared using the techniques discussed above for the DNA sequencing procedures.