A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present specification includes a compact disc labeled "Copy 1" comprising a Computer Program Listing Appendix and an exact duplicate compact disc labeled "Copy 2." Said Computer Program Listing Appendix was created on May 23, 2002 and comprises text listings of the computer programs written in the "C" language "tags.ccp.txt" (11,505 bytes) and "tags895.ccp.txt" (10,895 bytes). The content of the Computer Program Listing Appendix is hereby incorporated by reference in its entirety.
This invention provides sets of nucleic acid tags, arrays of oligonucleotide probes, nucleic acid-tagged sets of recombinant cells and other compositions, and methods of selecting oligonucleotide probe arrays. The invention relates to the selection and interaction of nucleic acids, and nucleic acids immobilized on solid substrates, including related chemistry, biology, and medical diagnostic uses.
Methods of forming large arrays of oligonucleotides and other polymers on a solid substrate are known. Pirrung et al., U.S. Pat. No. 5,143,854 (see also PCT Application No. WO 90/15070), McGall et al., U.S. Pat. No. 5,412,087, Chee et al. SN PCT/US94/12305, and Fodor et al., PCT Publication No. WO 92/10092 describe methods of forming arrays of oligonucleotides and other polymers using, for example, light-directed synthesis techniques.
In the Fodor et al. publication, methods are described for using computer-controlled systems to direct polymer array synthesis. Using the Fodor approach, one heterogeneous array of polymers is converted, through simultaneous coupling at multiple reaction sites, into a different heterogeneous array. See also, Fodor et al. (1991) Science, 251: 767-777; Lipshutz et al. (1995) BioTechniques 19(3): 442-447; Fodor et al. (1993) Nature 364: 555-556; and Medlin (1995) Environmental Health Perspectives 244-246. The arrays are typically placed on a solid surface with an area less than 1 inch², although much larger surfaces are optionally used.
Additional methods applicable to polymer synthesis on a substrate are described, e.g., in U.S. Pat. No. 5,384,261, incorporated herein by reference for all purposes. In the methods disclosed in these applications, reagents are delivered to the substrate by flowing or spotting polymer synthesis reagents on predefined regions of the solid substrate. In each instance, certain activated regions of the substrate are physically separated from other regions when the monomer solutions are delivered to the various reaction sites, e.g., by means of grooves, wells and the like.
Procedures for synthesizing polymer arrays are referred to herein as very large scale immobilized polymer synthesis (VLSIPS(trademark)) procedures. Oligonucleotide VLSIPS(trademark) arrays are useful, for instance, in a variety of procedures for monitoring test nucleic acids in a sample. In probe arrays with multiple probe sets, many distinct hybridization interactions can be monitored simultaneously. However, unwanted hybridization between probes, or between probes and other nucleic acids, can make analysis of multiple hybridizations problematic. This invention solves these and other problems.
With this invention it is now possible to label and detect many individual components present, inter alia, in molecular, cellular and viral libraries using a limited number of hybridization conditions. Components are labeled with specially selected nucleic acid tags, and the presence of individual tags is monitored by hybridization to a probe array (typically a VLSIPS(trademark) array of oligonucleotide probes). Thus, the tag nucleic acids are labels for the individual components, and the probe array provides a label reader which permits simultaneous detection of a very large number of tag nucleic acids. This facilitates massive parallel analysis of all of the components in a mixture in a single assay.
For instance, as explained herein, all of the members of a cellular library can be tested for response to an environmental stimulus using a mixture of all of the members of the cellular library in a single assay. This is accomplished, e.g., by labeling each member of the cellular library, e.g., by cloning a nucleic acid tag into each cell type in the library, mixing each cell type in the library in an appropriate solution, and exposing part of the solution to the selected environmental stimulus. The distribution of nucleic acids in the library before and after the environmental stimulus is compared by hybridization of the nucleic acids to a VLSIPS(trademark) array, allowing for detection of cells which are specifically affected by the environmental stimulus.
Accordingly, the present invention provides, inter alia, tag nucleic acids, sets of tag nucleic acids, methods of selecting tag nucleic acids, libraries of cells, viruses or the like containing tag nucleic acids, arrays of oligonucleotide probes, arrays of VLSIPS(trademark) probes, methods of selecting arrays of oligonucleotide probes, methods of detecting tag nucleic acids with VLSIPS(trademark) arrays and other features which will become clear upon further reading.
In one class of embodiments, the invention provides a method of selecting a set of tag nucleic acids designed for minimal cross hybridization to a VLSIPS(trademark) array. The absence of cross hybridization facilitates analysis of hybridization patterns to VLSIPS(trademark) arrays, because it reduces ambiguities in the interpretation of hybridization results which arise due to multiple nucleic acid species binding to a single species of probe on the VLSIPS(trademark) array. Thus, in the selection methods of the invention, potential tags are excluded from the set of tags where they bind to the same nucleic acid as selected tags under stringent conditions. The selection methods typically include the steps of selecting a specific thermal binding stability for the tag nucleic acids against complementary probes, and excluding tags which contain self-complementary regions. Often, the thermal binding stability of the tags is selected by specifying parameters which influence binding stability, such as the length and base composition (e.g., by selecting tags with the same AT to GC ratio of nucleotides) for the tag nucleic acids. In this regard, tags which form more GC bonds upon binding a complementary probe require fewer overall bases to have the same binding stability with a complementary probe as tags which have fewer GC residues. Binding stability is also affected by base stacking interactions, the formation of secondary structures and the choice of solvent in which a tag is bound to a probe.
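The first two selection steps described above (fixing length and base composition, then excluding tags with self-complementary regions) can be illustrated with a minimal Python sketch. It is not the program of the Computer Program Listing Appendix. The function names, the default of 10 G+C bases in a 20-mer, and the 4-nucleotide stem used to detect self-complementarity are assumptions chosen for illustration only.

```python
# Illustrative sketch only; names and default thresholds are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_count(seq: str) -> int:
    """Count G and C residues, a simple proxy for binding stability."""
    return sum(base in "GC" for base in seq)


def has_self_complementary_region(seq: str, stem: int = 4) -> bool:
    """True if the reverse complement of any stem-length window occurs
    at the same position or further along the sequence."""
    for i in range(len(seq) - stem + 1):
        target = reverse_complement(seq[i:i + stem])
        if seq.find(target, i) != -1:
            return True
    return False


def select_candidates(tags, length=20, gc=10, stem=4):
    """Keep tags of the chosen length and G+C count that show no
    self-complementary region of the given stem length."""
    return [
        tag for tag in tags
        if len(tag) == length
        and gc_count(tag) == gc
        and not has_self_complementary_region(tag, stem)
    ]
```

Fixing both the length and the G+C count is one simple way of holding the AT to GC ratio constant across a candidate set. It does not model base stacking or solvent effects, which the text notes also affect binding stability.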
The size of the tags can vary substantially, but is typically from about 8-150 nucleotides, more typically between 10 and 100 nucleotides, often between about 15 and 30 nucleotides, generally between about 15 and 25 nucleotides and, in one preferred embodiment, about 20 nucleotides in length. In a few applications, the tags are substantially longer than the probes to which they hybridize. The use of longer tags increases the number of tags from which non-cross hybridizing probes can be selected.
The tag nucleic acids are optionally selected to have constant and variable regions, which facilitates elimination of secondary structure arising from self-complementarity, and provides structural features for cloning and amplifying the tags. For instance, PCR binding sites or restriction enzyme sites are optionally incorporated into constant regions in the tags. In other embodiments, short constant regions are added in coding theory methods to prevent misalignment of the tags. Constant regions are optionally cleaved from the tag during processing steps, for instance by cleaving the tag nucleic acids with class II restriction enzymes.
Often it is desirable to eliminate tags which contain runs of 4 nucleotides selected from the group consisting of 4 X residues, 4 Y residues and 4 Z residues, where X is selected from the group consisting of G and C, Y is selected from the group consisting of G and A, and Z is selected from the group consisting of A and T. The elimination of tags from a tag set which contain such runs of nucleotides reduces the formation of secondary structure in the selected tags in the tag set. In some embodiments, certain runs are permitted, while others are excluded. For instance, in one embodiment, runs of 4 A/T or G/C nucleotides are prohibited.
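The run-exclusion rule above can be expressed as a pattern match. The following Python sketch is illustrative only; the function name and the `classes` parameter are assumptions. The parameter lets certain runs be permitted while others are excluded, as in the embodiment that prohibits only A/T and G/C runs.

```python
import re


def has_prohibited_run(tag: str, classes=("GC", "GA", "AT"), run: int = 4) -> bool:
    """True if the tag contains `run` consecutive residues drawn from any
    one of the given residue classes (X = G/C, Y = G/A, Z = A/T)."""
    pattern = "|".join(f"[{cls}]{{{run}}}" for cls in classes)
    return re.search(pattern, tag) is not None


def remove_tags_with_runs(tags, classes=("GC", "GA", "AT"), run: int = 4):
    """Return only the tags that contain no prohibited run."""
    return [tag for tag in tags if not has_prohibited_run(tag, classes, run)]
```

For example, `has_prohibited_run("CTAGAGCT")` is true because of the G/A run AGAG, while the same tag passes when only `classes=("GC", "AT")` are prohibited.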
In many embodiments, tags which differ by fewer than about 80% of the total number of nucleotides which comprise the tags are excluded. For instance, all selected tags in a selected tag set preferably differ by at least about 4-5 nucleotides. It is also desirable to exclude tags which share substantial regions of sequence identity, because the regions of identity can cross-hybridize to nucleic acids which have subsequence complementary to the region of identity. For instance, where 20-mer tags are identical over regions of 9 or more nucleotides, they are typically excluded.
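The exclusion of tags that share a region of sequence identity can be sketched as a comparison of fixed-length subsequences. The Python below is a minimal illustration, assuming a greedy pass in which a candidate is kept only if it shares no 9-nucleotide subsequence with any tag already kept. The function names and the greedy ordering are assumptions, not the method of the appendix programs.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all k-nucleotide subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def share_identity_region(a: str, b: str, k: int = 9) -> bool:
    """True if tags a and b are identical over some region of k nucleotides."""
    return not kmers(a, k).isdisjoint(kmers(b, k))


def filter_shared_regions(tags, k: int = 9):
    """Greedily keep tags that share no k-nucleotide region with a kept tag."""
    kept, seen = [], set()
    for tag in tags:
        tag_kmers = kmers(tag, k)
        if seen.isdisjoint(tag_kmers):
            kept.append(tag)
            seen |= tag_kmers
    return kept
```

Because the pass is greedy, the result depends on the order in which candidates are presented; a different ordering can yield a different, equally valid set.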
The tags in the tag sets of the invention typically differ by at least two nucleotides, and preferably by 3-5 nucleotides for a typical 20-mer. A list of tags which differ by at least two nucleotides can be generated by pairwise comparison of each tag, or by other methods. For instance, the tag sequences can be aligned for maximal correspondence and tags with a single mismatch discarded. In one class of embodiments, the number of A+G nucleotides in each of the variable regions of each of the tags is selected to be even (or, alternatively, odd), providing a "parity base" or "error correcting base" which provides that each tag have at least two hybridization mismatches between every tag in the tag set, and any individual complementary nucleic acid probe (other than the probe which is a perfect complement to the tag). Other methods of ensuring that at least two mismatches exist between every tag in a tag set and any individual hybridization probe are also appropriate.
In general, the selection of the tag nucleic acids facilitates selection of the probe nucleic acids, e.g., on VLSIPS(trademark) arrays used to monitor the tag nucleic acids by hybridization. Specifically, the probes on the array are selected for their ability to hybridize to variable sequences in the set of tag nucleic acids (the "variable" region of a tag which does not include a constant region is the entire tag). Thus, all of the rules for selection of tag nucleic acids can be applied to the selection of probe nucleic acids, for example by performing the tag selection steps and then determining the complementary set of probe nucleic acids.
In another class of embodiments, the invention provides compositions comprising sets of tag nucleic acids, which include a plurality of tag nucleic acids. In preferred embodiments, the set of tag nucleic acids comprises from 100-100,000 tags. Typically, a tag set will include between about 500 and 15,000 tags. Usually, the number of tags in a tag set is between about 5,000 and about 14,000 tags. In one preferred embodiment, a set of tags of the invention comprises about 8,000-9,000 tags. The tag sequences typically comprise a variable region, where the variable region for each tag nucleic acid in the set of tag nucleic acids has the same G+C to A+T ratio, approximately the same Tm, the same length and does not cross-hybridize to a single complementary probe nucleic acid. Most typically, the tag nucleic acids in the set of tag nucleic acids cannot be aligned with fewer than two differences between any two of the tag nucleic acids in the set of tag nucleic acids, and often at least 5 differences exist between any pair of tags in a tag set. In one embodiment, the tags also comprise a constant region such as a PCR primer binding site for amplification of the tag.
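The pairwise comparison and the A+G parity condition described above can be sketched in Python as follows. The sketch is illustrative only and assumes tags of equal length compared position by position; the function names and the greedy selection order are assumptions. The parity test is shown as a separate filter, and the minimum number of differences is enforced by the explicit pairwise check, since a count of A+G residues is unchanged when an A is exchanged for a G.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def has_even_purine_count(tag: str) -> bool:
    """True if the number of A+G residues in the tag is even."""
    return sum(base in "AG" for base in tag) % 2 == 0


def min_pairwise_distance(tags) -> int:
    """Smallest number of differences between any two tags in the set."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))


def greedy_select(candidates, min_distance: int = 2):
    """Keep each candidate that differs from every kept tag by at least
    min_distance nucleotides."""
    kept = []
    for tag in candidates:
        if all(hamming(tag, other) >= min_distance for other in kept):
            kept.append(tag)
    return kept
```

Raising `min_distance` to 3, 4 or 5 gives the larger separations described as preferred for a typical 20-mer, at the cost of a smaller selected set.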
In one class of embodiments, the invention provides a method of labeling a composition, comprising associating a tag nucleic acid with the composition, wherein the tag nucleic acid is selected from a group of tag nucleic acids which do not cross-hybridize and which have a substantially similar Tm. Typically, the tag labels are detected with a VLSIPS(trademark) array which comprises probes complementary to the tags used to label the composition.
As described herein, preferred compositions include constituents of cellular, viral or molecular libraries such as recombinant cells, recombinant viruses or polymers. However, one of skill will readily appreciate that other compositions can also be labeled using the nucleic acid tags of the invention, and the tags detected using VLSIPS(trademark) arrays. For instance, high denomination currency can be labeled with a set of nucleic acid tags, and counterfeits detected by monitoring hybridization of a wash of the currency (or, e.g., a PCR amplification of attached nucleic acids which encode tag sequences) with an appropriate VLSIPS(trademark) array.
In another class of embodiments, the invention provides methods of pre-selecting experimental probes in an oligonucleotide probe array, wherein the probes have substantially uniform hybridization properties and do not cross hybridize to a target tag nucleic acid. In the methods, a ratio of G+C to A+T nucleotides shared by the experimental probes in the array is selected and all possible 4 nucleotide subsequences for the probes of the array are determined. All potential probes from the array which contain prohibited 4 nucleotide sub-sequences are excluded from the experimental probes of the array. 4 nucleotide subsequences are prohibited when the nucleotide subsequences are selected from the group consisting of self-complementary probes, A4 probes, T4 probes, and [G,C]4 probes. Also, where the target tag nucleic acid comprises a constant region, all probes complementary to the constant region sub-sequence of the target tag nucleic acid are prohibited, and not present in the tag set. Typically, a length for the probes in the array is selected, although non-hybridizing portions of the probe (i.e., nucleotides which do not hybridize to a target nucleic acid) optionally vary between different classes of probes. "Experimental probes" hybridize to a target tag nucleic acid, while "control" probes either do not hybridize to a target tag nucleic acid, or bind to a nucleic acid which has hybridization properties which differ from those of the target tag nucleic acids in a tag nucleic acid set. For instance, control probes are optionally used in VLSIPS(trademark) arrays to check hybridization stringency against a known nucleic acid.
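The pre-selection steps above (choosing a G+C content, determining all possible 4 nucleotide subsequences, and excluding probes that contain a prohibited one) can be sketched in Python. The sketch is illustrative only: the function names are assumptions, the G+C ratio is represented as a fixed G+C count for probes of equal length, and the exclusion of probes complementary to a constant region is omitted.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def prohibited_4mers() -> set:
    """Enumerate all 256 4-nucleotide subsequences and collect those that
    are self-complementary, A4, T4, or composed only of G and C."""
    prohibited = set()
    for bases in product("ACGT", repeat=4):
        sub = "".join(bases)
        if (sub == reverse_complement(sub)
                or sub in ("AAAA", "TTTT")
                or set(sub) <= {"G", "C"}):
            prohibited.add(sub)
    return prohibited


def probe_is_allowed(probe: str, prohibited: set) -> bool:
    """True if no 4-nucleotide window of the probe is prohibited."""
    return all(probe[i:i + 4] not in prohibited for i in range(len(probe) - 3))


def preselect_probes(candidates, gc: int):
    """Keep candidates with the chosen G+C count and no prohibited window."""
    prohibited = prohibited_4mers()
    return [
        p for p in candidates
        if sum(base in "GC" for base in p) == gc
        and probe_is_allowed(p, prohibited)
    ]
```

Under these rules 30 of the 256 possible 4 nucleotide subsequences are prohibited: 16 self-complementary, 16 composed only of G and C (4 of which are also self-complementary), plus AAAA and TTTT.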
In one class of methods of the invention, a plurality of test nucleic acids are simultaneously detected in a sample. In the methods, an array of experimental probes which do not cross hybridize to a target under stringent conditions is used to detect the target nucleic acids. Typically, the ratio of G+C bases in each experimental probe is substantially identical. The probes of the array are arranged into probe sets in which each probe set comprises a homogeneous population of oligonucleotide probes. For example, many individual probes with the same nucleotide sequence are arranged in proximity to one another on the surface of an array to form a particular geometric shape. Probe sets are arranged in proximity to each other to form an array of probes. For instance, where the probe array is a VLSIPS(trademark) array, the probe sets are optionally arranged into squares on the surface of a substrate, forming a checkerboard pattern of probe sets on the substrate.
The probes of the array specifically hybridize to at least one test nucleic acid in the sample under stringent hybridization conditions. The method further comprises detecting hybridization of the test nucleic acids to the array of oligonucleotide probes. Typically, the test nucleic acids comprise tag sequences, which bind to the experimental probes of the array.
In one class of embodiments, the invention provides an array of oligonucleotide probes comprising a plurality of experimental oligonucleotide probe sets attached to a solid substrate, wherein each experimental oligonucleotide probe set in the array hybridizes to a different target nucleic acid under stringent hybridization conditions. Each experimental oligonucleotide probe in the probe sets of the array comprises a constant region and a variable region. The variable region does not cross hybridize with the constant region under stringent hybridization conditions, and the nucleic acid probes do not cross-hybridize to target nucleic acids. Typically, the probes from each probe set differ from the probes of every other probe set in the array by the arrangement of at least two nucleotides in the probes of the probe set. Generally, the ratio of G+C bases in each probe for each experimental probe set is substantially identical (meaning that the G+C ratio does not vary by more than 5%), thereby assuring that they hybridize to a target with similar avidity under similar hybridization conditions. The arrays optionally comprise control probes, e.g., to assess hybridization conditions by monitoring binding of a known quantitated nucleic acid to the control probe.
In another class of embodiments, the invention provides a plurality of recombinant cells or recombinant viruses comprising tag nucleic acids, which tag nucleic acids comprise a constant region and a variable region. Typically, the variable region for each tag nucleic acid in the set of tag nucleic acids has approximately the same Tm, (e.g., the same G+C to A+T ratio and the same length) and does not cross-hybridize to a probe nucleic acid. Different tag nucleic acids in the set of tag nucleic acids found in the different recombinant cells cannot be aligned with less than two differences between the tag nucleic acids. Generally, the recombinant cells are selected from a library of genetically distinct recombinant cells (eukaryotic, prokaryotic or archaebacterial) or viruses. For example, in one class of preferred embodiments, the cells are yeast cells. In another class of preferred embodiments, the cells are of mammalian origin.
The present invention provides arrays of oligonucleotides attached to solid substrates. Typically, the oligonucleotide probes in the array are arranged into probe sets at defined locations in the array to enhance signal processing of hybridization reactions between the oligonucleotide probes and test nucleic acids in a sample. The oligonucleotide arrays can have virtually any number of different oligonucleotide sets, determined largely by the number or variety of test nucleic acids or nucleic acid tags to be screened against the array in a given application. In one group of embodiments, the array has from 10 up to 100 oligonucleotide sets. In other groups of embodiments, the arrays have between 100 and 10,000 sets. In certain embodiments, the arrays have between 10,000 and 100,000 sets, and in yet other embodiments the arrays have between 100,000 and 1,000,000 sets. Most preferred embodiments will have between 7,500 and 12,500 sets. For example, in one preferred embodiment, the arrays will comprise about 8,000 sets of oligonucleotide probes. In preferred embodiments, the array will have a density of more than 100 sets of oligonucleotides at known locations per cm², or more preferably, more than 1000 sets per cm². In some embodiments, the arrays have a density of more than 10,000 sets per cm².
The present invention also provides kits embodying the inventive concepts outlined above. For example, kits of the invention comprise any of the arrays, cells, libraries or tag sets described herein. Also, because the methods of using the arrays and tags optionally include PCR, LCR and other in vitro amplification techniques for amplifying tag nucleic acids, the kits of the invention optionally include reagents for practicing in vitro amplification methods such as Taq polymerase, nucleotides, computer software with tag selection programs and the like. The kits also optionally comprise nucleic acid labeling reagents, instructions, containers and other items that will be apparent to one of skill upon further reading.