The relationship between structure and function of macromolecules is of fundamental importance in the understanding of biological systems. Such relationships are important to understanding, for example, the functions of enzymes, structural proteins, and signalling proteins, the ways in which cells communicate with one another, the mechanisms of cellular control and metabolic feedback, etc.
Genetic information is critical to the continuation of life processes. Life is substantially information-based, and genetic content controls the growth and reproduction of the organism and its complements. Proteins, which are critical features of all living systems, are encoded by the genetic materials of the cell. More particularly, the properties of enzymes, functional proteins and structural proteins are determined by the sequence of amino acids from which they are made. As such, it has become very important to determine the genetic sequences of nucleotides which encode the enzymes, structural proteins and other effectors of biological functions. In addition to the segments of nucleotides which encode polypeptides, there are many nucleotide sequences which are involved in the control and regulation of gene expression.
The Human Genome Project is an example of a project that is directed toward determining the complete sequence of the genome of the human organism. Although such a sequence would not necessarily correspond to the sequence of any specific individual, it will provide significant information as to the general organization and specific sequences contained within genomic segments from particular individuals. It will also provide mapping information useful for further detailed studies. The need for highly rapid, accurate, and inexpensive sequencing technology is nowhere more apparent than in a demanding sequencing project such as this. Completing the sequencing of a human genome will require the determination of approximately 3×10⁹, or 3 billion, base pairs.
The procedures typically used today for sequencing include the methods described in Sanger, et al., Proc. Natl. Acad. Sci. USA 74:5463-5467 (1977), and Maxam, et al., Methods in Enzymology 65:499-559 (1980). The Sanger method utilizes enzymatic elongation with chain terminating dideoxy nucleotides. The Maxam and Gilbert method uses chemical reactions exhibiting specificity of reactants to generate nucleotide specific cleavages. Both methods, however, require a practitioner to perform a large number of complex, manual manipulations. For example, such methods usually require the isolation of homogeneous DNA fragments, elaborate and tedious preparation of samples, preparation of a separating gel, application of samples to the gel, electrophoresing the samples on the gel, working up the finished gel, and analysis of the results of the procedure.
Alternative techniques have been proposed for sequencing a nucleic acid. PCT patent Publication No. 92/10588, incorporated herein by reference for all purposes, describes one improved technique in which the sequence of a labeled, target nucleic acid is determined by hybridization to an array of nucleic acid probes on a substrate. Each probe is located at a positionally distinguishable location on the substrate. When the labeled target is exposed to the substrate, it binds at locations that contain complementary nucleotide sequences. Through knowledge of the sequence of the probes at the binding locations, one can determine the nucleotide sequence of the target nucleic acid. The technique is particularly efficient when very large arrays of nucleic acid probes are utilized. Such arrays can be formed according to the techniques described in U.S. Pat. No. 5,143,854 issued to Pirrung, et al.; see also U.S. application Ser. No. 07/805,727. Both are incorporated herein by reference for all purposes.
When the nucleic acid probes are of a length shorter than the target, one can employ a reconstruction technique to determine the sequence of the larger target based on affinity data from the shorter probes. See, U.S. Pat. No. 5,202,231 issued to Drmanac, et al., and PCT patent Publication No. 89/10977 issued to Southern. One such technique has been termed sequencing by hybridization, or SBH. Assume, for example, that a 12-mer target DNA, i.e., 5′-AGCCTAGCTGAA (SEQ ID NO:1), is mixed with an array of all octanucleotide probes. If the target binds only to those probes having an exactly complementary nucleotide sequence, only five of the 65,536 possible octamer probes (i.e., 3′-TCGGATCG, CGGATCGA, GGATCGAC, GATCGACT, and ATCGACTT) will hybridize to the target. Alignment of the overlapping sequences from the hybridizing probes reconstructs the complement of the original 12-mer target:
TCGGATCG
 CGGATCGA
  GGATCGAC
   GATCGACT
    ATCGACTT
TCGGATCGACTT (SEQ ID NO: 2)
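The overlap alignment in this example can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any of the cited references: the function names (complement, hybridizing_probes, reconstruct) are hypothetical, hybridization is idealized as exact complementarity only, and the reconstruction assumes that every 7-base overlap between probes is unique, which holds for this 12-mer but not for targets containing repeats.

```python
# Idealized sequencing-by-hybridization (SBH) sketch.
# Assumptions (not from the source): perfect-match hybridization only,
# and a target in which every (k-1)-base overlap is unique.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)


def hybridizing_probes(target, k=8):
    """Return the k-mer probes exactly complementary to some window of target.

    Probes are written 3'->5' so that they align with the 5'->3' target.
    """
    comp = complement(target)
    return {comp[i:i + k] for i in range(len(comp) - k + 1)}


def reconstruct(probes):
    """Chain probes that overlap by k-1 bases into one sequence."""
    remaining = set(probes)
    k = len(next(iter(remaining)))
    suffixes = {p[1:] for p in remaining}
    # The first probe is the one whose (k-1)-base prefix follows no other probe.
    starts = [p for p in remaining if p[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting probe")
    seq = starts[0]
    remaining.remove(seq)
    while remaining:
        tail = seq[-(k - 1):]
        nxt = [p for p in remaining if p[:-1] == tail]
        if len(nxt) != 1:
            raise ValueError("ambiguous or broken overlap chain")
        seq += nxt[0][-1]  # each overlapping probe extends the chain by one base
        remaining.remove(nxt[0])
    return seq


if __name__ == "__main__":
    target = "AGCCTAGCTGAA"
    probes = hybridizing_probes(target)
    print(len(probes), "of", 4 ** 8, "octamer probes hybridize")
    rebuilt = reconstruct(probes)
    print("reconstructed complement:", rebuilt)
    print("inferred target:", complement(rebuilt))
```

For the 12-mer above, the sketch selects the five octamers listed in the text and chains them into the 12-base complement, from which the target follows by complementation. Raising an error on an ambiguous overlap, rather than guessing, reflects the fact that real SBH data with mismatched hybrids or repeated subsequences needs a more careful reconstruction.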
Although such techniques have been quite useful, it would be helpful to have additional methods which can effectively discriminate between fully complementary hybrids and those that differ by one or more base pairs.
In addition to knowing the genetic sequences of the nucleotides which encode the enzymes, structural proteins and other effectors of biological functions, it is important to know how such species interact. A number of biochemical processes involve the interaction of some species, e.g., a drug, a peptide or protein, or RNA, with double-stranded DNA. For example, protein/DNA binding interactions are involved with a number of transcription factors as well as with tumor suppression associated with the p53 protein and the genes contributing to a number of cancer conditions. As such, it would be advantageous to have methods for preparing libraries of diverse double-stranded nucleic acid sequences and probes which can be used, for example, in screening studies for the determination of binding affinity exhibited by binding proteins, drugs or RNA.
Methods of synthesizing desired single stranded DNA sequences are well known to those of skill in the art. In particular, methods of synthesizing oligonucleotides are found in, for example, Oligonucleotide Synthesis: A Practical Approach, Gait, ed., IRL Press, Oxford (1984), incorporated herein by reference in its entirety for all purposes. Synthesizing unimolecular double-stranded DNA in solution has also been described. See, Durand, et al., Nucleic Acids Res. 18:6353-6359 (1990) and Thomson, et al., Nucleic Acids Res. 21:5600-5603 (1993), the disclosures of both being incorporated herein by reference.
Solid-phase synthesis of biological polymers has been evolving since the early “Merrifield” solid-phase peptide synthesis, described in Merrifield, J. Am. Chem. Soc. 85:2149-2154 (1963), incorporated herein by reference for all purposes. Solid-phase synthesis techniques have been provided for the synthesis of several peptide sequences on, for example, a number of “pins.” See, e.g., Geysen, et al., J. Immun. Meth. 102:259-274 (1987), incorporated herein by reference for all purposes. Other solid-phase techniques involve, for example, synthesis of various peptide sequences on different cellulose disks supported in a column. See, Frank and Döring, Tetrahedron 44:6031-6040 (1988), incorporated herein by reference for all purposes. Still other solid-phase techniques are described in U.S. Pat. No. 4,728,502 issued to Hamill and WO 90/00626 (Beattie, inventor). Unfortunately, each of these techniques produces only a relatively low density array of polymers. For example, the technique described in Geysen, et al. is limited to producing 96 different polymers on pins spaced in the dimensions of a standard microtiter plate.
Improved methods of forming large arrays of oligonucleotides, peptides and other polymer sequences in a short period of time have been devised. Of particular note, Pirrung, et al., U.S. Pat. No. 5,143,854 (see also PCT Application No. WO 90/15070) and Fodor, et al., PCT Publication No. WO 92/10092, all incorporated herein by reference, disclose methods of forming vast arrays of peptides, oligonucleotides and other polymer sequences using, for example, light-directed synthesis techniques. See also, Fodor, et al., Science, 251:767-777 (1991), incorporated herein by reference for all purposes. These procedures are now referred to as VLSIPS™ procedures.
More particularly, in the Fodor, et al., PCT application, an elegant method is described for using a computer-controlled system to direct a VLSIPS™ procedure. Using this approach, one heterogeneous array of polymers is converted, through simultaneous coupling at a number of reaction sites, into a different heterogeneous array. See, U.S. application Ser. Nos. 07/796,243 and 07/980,523, the disclosures of which are incorporated herein for all purposes.
Although such techniques have been quite useful, it would be advantageous to have additional methods for preparing libraries of diverse double-stranded nucleic acid sequences and probes which can be used, for example, in screening studies for the determination of binding affinity exhibited by binding proteins, drugs or RNA.