The relationship between structure and function of macromolecules is of fundamental importance in the understanding of biological systems. These relationships are important to understanding, for example, the functions of enzymes, structural proteins and signalling proteins, ways in which cells communicate with each other, as well as mechanisms of cellular control and metabolic feedback.
Genetic information is critical to the continuation of life processes. Life is substantially informationally based, and its genetic content controls the growth and reproduction of the organism and its complements. The amino acid sequences of polypeptides, which are critical features of all living systems, are encoded by the genetic material of the cell. Further, the properties of these polypeptides, e.g., as enzymes, functional proteins, and structural proteins, are determined by the sequence of amino acids which make them up. As structure and function are integrally related, many biological functions may be explained by elucidating the underlying structural features which provide those functions, and these structures are determined by the underlying genetic information in the form of polynucleotide sequences. Further, in addition to encoding polypeptides, polynucleotide sequences also can be involved in control and regulation of gene expression. It therefore follows that the determination of the make-up of this genetic information has achieved significant scientific importance.
Physical maps of genomic DNA assist in establishing the relationship between genetic loci and the DNA fragments which carry these loci in a clone library. Physical maps include “hard” maps, which are overlapping cloned DNA fragments (“contigs”) ordered as they are found in the genome of origin, and “soft” maps, which consist of long range restriction enzyme and cytogenetic maps (Stefton and Goodfellow, 1992). In the latter case, the combination of rare-cutting restriction endonucleases (e.g., NotI) and pulsed-field gel electrophoresis allows for the large scale mapping of genomic DNAs. These methods provide a low resolution or top down approach to genomic mapping.
A bottom up approach is exemplified by construction of contiguous or “contig” maps. Initial attempts to construct contig maps for the human genome have been based upon ordering inserts cloned into cosmids. More recent studies have utilized yeast artificial chromosomes (YACs) which allow for cloning larger inserts. The construction of contig maps requires that many clones be examined (4-5 genome equivalents) in order to assure that sufficient overlap between clones is achieved. Currently, four approaches are used to identify overlapping sequences.
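The scale implied by examining 4-5 genome equivalents can be illustrated with a short calculation. The following is a minimal sketch only: the genome size (3,000,000,000 bp) and cosmid insert size (40,000 bp) are illustrative assumptions not taken from the text above, and the coverage estimate assumes that clones fall at random, uniformly distributed positions, which real libraries only approximate.

```python
import math


def clones_needed(genome_bp, insert_bp, coverage):
    """Number of clones whose combined insert length equals
    `coverage` genome equivalents."""
    return math.ceil(coverage * genome_bp / insert_bp)


def expected_fraction_covered(coverage):
    """Expected fraction of the genome present in at least one clone,
    assuming random, uniform clone placement (Poisson approximation)."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # Illustrative values only (assumptions, not from the source text).
    genome_bp = 3_000_000_000
    insert_bp = 40_000
    for coverage in (4, 5):
        n = clones_needed(genome_bp, insert_bp, coverage)
        frac = expected_fraction_covered(coverage)
        print(f"{coverage}x: {n} clones, ~{frac:.1%} of genome covered")
```

Under these assumptions, even five genome equivalents leaves a small fraction of the genome unrepresented, which is one reason the overlap detection methods described below must be applied to very large numbers of clones.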
The first method is restriction enzyme fingerprinting. This method involves the electrophoretic sizing of restriction enzyme generated DNA fragments for each clone and establishing a criterion for clone overlap based on the similarity of fragment sets produced for each clone. The sensitivity and specificity of this approach have been improved by labelling of fragments using ligation and end-filling techniques. The detection of repetitive sequence elements (e.g., [GT]n) has also been employed to provide characteristic markers.
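The overlap criterion used in fingerprinting can be sketched as a comparison of fragment size lists. The following is a hypothetical illustration, not a published scoring scheme: the function names, the 2% sizing tolerance, and the 50% shared-fragment threshold are all assumptions chosen for the example.

```python
def shared_fragments(sizes_a, sizes_b, tolerance=0.02):
    """Count fragments of clone A that match a fragment of clone B
    within a relative sizing tolerance. Each fragment of B is
    matched at most once (simple greedy matching)."""
    remaining = sorted(sizes_b)
    shared = 0
    for size in sorted(sizes_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[i]
                break
    return shared


def clones_overlap(sizes_a, sizes_b, min_fraction=0.5, tolerance=0.02):
    """Call two clones overlapping when the shared fragments make up
    at least `min_fraction` of the smaller fingerprint."""
    smaller = min(len(sizes_a), len(sizes_b))
    if smaller == 0:
        return False
    return shared_fragments(sizes_a, sizes_b, tolerance) / smaller >= min_fraction


if __name__ == "__main__":
    # Hypothetical fragment sizes (bp) read from a gel for two clones.
    clone_a = [1200, 3400, 560, 8000]
    clone_b = [1210, 3390, 700, 9500, 560]
    print(shared_fragments(clone_a, clone_b))
    print(clones_overlap(clone_a, clone_b))
```

The tolerance stands in for the limited sizing accuracy of gel electrophoresis. Because fragments of similar size can arise by chance from unrelated clones, the threshold trades sensitivity against specificity.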
The second method generally employed in mapping applications is the binary scoring method. This method involves the immobilization of members of a clone library to filters and hybridization with sets of oligonucleotide probes. Several mathematical models have been developed to avoid the need for large numbers of the probe sets which are designed to detect the overlap regions.
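The binary scoring idea can be sketched as follows. This is a simplified illustration under stated assumptions: hybridization is modelled as an exact match of the probe, or its reverse complement, within the clone sequence, and the clone and probe sequences are invented for the example. Real hybridization depends on stringency and tolerates mismatches, which this sketch does not model.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def hybridization_signature(clone_seq, probes):
    """Binary score per probe: 1 if the probe sequence occurs on
    either strand of the clone, otherwise 0."""
    return tuple(
        1 if (p in clone_seq or reverse_complement(p) in clone_seq) else 0
        for p in probes
    )


def shared_positives(sig_a, sig_b):
    """Number of probes scoring positive for both clones."""
    return sum(a & b for a, b in zip(sig_a, sig_b))


if __name__ == "__main__":
    # Hypothetical probes and clone inserts (illustrative only).
    probes = ["GTACG", "TTAGC", "GGATA", "GGGGC"]
    clones = {
        "a": "ATGCGTACGTTAGC",
        "b": "CGTTAGCCGGATAA",  # shares CGTTAGC with clone a
        "c": "TTTTGGGGCCCCAA",
    }
    sigs = {name: hybridization_signature(s, probes) for name, s in clones.items()}
    for name, sig in sigs.items():
        print(name, sig)
    print(shared_positives(sigs["a"], sigs["b"]))
```

Clones that share positive probes are candidates for overlap. The mathematical models referred to above address how many probes are needed for such shared scores to be statistically meaningful.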
A third method is the Sequence Tagged Site (“STS”) method. This method employs PCR techniques and gel analysis to generate DNA products whose lengths characterize them as being related to common regions of sequence that are present in overlapping clones. The sequences of the primer pairs and the characteristic distance between them provide sufficient information to establish a single copy landmark (SCL) which is analogous to single copy probes that are unique in the entire genome.
A fourth method uses cross-hybridizing libraries. This method involves the immobilization of two or more pools of cosmid libraries followed by cross-hybridization experiments between pairs of the libraries. This cross-hybridization demonstrates shared cloned sequences between the library pairs. See, e.g., Kupfer, et al., (1995) Genomics 27:90-100.
Although each of these methods is capable of generating useful physical maps of genomic DNA, they each involve complex series of reaction steps including multiple independent synthesis, labelling and detection procedures.
Traditional restriction endonuclease mapping techniques, i.e., as described above, typically utilize restriction enzyme recognition/cleavage sites as genetic markers. These methods generally employ Type-II restriction endonucleases, e.g., EcoRI, HindIII and BamHI, which will typically recognize specific palindromic nucleotide sequences, or restriction sites, within the polynucleotide sequence to be mapped, and cleave the sequence at that site. The restriction fragments which result from the cleavage of separate fragments of the polynucleotide (i.e., from a prior digestion) are then separated by size. Overlap is shown where restriction fragments of the same size appear from Type-II endonuclease digestion of separate polynucleotide fragments.
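Type-II digestion as described above can be sketched in a few lines. The example uses EcoRI, which recognizes GAATTC and cleaves the top strand after the first G. The input sequence is invented for illustration, the molecule is treated as linear, and only the top-strand cut position is tracked, so the sticky ends themselves are not modelled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_palindromic(site):
    """True if the site reads the same on both strands (5' to 3')."""
    return site == reverse_complement(site)


def digest(seq, site, cut_offset):
    """Cut a linear sequence at every occurrence of `site`.
    `cut_offset` is the top-strand cut position relative to the
    start of the site (1 for EcoRI, G^AATTC)."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    seq = "AAGAATTCTTTTGAATTCCC"  # hypothetical sequence with two EcoRI sites
    fragments = digest(seq, "GAATTC", cut_offset=1)
    print(is_palindromic("GAATTC"))
    print([len(f) for f in fragments])
```

Sorting the resulting fragment lengths gives the size pattern that would be compared between clones to infer overlap.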
Type-IIs endonucleases, on the other hand, generally recognize non-palindromic sequences. Further, these endonucleases generally cleave outside of their recognition site, thus producing overhangs of ambiguous base pairs. Szybalski, 1985, Gene 40:169-173. Additionally, because a non-palindromic recognition sequence can occur in either orientation on the double-stranded polynucleotide, the use of a Type-IIs endonuclease will generate more markers per kb than a Type-II endonuclease with a recognition site of similar length, e.g., approximately twice as many.
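The reason a non-palindromic site yields roughly twice as many markers can be shown by scanning one strand for a site in both orientations. The sketch below uses the FokI recognition sequence GGATG as the example site; the test sequences are invented, and the function reports recognition site positions only, not the displaced cleavage positions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def site_positions(seq, site):
    """Top-strand start positions at which `site` is present on
    either strand. A palindromic site contributes one search
    pattern; a non-palindromic site contributes two."""
    patterns = {site, reverse_complement(site)}
    hits = []
    for pattern in patterns:
        start = seq.find(pattern)
        while start != -1:
            hits.append(start)
            start = seq.find(pattern, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical sequence carrying GGATG once in each orientation.
    print(site_positions("TTGGATGAACATCCTT", "GGATG"))
    # A palindromic site is found through a single pattern.
    print(site_positions("AAGAATTCTT", "GAATTC"))
```

For a random sequence, each distinct search pattern of a given length is expected at about the same frequency, so two patterns give about twice as many recognition sites as one.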
The use of Type-IIs endonucleases in mapping genomic markers has been described in, e.g., Brenner, et al., P.N.A.S. 86:8902-8906 (1989). The methods described involved cleavage of genomic DNA with a Type-IIs endonuclease, followed by polymerization with a mixture of the four deoxynucleotides as well as one of the four specific fluorescently labelled dideoxynucleotides (ddA, ddT, ddG or ddC). Each successive unpaired nucleotide within the overhang of the Type-IIs cleavage site would be filled by either a normal nucleotide or the labelled dideoxynucleotide. Where the latter occurred, polymerization stopped. Thus, the polymerization reaction yields an array of double stranded fluorescent DNA fragments of slightly different sizes. By reading the size from smallest size to largest, in each of the nucleotide groups, one can determine the specific sequence of the overhang. However, this method can be time consuming and yields only the sequence of the overhang region.
Oligonucleotide probes have long been used to detect complementary nucleic acid sequences in a nucleic acid of interest (the “target” nucleic acid). In some assay formats, the oligonucleotide probe is tethered, i.e., by covalent attachment, to a solid support, and arrays of oligonucleotide probes immobilized on solid supports have been used to detect specific nucleic acid sequences in a target nucleic acid. See, e.g., U.S. patent application Ser. No. 08/082,937 (currently abandoned), filed Jun. 25, 1993, which is incorporated herein by reference. Others have proposed the use of large numbers of oligonucleotide probes to provide the complete nucleic acid sequence of a target nucleic acid but failed to provide an enabling method for using arrays of immobilized probes for this purpose. See U.S. Pat. Nos. 5,202,231 and 5,002,867.
The development of VLSIPS™ (Very Large Scale Immobilized Polymer Synthesis) technology has provided methods for making very large combinations of oligonucleotide probes in very small arrays. See U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is incorporated herein by reference in its entirety for all purposes. U.S. patent application Ser. No. 08/082,937, incorporated above, also describes methods for making arrays of oligonucleotide probes that can be used to provide the complete sequence of a target nucleic acid and to detect the presence of a nucleic acid containing a specific nucleotide sequence. Typically, the length of the nucleic acid probes on the substrate according to the present invention will be between about 5 and 100 bases, between about 5 and 50 bases, between about 8 and 30 bases, or between about 8 and 15 bases.
The construction of genetic linkage maps and the development of physical maps are essential steps on the pathway to determining the complete nucleotide sequence of the human or other genomes. Present methods used to construct these maps rely upon information obtained from a range of technologies including gel-based electrophoresis, hybridization, polymerase chain reaction (PCR) and chromosome banding. These methods, while providing useful mapping information, are very time consuming when applied to very large genome fragments or other nucleic acids. There is therefore a need to provide improved methods for the identification and correlation of genetic markers on a nucleic acid which can be used to rapidly generate genomic maps. The present invention meets these and other needs.