While the most abundant and best-studied type of variant in the human genome is the single-nucleotide polymorphism (SNP), it is increasingly clear that the so-termed “fine-structural-variations”, comprising alterations of copy number (insertions, deletions and duplications), inversions, translocations and other sequence rearrangements, are integral features of the human and other genomes. These types of variations appear to be present at much greater frequency in the general population than originally thought. Evidence is mounting that structural variants can comprise millions of nucleotides of heterogeneity in each individual. Understanding the role of fine-structural-variations in genome evolution, interaction with the environment, phenotypic diversity and disease is among the most actively investigated areas of current genomic research. For review, refer to Feuk et al (2006), Redon et al (2006), Check (2005), Cheng et al (2005), and Bailey et al (2002).
In comparison to the analysis of SNPs, efficient high throughput methods for the analysis of fine-structural-variations are not well developed. An important first step is the technique of array comparative genomic hybridization (array CGH) (Pinkel et al, 1998; Pinkel et al, U.S. Pat. Nos. 5,830,645 and 6,159,685), which enables the quantification of relative copy numbers between a target DNA and a reference DNA. Array CGH allows reliable detection of deoxyribonucleic acid (DNA) copy-number differences between DNA samples with a resolution at the level of a single arrayed bacterial artificial chromosome (BAC) clone (Snijders et al, 2001; Albertson et al, 2000; Pinkel et al, 1998). The adaptation of array CGH to cDNA (Heiskanen et al, 2000; Pollack et al, 1999) and to high-density oligonucleotide array platforms (Bignell et al, 2004; Brennan et al, 2004; Huang et al, 2004; Lucito et al, 2003) further extends the resolution and utility of this approach. Array CGH has led to the identification of gene copy number alterations that are associated with tumors (Pinkel and Albertson, 2005; Inazawa et al, 2004; Albertson and Pinkel, 2003; Pollack et al, 2002) and with disease progression (Gonzalez et al, 2005).
1. Fosmid Paired-End Mapping
Despite its usefulness for copy number determination, array CGH is not suited to address other types of genomic structural variations, most notably inversions, translocations and other types of nucleic acid rearrangements. Tuzun et al (2005) attempted to address these limitations with an approach termed “fosmid paired-end mapping.” This approach relies on the headful mechanism of fosmid packaging to produce genomic DNA libraries with reasonably uniform ~40-kilobase pair (kb) genomic inserts from test subjects. Experimentally, the actual fragments range from 32- to 48-kb, that is, within 3 standard deviations of the mean of 39.9 ± 2.76 kb. End-terminal sequencing of randomly selected ~40-kb library inserts produces pairs of short sequence tags in which each tag-pair marks two genomic positions separated by approximately 40-kb along the length of the target DNA. The tag-pairs are then computationally aligned to a reference genomic assembly, and any discordance with either their expected orientation or their ~40-kb separation distance denotes the presence of at least one structural difference between the target and the reference nucleic acid spanning that region. Tag-pairs having map positions that are separated by more than 40-kb signify the presence of a deletion in the target DNA with respect to the reference; map positions separated by less than 40-kb signify an insertion of DNA in the target. Inconsistencies in the orientation of the pair of mapped tags denote potential DNA inversions or other complex chromosomal rearrangements. Chromosomal translocations are signified by assignment of the members of a tag-pair to two different chromosomes on the reference sequence. Analysis of over a million individually purified fosmid clone inserts by conventional DNA sequencing enabled Tuzun et al (2005) to identify nearly 300 sites of structural variation between the test subject and the reference genomic assembly.
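The interpretation rules described above for aligned tag-pairs can be sketched in a few lines of Python. This is an illustrative sketch only and is not taken from Tuzun et al (2005): the names `MappedTag` and `classify_tag_pair`, the 3-standard-deviation tolerance, and the convention that a concordant pair maps as '+' upstream and '-' downstream are assumptions made for the example. Only the insert mean and standard deviation (39.9 and 2.76 kb) come from the text.

```python
from dataclasses import dataclass

# Insert-size statistics reported in the text for the fosmid library (kb).
MEAN_INSERT_KB = 39.9
SD_INSERT_KB = 2.76
N_SD = 3  # assumed tolerance: spans beyond 3 standard deviations are discordant


@dataclass(frozen=True)
class MappedTag:
    chrom: str     # reference chromosome the tag aligns to
    pos_kb: float  # alignment position on the reference, in kb
    strand: str    # '+' or '-'


def classify_tag_pair(a: MappedTag, b: MappedTag) -> str:
    """Classify one end-sequence pair against the reference assembly."""
    # Tags on different reference chromosomes suggest a translocation.
    if a.chrom != b.chrom:
        return "translocation"
    # Assumed convention: a concordant pair maps '+' upstream, '-' downstream.
    first, second = sorted((a, b), key=lambda t: t.pos_kb)
    if (first.strand, second.strand) != ("+", "-"):
        return "inversion_or_rearrangement"
    span = second.pos_kb - first.pos_kb
    tolerance = N_SD * SD_INSERT_KB
    # A span larger than expected means reference sequence is absent in the target.
    if span > MEAN_INSERT_KB + tolerance:
        return "deletion_in_target"
    # A span smaller than expected means the target carries extra sequence.
    if span < MEAN_INSERT_KB - tolerance:
        return "insertion_in_target"
    return "concordant"
```

For example, `classify_tag_pair(MappedTag("chr1", 100.0, "+"), MappedTag("chr1", 160.0, "-"))` reports a deletion in the target, because the 60-kb span on the reference exceeds the expected insert size. A practical pipeline would additionally require several independent clones to support each call before reporting a variant.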
The authors did not teach or disclose other methods to create tag-pairs, to create tag-pairs of different spacing to change the spatial resolution of analysis, to improve the homogeneity of the insert lengths in their library, or to improve economy by use of the next generation DNA sequencers, nor did they disclose methods to produce other types of sequence tag-pairs, such as those of the present invention, that can demarcate genomic positions based on the location and/or separation distance between pairs of adjacent endonuclease cleavage sites.
Many types of fine-structural-variations are not resolved by the ~40-kb resolution window fixed by the fosmid paired-end mapping approach. Fosmid paired-end mapping has further limitations. Fosmid vectors propagate in host cells at a very low copy number, a property used to minimize potential recombination, rearrangement and other artifacts encountered during the propagation of certain genomic sequences in a microbial host. Despite the current use of amplifiable versions of fosmid vectors (Szybalski, U.S. Pat. No. 5,874,259), terminal sequencing of fosmid clones to generate sequence tags still has very poor economy due to low DNA yield when compared to conventional plasmids, making high-throughput automated template production and sequencing difficult to maintain. Furthermore, two separate sequencing reactions are required to generate a tag-pair sequence from a single fosmid DNA template, thereby reducing the economy further.
While fosmid paired-end mapping is a useful start to identify fine-structural-variations in the human genome, the immense cost and logistical effort required to purify and sequence more than a million fosmid insert ends for each test subject preclude its use in broad population and cohort surveys to identify genomic variations that could be associated with complex diseases, with responses to environmental factors and the like. Moreover, fosmid vectors and their variants generally propagate at very low copy numbers in host cells, making reliable automated DNA production and sequencing difficult to maintain. Hence, there is a need for an efficient, robust, high throughput and low cost method for the identification of fine-structural-variations for use in genomic and association studies to link these genetic elements to disease, disease progression and disease susceptibility.
2. Existing Methods for the Generation of Genomic Tags
A variety of DNA-based fingerprinting approaches have been described in the art to characterize and to compare genomes (Wimmer et al, 2002; Kozdroj and van Elsas, 2001; Rouillard et al, 2001; Schloter et al, 2000). All these approaches employ some combination of restriction endonuclease digestion of target DNA, PCR amplification, or gel electrophoretic separation. These approaches share the laborious need to extract candidate DNA fragments from gels for DNA sequencing. A step forward was the work of Dunn et al (2002), who described a method using the type IIS/type IIG restriction endonuclease Mme I to generate “Genomic Signature Tags” (GSTs) for analyzing genomic DNA. GSTs were generated by ligation of adaptors bearing a Mme I recognition site to genomic DNA fragments that were created by an initial digestion of the target genomic DNA with a type II restriction endonuclease followed by a second digestion with a frequent cutting tagging enzyme. Digestion of the adaptor-ligated DNA with Mme I created a 21-bp tag (GST) with a fixed position in the DNA relative to the sites recognized by the initial restriction enzyme digestions. Following amplification by PCR, purified GSTs were oligomerized for cloning and DNA sequencing. The identity of the tags and their relative abundance were used to create a high-resolution “GST sequence profile” of genomic DNA that can be used to identify and quantify the genome of origin within a given complex DNA isolate. Using Yersinia pestis as a model system, Dunn et al (2002) were able to define regions in a relatively simple genome that might have undergone changes that added or deleted restriction sites. However, the method of Dunn et al (2002) has limited utility for complex genomes such as that of man, where most structural variations are not revealed by the simple gain or loss of a site for the small number of restriction endonucleases under investigation.
Moreover, the number of GSTs required to cover a large genome or to analyze multiple samples for even one restriction site is prohibitive. In contrast, the GVT-pairs of the present invention provide the economy and the analytical power to profile complex genomes or to extend analysis to multiple DNA samples.
Versions of a method known as Serial Analysis of Gene Expression (SAGE), first described by Velculescu et al (1995) and Kinzler et al (1995) (U.S. Pat. No. 5,695,937), also made use of a type IIS or a type IIG restriction endonuclease to generate DNA tags (Ng et al, 2005; Wei et al, 2004; Saha et al, 2002). The so-termed “SAGE tags” were generated from cDNA templates to provide an assessment of the complexity and relative abundance of cDNA species in a biological sample. Later versions of SAGE, referred to as “LongSAGE”, made use of Mme I digestion to create sequence tags of 21-bp to tag mRNA transcripts (Saha et al, 2002). The most current refinement, termed “SuperSAGE”, made use of the type III restriction endonuclease EcoP15 I to produce a longer tag of 25- to 27-bp for improved mRNA assignment to the genome (Matsumura et al, 2003). Although the present invention also makes use of type IIS, type IIG or type III restriction endonucleases to generate sequence tags, the resulting GVT-pairs of the invention are fundamentally different from the aforementioned SAGE and GST tags in their methods of production as well as in their improved informational content. Pairs of spatially linked tags of the present invention offer a marked improvement in efficiency and analytical power over the use of a single unlinked tag for the generation of high-resolution physical maps that are particularly useful for characterizing novel genomes or for annotating genomes and DNA samples for fine-structural-variations.
The recent work of Ng et al (2005) described a further development of the SAGE method. The investigators made use of a method pioneered by Collins and Weissman (1984) in which circularization of a DNA fragment, also referred to as intra-molecular DNA ligation, was employed to link distal DNA segments together into a vector to produce the so-termed “genomic jumping libraries” (Collins et al, 1987). Ng et al circularized individual cDNAs to link their 5′- and 3′-derived SAGE tags together to produce “Paired-End Ditags” (PETs), which are then oligomerized to facilitate efficient sequencing. PETs are useful for genomic annotation by the identification of transcription start sites and poly-adenylation sites of transcription units to demarcate gene boundaries and to aid the identification of their flanking regulatory sequences. While the GVT-pair of the present invention and the PET both rely on intra-molecular ligation to achieve linkage of DNA markers, only the GVT-pair of the present invention integrates physical distance and other useful information, such as linkage of adjacent restriction sites, thereby making the GVT-pair unique and useful for detailed genomic structural analysis. Ng et al (2005) did not teach methods to create spatially defined tags or tags based on other criteria as described in the present disclosure, nor did they reveal how genomic fine-structural-variations can be derived using their PET approach, nor did they disclose methods to generate sequence tags other than through the exclusive use of the type IIS restriction endonuclease Mme I. Finally, Ng et al (2005) did not anticipate methods to enable the efficient use of the next generation of short read DNA sequencers.
Berka et al (2006) (U.S. Patent Application 2006/0292611) and Kobel et al (2007) recently described a method of pair-end mapping of DNA that is functionally similar to the present invention, but their method differs fundamentally in the spatial orientation of the final tagged DNA product and suffers from certain important disadvantages. In the method of Kobel et al (2007) and Berka et al (2006), the workers ligate biotinylated hairpin adaptors to each end of a target DNA insert, after which the molecule is circularized by ligating the adaptor sequences together to bring the original target DNA termini into close proximity to each other, situated on either side of the newly juxtaposed pair of biotinylated adaptors. The circular molecule is then cleaved randomly to create exposed ends that are at random distances from the termini of the original target DNA insert. The linear DNA fragments so generated are recovered by avidin-affinity chromatography and are sequenced along their entire length.
Kobel et al (2007) made use of the next generation DNA sequencer, GENOME SEQUENCER FLX (Roche Diagnostics, Indianapolis, Ind.; 454 Life Sciences Corp, Branford, Conn.) (commonly referred to as a “454-sequencer”) to derive the original terminal sequences of the target DNA inserts. However, the products produced as described cannot be interrogated effectively on a SOLEXA GENOME ANALYZER (Illumina, San Diego, Calif.) (commonly referred to as a “SOLEXA sequencer”), on a SOLiD sequencer (Applied Biosystems, Foster City, Calif.) or on any other next generation sequencing platform that produces “short sequence reads”. The DNA products of Kobel et al (2007) and Berka et al (2006) adopt the so-called “outside-in” topology, whereby the original termini (“outside”) of the target DNA insert are orientated in an inverted position (“in”), separated by a newly juxtaposed biotinylated adaptor pair that is located randomly within the length of the resulting DNA fragment. As a consequence of the “outside-in” topology, sequence determination of several hundred bases or more is necessary to sequence across the biotinylated adaptor pair and through the other side of the DNA product in order to determine the terminal sequences of the original target DNA fragment. The majority of products produced in this way fall within the four-hundred-bp read length of the 454-sequencer. Short read DNA sequencers, such as the SOLEXA, enjoy a ten-fold or more lower operating cost than the 454-sequencer but have a typical supported read length of 50 bases, which is not sufficient to interrogate the products produced by the method of Berka et al (2006) and Kobel et al (2007) with absolute precision. Berka et al (2006) described a variant of their approach whereby the type IIS restriction endonuclease Mme I is used to produce tags of approximately 20 bases corresponding to the terminal sequence of the original DNA insert.
By this approach, the workers fixed the length of the tag to be within the range of the DNA sequencing capability of the SOLEXA-type DNA sequencer. However, the tags are still in an “outside-in” topology, and the fixed 20-base tags generated by Mme I digestion are simply too short to map unambiguously to complex genomes to be useful as a genomic tool or to aid sequence assembly. Moreover, a fixed tag of 20 bases would not benefit from the recent improvements in read length of the next generation short-read DNA sequencers. Currently the supported SOLEXA read length is 50 bases from each side of a DNA template, with an expected increase to 76 bases later in 2009.
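The read-length argument made against the “outside-in” topology can be illustrated with simple arithmetic. The following Python sketch is a simplified model and is not part of the method of Berka et al (2006) or Kobel et al (2007): the function name, the adaptor-pair length and the minimum useful tag length are hypothetical values chosen for illustration. It asks whether a single read that starts at one randomly sheared end of the product reaches across the adaptor pair and far enough into the other side to capture both original insert termini.

```python
def read_recovers_both_termini(left_flank_bp: int,
                               adaptor_bp: int,
                               read_len_bp: int,
                               min_tag_bp: int = 20) -> bool:
    """Model an "outside-in" product laid out as:
    [left flank][adaptor pair][right flank].

    The original insert termini abut the adaptor pair on either side, so a
    read starting at the left end must cover the whole left flank, the
    adaptor pair, and at least min_tag_bp bases of the right flank.
    """
    # A left flank shorter than the minimum tag cannot be mapped on its own.
    if left_flank_bp < min_tag_bp:
        return False
    return read_len_bp >= left_flank_bp + adaptor_bp + min_tag_bp
```

Under these assumed values, a product with a 150-bp left flank and a 44-bp adaptor pair needs a read of at least 214 bases, so `read_recovers_both_termini(150, 44, 400)` holds for a 400-base read whereas `read_recovers_both_termini(150, 44, 50)` does not for a 50-base read.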
The present invention overcomes the aforementioned limitations through: (1) the ability to produce GVT-pairs whereby the spacing of tag-pair members on the target DNA can be engineered from a kb or less to several hundreds of kb or more to tailor detection resolution to suit the analysis of different types of nucleic acids and to suit any given experimental design; (2) considerably more accurate and uniform spacing between tag-pair members for greater analytical precision; (3) the ability to produce genomic tag-pairs based on criteria other than separation distance, such as the creation of tag-pairs based on the location and/or the relative separation distance of adjacent cleavable endonuclease sites for improved interrogation of the target nucleic acid sample; and (4) the adaptation of the methods of the present invention for use with the next generation massively parallel DNA sequencers for far greater economy. By adopting a so-termed “outside-out” topology, whereby the juxtaposed terminal sequence tags (GVT-pairs) retain the same spatial orientation as the termini of the original target DNA insert, and through the use of frequent cutting type II restriction endonucleases to generate GVTs of an average length of 100- to 200-bp, read-length improvements on the SOLEXA “pair-end-read” platform can be translated directly into ever longer GVT sequences that are limited only by the actual read length of the instrument.