The assembly of contiguous cloned genomic reagents is a necessary step in the process of disease-gene identification using a positional cloning approach. The rapid development of high-density genetic maps based on polymorphic simple sequence repeats has facilitated contig assembly using sequence-tagged site (STS) content mapping. Most contig construction efforts have relied on yeast artificial chromosomes (YACs), since their large insert size uses the current STS map density more advantageously than bacterial-hosted systems. This approach has been validated for multiple human chromosomes, with YAC coverage ranging from 65% to 95% and contigs of 11 to 36 Mb being described (Chumakov et al., Nature 377 (Supp.):175-297, 1995; Doggett et al., Nature 377 (Supp.):335-365, 1995b; Gemmill et al., Nature 377 (Supp.):299-319, 1995; Krauter et al., Nature 377 (Supp.):321-333, 1995; Shimizu et al., Cytogenet. Cell Genet. 70:147-182, 1995; van-Heyningen et al., Cytogenet. Cell Genet. 69:127-158, 1995).
Despite numerous successes, the YAC cloning system is not a panacea for cloning the entire genome of complex organisms, because intrinsic limitations result in substantial proportions of chimeric clones (Green et al., Genomics 11:658-669, 1991; Bellanne-Chantelot et al., Cell 70:1059-1068, 1992; Nagaraja et al., Nuc. Acids Res. 22:3406-3411, 1994), as well as clones that are rearranged, deleted or unstable (Neil et al., Nuc. Acids Res. 18:1421-1428, 1990; Wada et al., Am. J. Hum. Genet. 46:95-106, 1990; Zuo et al., Hum. Mol. Genet. 1:149-159, 1992; Szepetowski et al., Cytogenet. Cell Genet. 69:101-107, 1995). At least some of these cloning artifacts are a product of the recombinational machinery of yeast acting on the various types of repetitive elements in mammalian DNA (Neil et al., supra. 1990; Green et al., supra. 1991; Schlessinger et al., Genomics 11:783-793, 1991; Ling et al., Nuc. Acids Res. 21:6045-6046, 1993; Kouprina et al., Genomics 21:7-17, 1994; Larionov et al., Nuc. Acids Res. 22:4154-4162, 1994).
Accordingly, alternative cloning systems must be used in concert with YAC-based approaches to complement localized YAC cloning deficiencies, to enhance the resolution of the physical map, and to provide a sequence-ready resource for genome-wide DNA sequencing. Several exon trapping methodologies and vectors have been described for the rapid and efficient isolation of coding regions from genomic DNA (Auch et al., Nuc. Acids Res. 18:6743-6744, 1990; Duyk et al., Proc. Natl. Acad. Sci., USA 87:8995-8999, 1990; Buckler et al., Proc. Natl. Acad. Sci., USA 88:4005-4009, 1991; Church et al., Nature Genet. 6:98-105, 1994). The major advantage of exon trapping is that the expression of cloned genomic DNAs (cosmid, P1 or YAC) is driven by a heterologous promoter in tissue culture cells. This allows coding sequences to be identified without prior knowledge of their tissue distribution or developmental stage of expression. A second advantage is that exon trapping identifies coding sequences from only the cloned template of interest, which eliminates the risk of characterizing highly conserved transcripts from duplicated loci. This is not the case for either cDNA selection or direct library screening.
Exon trapping has been used successfully to identify transcribed sequences in the Huntington's disease locus (Ambrose et al., Hum. Mol. Genet. 1:697-703, 1992; Taylor et al., Nature Genet. 2:223-227, 1992; Duyao et al., Hum. Mol. Genet. 2:673-676, 1993) and BRCA1 locus (Brody et al., Genomics 25:238-247, 1995; Brown et al., Proc. Natl. Acad. Sci., USA 92:4362-4366, 1995). In addition, a number of disease-causing genes have been identified using exon trapping, including the genes for Huntington's disease (The Huntington's Disease Collaborative Research Group, Cell 72:971-983, 1993), neurofibromatosis type 2 (Trofatter et al., Cell 72:791-800, 1993), Menkes disease (Vulpe et al., Nature Genet. 3:7-13, 1993), Batten Disease (The International Batten Disease Consortium, Cell 82:949-957, 1995), and the gene responsible for the majority of Long-QT syndrome cases (Wang et al., Nature Genet. 12:17-23, 1996).
A 700 kb CpG-rich region in band 16p13.3 has been shown to contain the disease gene for approximately 90% of the cases of autosomal dominant polycystic kidney disease (PKD1) (Germino et al., Genomics 13:144-151, 1992; Somlo et al., Genomics 13:152-158, 1992; The European Polycystic Kidney Disease Consortium, Cell 77:881-894, 1994) as well as the tuberin gene (TSC2), responsible for one form of tuberous sclerosis (The European Chromosome 16 Tuberous Sclerosis Consortium, Cell 75:1305-1315, 1993). An estimated 20 genes are present in this region of chromosome 16 (Germino et al., Kidney Int. Supp. 39:S20-S25, 1993). Characterization of the region surrounding the PKD1 gene in 16p13.3, however, has been complicated by duplication of a portion of the genomic interval more proximally at 16p13.1 (The European Polycystic Kidney Disease Consortium, supra. 1994).
This chromosomal segment serves as a challenging test for large-insert cloning systems in E. coli and yeast since it resides in a GC-rich isochore (Saccone et al., Proc. Natl. Acad. Sci., USA 89:4913-4917, 1992) with an abundance of CpG islands (Harris et al., Genomics 7:195-206, 1990; Germino et al., supra. 1992), genes (Germino et al., supra. 1993) and Alu repetitive sequences (Korenberg et al., Cell 53:391-400, 1988). Chromosome 16 also contains more low-copy repeats than other chromosomes, with almost 25% of its cosmid contigs hybridizing to more than one chromosomal location when analyzed by fluorescence in situ hybridization (FISH) (Okumura et al., Cytogenet. Cell Genet. 67:61-67, 1994). These types of repeats and sequence duplications interfere with "chromosome walking" techniques that are widely used for identification of genomic DNA, and they pose a challenge to hybridization-based methods of contig construction. Because these techniques rely on hybridization to identify clones containing overlapping fragments of genomic DNA, there is a high likelihood of "walking" into clones derived from homologues instead of clones derived from the authentic gene. In a similar manner, the sequence duplications and chromosome 16-specific repeats also interfere with the unambiguous determination of a complete cDNA sequence that encodes the corresponding protein. Furthermore, low-copy repeats may lead to instability of this interval in bacteria, yeast and higher eukaryotes.
Thus, there is a need in the art for methods and compositions which enable accurate identification of genomic and cDNA sequences corresponding to authentic genes present on highly repetitive portions of chromosome 16, as well as genes similarly situated on other chromosomes. The present invention satisfies this need and provides related advantages as well.