The present invention relates to DNA probes and probe clusters useful for studying gene localization and organization.
The following references are referred to by corresponding number herein:
1. Robertson, M., Nature, 306:733 (1983).
2. Rabbitts, T. H., et al., Nature, 306:760 (1983).
3. Heisterband, N., et al., Nature, 306:239 (1983).
4. Bartram, C. R., et al., Nature, 306:277 (1983).
5. Kan, Y. W., et al., Proc. Natl. Acad. Sci. USA, 75:5631 (1983).
6. Humphries, S. E., et al., Med. Bull., 39:343 (1983).
7. Orkin, S. H., et al., Nature, 296:627 (1982).
8. Orkin, S. H., et al., Prog. Hematol., 13:49 (1983).
9. Davies, K. E., et al., Nucleic Acids Res., 11:2303 (1983).
10. Murray, J. M., et al., Nature, 300:69 (1982).
11. Gusella, J. F., et al., Nature, 306:234 (1983).
12. Steinmetz, M., et al., Nature, 300:35 (1982).
13. Hayes, C. E., et al., Science, 223:559 (1984).
14. Zabel, B. H., et al., Proc. Natl. Acad. Sci. USA, 80:6932 (1983).
15. Botstein, D., et al., Am. J. Human Genetics, 32:314 (1980).
16. Moller, G. (ed), Immuno. Rev., 70 (1983).
17. Erlich, H. A., et al., Proc. Natl. Acad. Sci. USA, 80:2300 (1983).
18. Maniatis, T., et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, 280 (1982).
19. Blin, N., et al., Nucleic Acids Res., 3:2302 (1976).
20. Maniatis, ibid, 282.
21. Fangman, W. L., Nucleic Acids Res., 5:653 (1978).
22. Schwartz, D. C., et al., Cold Sp.HSQB, 7:189 (1983); Schwartz, D. C. and Cantor, C. R., Cell, (May, 1984).
23. Dugaiczyk, A., et al., J. Mol. Biol., 96:171 (1975).
24. Ryan, M. J., et al., J. Biol. Chem., 254:5817 (1979).
25. Maniatis, ibid, p. 300.
26. Langer, P. R., et al., Proc. Natl. Acad. Sci. USA, 78:6633 (1981).
27. Hood, L., et al., Cell, 28:685 (1982).
28. Maniatis, ibid, 187, 211.
29. Kraus, J. and Rosenberg, L. E., Proc. Natl. Acad. Sci. USA, 79:4015 (1982).
30. Sood, A. K., et al., Proc. Natl. Acad. Sci. USA, 78:616 (1981).
31. Das, H. K., et al., Proc. Natl. Acad. Sci. USA, 80:1531 (1983).
32. Seed, B., Nucleic Acids Res., 11:2427 (1983).
33. Barker, D., et al., Cell, 36:131 (1984).
34. Steinmetz, M., et al., Science, 222:727 (1983).
35. Kaufman, J. F., et al., Cell, 36:1 (1984).
35a. Gladstone, P., et al., Proc. Natl. Acad. Sci. USA, 79:1235 (1982).
36. Maniatis, ibid, 284.
37. Ish-Horowicz, et al., Nucleic Acids Res., 9:2989 (1981).
38. Grosveld, F. G., et al., Nucleic Acids Res., 10:6715 (1982).
39. Maniatis, ibid, 115.
40. Maniatis, ibid, 382.
41. Maniatis, ibid, 109.
42. Maniatis, ibid, 304.
43. Maniatis, ibid, 392.
44. Godson, G. N. in Methods of DNA and RNA Sequencing, ed. Weissman, S. M. (Praeger, N.Y., N.Y.), pp 69-111 (1983).
45. Bayer, E. A. and Wilchek, M., Methods Biochem. Analysis, 26:1 (1980).
46. Harper, M. E. and Saunders, G. F., Chromosoma (Berlin), 83:431 (1980).
Gene localization on chromosomes and an understanding of gene organization within large gene groups have become important areas of study in human genetics. A major application of gene localization is in understanding and predicting certain disease states. For example, translocation of marker genes from one chromosomal region to another may play a role in the development of cancer cells. One of the known oncogenes in man and rodents, termed myc, has been localized to a chromosome region which shows a consistent translocation from its normal chromosomal environment to one of three other chromosomes in certain forms of tumors such as Burkitt's lymphoma. Because the location of the genes for immunoglobulins was previously known, it could be determined that the chromosome segment always became translocated to a second chromosome region containing immunoglobulin genes. Further studies have shown that the myc oncogene is, in fact, located close to the boundary of the translocation point, suggesting that a basic mechanism and cause of this lymphoma is the movement of the oncogene from its normal chromosome environment to an immunoglobulin gene environment in a cell where the immunoglobulin genes are being actively expressed (reviewed in references 1 and 2). Similarly, translocation of the abl oncogene may be a major determinant of chronic myelocytic leukemia (references 3 and 4).
Another important application of gene localization is in identifying and furthering an understanding of inheritable disorders. Restriction endonuclease analysis of genomic DNA has made it possible to identify DNA polymorphisms which are linked closely to normal or mutated genes associated with available probes (reviewed in references 5,6). The relationship between DNA polymorphisms and disease states was shown originally in studies on hemoglobinopathies, where certain polymorphisms are more frequent in patients with sickle cell disease, and where certain varieties of thalassemia are more commonly associated with specific combinations of restriction sites in intergenic DNA (references 7, 8). More recently, systematic studies have uncovered polymorphic DNA sites that are linked to and flank the locus of mutations which are responsible for Duchenne's muscular dystrophy (references 9, 10), and a fortuitously discovered probe associated with Huntington's disease has been used to identify polymorphic DNA which is closely linked to the gene responsible for Huntington's disease (reference 11). The probe makes it possible to diagnose people who carry the gene for Huntington's disease before the onset of the disease.
Heretofore, gene localization has been approached either by classical studies on gene linkage related to inheritance, or by microscopy and banding techniques for chromosomes. In the classical genetics approach, the frequency of co-inheritance of one phenotypic trait, whose gene location is unknown, with a phenotypic trait whose gene location is known provides a measure of the linkage (distance) between the two genes, and this distance provides a rough measure of the relative chromosome positions of the two phenotypic genes. The classical genetic approach is severely limited in man, where controlled breeding is not possible, and where family studies on the inheritability of phenotypic disorders must therefore be relied on. Family studies in man and even genetic studies in inbred strains of mammals are generally unable to resolve gene linkages located closer than about 5 to 10 million base pairs apart, and can give aberrant results that cannot be readily understood until the actual physical structure of the gene is known. As an example of the latter problem, the I-J gene of suppressor lymphocyte surface antigen was initially considered to be one of the genes of the major histocompatibility complex (MHC), and this error was only corrected when portions of the MHC were actually cloned and partially sequenced (references 12, 13).
Genomic DNA regions of unique sequence can, in principle, be localized on a chromosome by in situ hybridization using single-copy DNA probes. In situ hybridization of nucleic acid probes to spreads of polytene chromosomes in Drosophila has been remarkably successful. The polytene chromosomes, which may be amplified over a thousand fold, allow site-specific binding of up to a thousand or more probes at the same location, making probe detection by autoradiography or by fluorescence or enzyme-reporter microscopy quite straightforward. Unfortunately, in situ hybridization to single-copy genes in human DNA is much more difficult to detect, since only a single site is available for probe binding. Such a site can only be identified autoradiographically with relatively long periods of exposure and by counting grains over many chromosome samples to obtain a sufficient distribution of grains to verify probe localization. With rare exceptions, and particularly where only non-polytene chromosomes are available, the in situ hybridization technique cannot distinguish between sequences located closer than about 5 to 10 million base pairs apart (reference 14), comparable to the resolution achievable with phenotypic markers in classical genetic studies. The in situ hybridization technique for locating genes on a chromosome is also subject to artifactual errors such as a tendency for grains to accumulate at the tip or at the center of a chromosome. Such an artifact may account for the still conflicting data from in situ hybridization studies as to whether the beta globin system is located near the tip of chromosome 11, or closer to the centromere.
In studies on polymorphic DNA regions, discussed above, it has been possible heretofore to localize identified polymorphisms only in the relatively few chromosome regions for which marker probes have been available, such as in the MHC region. In principle, if a complete family of probes spaced evenly along the genome were available, it would be possible to screen individuals for inherited dominant or even recessive disorders, and by comparing many DNA polymorphic sites in the affected individuals and unaffected family members, to localize and derive markers (probes) closely linked with every disorder. This theory has been discussed previously (reference 15). Since there are approximately 3,000 centimorgans of recombination distance distributed along the human genome, 300 evenly spaced markers would provide a marker for every 10 centimorgans, and 600 markers, for every 5 centimorgans. The probability of recombination between such a polymorphic marker and the given disease marker would be less than 1 in 20 in each generation. This set of 300 or 600 markers would greatly facilitate localization and identification of the precise genetic effects in gene regions responsible for these inheritable effects.
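The marker-spacing arithmetic above can be checked with a short sketch. It takes the figure of about 3,000 centimorgans from the text and assumes perfectly even spacing. It further assumes that one centimorgan corresponds to roughly a 1% chance of recombination per generation, an approximation that holds only over short distances. The function name is illustrative only.

```python
GENOME_CM = 3000.0  # approximate recombination length of the human genome (from the text)

def marker_spacing(n_markers, genome_cm=GENOME_CM):
    """Return (spacing in cM, worst-case distance to the nearest marker in cM,
    approximate recombination fraction at that worst-case distance)."""
    spacing = genome_cm / n_markers
    worst = spacing / 2.0    # a locus lying midway between two adjacent markers
    recomb = worst / 100.0   # assumption: 1 cM is about 1% recombination
    return spacing, worst, recomb

for n in (300, 600):
    spacing, worst, recomb = marker_spacing(n)
    print(f"{n} markers: one every {spacing:g} cM, "
          f"nearest marker within {worst:g} cM, "
          f"recombination fraction at most about {recomb:g}")
```

With 300 markers the worst-case locus lies 5 centimorgans from its nearest marker, which under the stated approximation corresponds to a recombination probability of about 1 in 20; with 600 markers the worst case is halved.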
In order to generate such DNA probes for identifying polymorphic fragments by prior art techniques, many random DNA segments must be analyzed to see which ones provide polymorphic markers. Each one of these markers must be localized by the in situ hybridization technique described above, or by techniques involving hybridization and detection in a variety of somatic hybrid cell lines containing various human chromosomes or segments of chromosomes, or by hybridization to probes made from assorted chromosome libraries. The latter method is relatively inefficient due to the small amount of DNA that can be obtained in chromosomal sorting procedures. Statistical studies indicate that 900 or more probes would have to be examined in this way in order to obtain a 98% to 99% coverage of the human genome at the desired space intervals, a task that would be exceedingly difficult at best.
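Why randomly chosen probes must outnumber evenly spaced ones can be illustrated with a simple model. The sketch below assumes that markers fall independently and uniformly along a single 3,000 centimorgan line, ignoring chromosome boundaries, so that the fraction of the genome lying within d centimorgans of at least one of n markers is approximately 1 - exp(-2dn/L). The model, the function names, and the choice of d are assumptions made for illustration. They are not the statistical studies referred to above and need not reproduce the 98% to 99% figure.

```python
import math

def covered_fraction(n_markers, reach_cm, genome_cm=3000.0):
    """Approximate fraction of the genome within reach_cm of at least one
    of n_markers placed independently and uniformly at random."""
    return 1.0 - math.exp(-2.0 * reach_cm * n_markers / genome_cm)

def markers_needed(target, reach_cm, genome_cm=3000.0):
    """Smallest marker count whose approximate coverage reaches target
    (0 < target < 1) under the same random-placement assumption."""
    return math.ceil(-math.log(1.0 - target) * genome_cm / (2.0 * reach_cm))

if __name__ == "__main__":
    for n in (300, 600, 900):
        print(n, round(covered_fraction(n, reach_cm=5.0), 3))
```

The point of the sketch is qualitative: random placement leaves gaps that even spacing would not, so several times as many random probes are needed to approach complete coverage at a given interval.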
Considering now investigations of gene relationships in multi-gene arrangements on chromosomes, the best-studied example is the human MHC, which appears to contain at least 40 to 50 class I-like genes, and at least 15 to 20 class II-like genes or pseudo-genes. It is known that the MHC system is highly polymorphic from individual to individual, and that particular alleles of class I or class II genes are associated with a predisposition towards a wide variety of diseases (references 16, 17). The association of polymorphisms with particular disease states may be due to polymorphisms within the known genes of the MHC, or, alternatively, to polymorphisms in presently unidentified class I or class II-type genes, or possibly unrelated genes interspersed within the class I or class II system. Therefore, a complete characterization of all the genes contained within this cluster, and their linear relationship with one another, would make it possible to predict which genes are most likely to be closely associated with particular diseases.
A study of the relationship among genes in a gene cluster or family can lead to greater understanding of gene diversity, gene interaction, and even the identification of previously unrecognized gene products. It is known, for example, that at least two pituitary hormones are encoded by genes contained in a gene cluster. Mapping the genes in this cluster has the potential to uncover DNA sequences that are potential genes of other known pituitary hormones and also genes for hormone-like substances that have not been previously recognized, but which arose during evolution by tandem duplication of pre-existing genes for hormones.
As another example, it is known that there are many interferon-like genes in a cluster for one of the interferon types; similar clusters for interleukin-2 and other lymphokine genes, as well as for colony stimulating factor and nerve growth factors may be identified. Growth factors specific for several different cell types have been reported and it is possible that by mapping genes clustered about the growth factor genes, genes encoding other colony-stimulating factors or the like can be identified.
Similarly, genes for additional coagulation factors, serum proteins, protease inhibitors, transcription or replication factors, cell membrane receptors, immunoglobulin variable or constant regions, and other cell type-specific surface antigens could well be identified by a practical method for surveying gene clusters.
The organization of genes within a gene family has been approachable, heretofore, generally at two levels of resolution. One is the resolution which can be obtained by classical studies of gene linkage during inheritance. As noted above, classical genetic techniques are unable to distinguish phenotypic markers located closer than about 5 to 10 million base pairs apart. The second level of resolution is that accessible by more recently developed recombinant DNA techniques. In a typical procedure, a genomic DNA insert which has been identified, for example, by hybridization with a selected gene probe, is characterized as to restriction sites and/or base sequence. Currently, the largest block of DNA that can be cloned intact is about 40 kilobases. The only method available in the prior art for extending the cloned sequence (beyond this 40 kilobase limit) is a technique known as chromosomal walking, in which the ends of the cloned insert are identified, radiolabeled and used as probes to isolate, from a library of cloned DNA inserts, one or more inserts having a region of overlap with the end region(s) of the original insert. On the average, the radiolabeled end probes will identify inserts whose region of overlap lies near the midpoint of the overlapping inserts. This means that, for inserts of 40 kilobases, each additional insert isolated will extend the map region only about 20 kilobases.
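The cost of chromosomal walking over long distances follows directly from these figures. The sketch below is an idealized estimate for a walk in one direction, assuming that every insert is 40 kilobases and that every step extends the map by exactly the average 20 kilobases stated above. Real walks vary from step to step, and the function name is illustrative.

```python
import math

def walking_inserts(map_kb, insert_kb=40.0, extension_kb=20.0):
    """Estimate how many cloned inserts are needed to span map_kb when
    walking in one direction from a single starting insert."""
    if map_kb <= insert_kb:
        return 1  # the starting insert already covers the region
    # One starting insert, plus one further insert per average extension.
    return 1 + math.ceil((map_kb - insert_kb) / extension_kb)

if __name__ == "__main__":
    for distance in (500, 1000, 2000):
        print(distance, "kb:", walking_inserts(distance), "inserts")
```

Under these assumptions a 2,000 kilobase region would require on the order of a hundred successive inserts, each obtained by a separate round of library screening and characterization.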
The chromosomal walking technique is obviously quite tedious, in that each extension of the map requires screening a genomic DNA library, characterizing the restriction endonuclease sites and/or sequence of the probe-identified insert to locate the new insert in the map, and may require producing new end probes. Further, if one or more of the probes which are used in the procedure are non-unique sequences, these in turn will select for more than one site and cause apparent branching in the map. The maximum map distance that has been achieved to date by this method is about 200 kilobases, in a molecular map of an immune response region of the MHC, in which 18 overlapping inserts were identified (reference 12). This was a particularly favorable system since several probes scattered through the cluster were available.
It is thus apparent that (1) examining gene relationships in a gene region of up to 200 kilobases is generally difficult and uncertain by prior art methods; and (2) neither classical-genetic nor prior art cloning techniques are suited to resolving gene relationships in the range between about 200 kilobases and up to several thousand kilobases.
The present invention provides novel gene probes and gene probe clusters which can be readily designed, according to novel techniques of the invention, for studying questions of gene localization and organization which have been largely inaccessible by prior art genetic analysis methods, discussed above.
A particular object of the invention is to provide a cluster of gene probes for use in localizing a single-copy gene region in mammalian and, in particular, human chromosomes.
Another object of the invention is to provide a method for generating a family of single-copy DNA fragments which are derived from genomic DNA regions located substantially uniformly along the chromosomes of the genome, at an average spacing from one another such that at least one fragment will show some linkage during inheritance to substantially any disease-related gene.
Another specific object of the invention is to provide probes, and methods of preparing same, for studying gene relationships and organization particularly within a region between about 50 and 2,000 kilobases.
Yet another object is to provide a system for rapidly surveying an extensive gene cluster to identify other expressed genes.
According to one aspect of the invention, there is formed a novel probe having a first segment and a second segment which are complementary, respectively, to a first gene region of genomic DNA and to a second gene region located downstream of the first. In the probe, the first segment is connected adjacent the downstream end of the second segment, either by direct ligation, or through a marker segment which allows selection and/or isolation of the probe. In one embodiment of the invention, the marker segment includes a suppressor tRNA which allows for selection of a phage vector containing the probe in a suppressor (-) host. In another embodiment, the marker segment includes a cos site which allows for selection of a cosmid vector containing the probe as an insert. In still another embodiment, the marker segment includes a ligand by which the probe can be isolated by specific binding to an anti-ligand.
The gene probe is constructed, according to a method of the invention, by providing randomly sized pieces of genomic DNA, which may be size fractionated to yield a selected size distribution within a range of sizes which may vary from about 20 to 2,000 kilobases. The DNA pieces are ligated under conditions which produce predominantly circularized monomers, and the monomers are digested with one or more selected restriction endonucleases to release fragments containing opposed end segments of the pieces joined at a junction region. The desired gene-probe fragments are selected by the presence of segments which hybridize to at least one end-segment probe and/or by the presence of a marker segment in the fragment.
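The circularize-and-digest step can be pictured with a toy single-strand model. The sketch below treats a DNA piece as a string, joins its two ends to form a circle, cuts at every occurrence of one recognition site, and returns the fragment that spans the junction. It assumes direct ligation with no marker segment, uses the EcoRI site GAATTC (cut after the G) purely as an example enzyme, and ignores the complementary strand, overhangs, and size selection. The function names are hypothetical.

```python
def cut_positions(piece, site, cut_offset):
    """Cut coordinates (0-based, relative to the linear piece) after the
    piece has been circularized; sites spanning the junction are included."""
    n = len(piece)
    circular = piece + piece[:len(site) - 1]  # lets a site wrap past the end
    cuts = set()
    start = circular.find(site)
    while start != -1:
        cuts.add((start + cut_offset) % n)
        start = circular.find(site, start + 1)
    return sorted(cuts)

def junction_fragment(piece, site="GAATTC", cut_offset=1):
    """Return (downstream_end, upstream_end): the two end segments of the
    original piece that lie side by side in the junction-spanning fragment."""
    cuts = cut_positions(piece, site, cut_offset)
    if not cuts:
        raise ValueError("no restriction site in the circularized piece")
    first, last = cuts[0], cuts[-1]
    downstream_end = piece[last:]   # from the last cut to the end of the piece
    upstream_end = piece[:first]    # from the start of the piece to the first cut
    return downstream_end, upstream_end

if __name__ == "__main__":
    piece = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TTTT"
    tail, head = junction_fragment(piece)
    print(tail + head)  # end segments that were far apart are now adjacent
```

In the released fragment, the segment from the far (downstream) end of the original piece is immediately followed by the segment from its near (upstream) end, which is the arrangement of opposed end segments joined at a junction region described above.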
A gene probe cluster of the invention includes a group of such gene probes, each having a first segment which is complementary to a common gene region of genomic DNA, and a second segment which is complementary to one of a series of second gene regions located downstream of, and at increasingly spaced intervals from the first gene region of the genomic DNA.
The gene probe cluster may be produced by applying the gene probe construction method described above to a series of size-fractionated groups of DNA, or by incorporating a marker segment selectively into different size distributions of unfractionated DNA pieces, as the pieces are being ligated to form circularized monomers.
Also forming part of the invention are novel methods which use the gene probe or probe cluster of the invention to:
1. determine the distance between, and/or orientation of two known genomic gene regions which are separated by a gene spacing of between about 20 and 2,000 kilobases;
2. determine the identity of a gene region which is separated from a known gene region by a gene spacing between about 20 and 2,000 kilobases;
3. generate a series of single-copy probes derived from gene regions which are substantially evenly spaced along genomic DNA by a distance of between about 100 and 2,000 kilobases;
4. localize the chromosomal position of any single-copy gene for which a gene probe exists; and
5. map the identity and positions of genes in a gene family which may cover several thousand kilobases of genome.
These and other objects and features of the invention will become more fully apparent when the following detailed description of the invention is read in conjunction with the accompanying drawings.