Gene localization on chromosomes and an understanding of gene organization within large gene groups have become important areas of study in human genetics. A major application of gene localization is in understanding and predicting certain disease states. For example, translocation of marker genes from one chromosomal region to another may play a role in the development of cancer cells. One of the known oncogenes in man and rodents, termed myc, has been localized to a chromosome region which shows a consistent translocation from its normal chromosomal environment to one of three other chromosomes in certain forms of tumors such as Burkitt's lymphoma. Because the location of the genes for immunoglobulins was previously known, it could be determined that the myc chromosome segment always became translocated to a second chromosome region containing immunoglobulin genes. Further studies have shown that the myc oncogene is, in fact, located close to the boundary of the translocation point, suggesting that a basic mechanism and causation of this lymphoma is the movement of the oncogene from its normal chromosome environment to an immunoglobulin gene environment in a cell where the immunoglobulin genes are being actively expressed (reviewed in references 1 and 2). Similarly, translocation of the Abl oncogene may be a major determinant of chronic myelocytic leukemia (references 3 and 4).
Another important application of gene localization is in identifying and furthering an understanding of inheritable disorders. Restriction endonuclease analysis of genomic DNA has made it possible to identify DNA polymorphisms which are linked closely to normal or mutated genes associated with available probes (reviewed in references 5, 6). The relationship between DNA polymorphisms and disease states was shown originally in studies on hemoglobinopathies, where certain polymorphisms are more frequent in patients with sickle cell disease, and where certain varieties of thalassemia are more commonly associated with specific combinations of restriction sites in intergenic DNA (references 7, 8). More recently, systematic studies have uncovered polymorphic DNA sites that are linked to and flank the locus of mutations which are responsible for Duchenne's muscular dystrophy (references 9, 10), and a fortuitously discovered probe associated with Huntington's disease has been used to identify polymorphic DNA which is closely linked to the gene responsible for Huntington's disease (reference 11). The probe makes it possible to diagnose people who carry the gene for Huntington's disease before the onset of the disease.
Heretofore, gene localization has been approached either by classical studies on gene linkage related to inheritance, or by microscopy and banding techniques for chromosomes. In the classical genetics approach, the frequency of co-inheritance of one phenotypic trait, whose gene location is unknown, with a phenotypic trait whose gene location is known provides a measure of the linkage (distance) between the two genes, and this distance provides a rough measure of the relative chromosome positions of the two phenotypic genes. This classical genetic approach is severely limited in man, where controlled breeding is not possible, and where family studies on the inheritability of phenotypic disorders must therefore be relied on. Family studies in man and even genetic studies in inbred strains of mammals are generally unable to resolve gene linkages located closer than about 5 to 10 million base pairs apart, and can give aberrant results that cannot be readily understood until the actual physical structure of the gene is known. As an example of the latter problem, the I-J gene of suppressor lymphocyte surface antigen was initially considered to be one of the genes of the major histocompatibility complex (MHC), and this error was only corrected when portions of the MHC were actually cloned and partially sequenced (references 12, 13).
Genomic DNA regions of unique sequence can, in principle, be localized on a chromosome by in situ hybridization using single-copy DNA probes. In situ hybridization of nucleic acid probes to spreads of polytene chromosomes in Drosophila has been remarkably successful. The polytene chromosomes, which may be amplified over a thousandfold, allow site-specific binding of up to a thousand or more probes at the same location, making probe detection by autoradiography or by fluorescence or enzyme-reporter microscopy quite straightforward. Unfortunately, in situ hybridization to single-copy genes in human DNA is much more difficult to detect, since only a single site is available for probe binding, and can only be identified autoradiographically with relatively long periods of exposure and by counting grains over many chromosome samples to obtain a sufficient distribution of grains to verify probe localization. With rare exceptions, and particularly where only non-polytene chromosomes are available, the in situ hybridization technique cannot distinguish between sequences located closer than about 5 to 10 million base pairs apart (reference 14), comparable to the resolution achievable with phenotypic markers in classical genetic studies. The in situ hybridization technique for locating genes on a chromosome is also subject to artifactual errors such as a tendency for grains to accumulate at the tip or at the center of a chromosome. Such an artifact may account for the still conflicting data from in situ hybridization studies as to whether the beta globin system is located near the tip of chromosome 11, or closer to the centromere.
In studies on polymorphic DNA regions, discussed above, it has been possible heretofore to localize identified polymorphisms only in the relatively few chromosome regions for which marker probes have been available, such as in the MHC region. In principle, if a complete family of probes spaced evenly along the genome were available, it would be possible to screen individuals for inherited dominant or even recessive disorders, and by comparing many DNA polymorphic sites in the affected individuals and unaffected family members, to localize and derive markers (probes) closely linked with every disorder. This theory has been discussed previously (reference 15). Since there are approximately 3,000 centimorgans of recombination distance distributed along the human genome, 300 evenly spaced markers would provide a marker for every 10 centimorgans, and 600 markers, for every 5 centimorgans. With markers every 10 centimorgans, any disease locus would lie within about 5 centimorgans of the nearest marker, so the probability of recombination between such a polymorphic marker and the given disease marker would be less than 1 in 20 in each generation. This set of 300 or 600 markers would greatly facilitate localization and identification of the precise genetic effects in gene regions responsible for these inheritable effects.
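The marker-spacing arithmetic in the preceding paragraph can be checked with a short sketch. It assumes the approximate 3,000 centimorgan genome length quoted above and perfectly even spacing; the function names are illustrative only, and the Haldane map function (which assumes no crossover interference) is an assumption introduced here to convert map distance into a recombination probability, not something stated in the source.

```python
import math

# Approximate recombination length of the human genome, as quoted in the text.
GENOME_LENGTH_CM = 3000.0


def marker_spacing_cm(n_markers, genome_cm=GENOME_LENGTH_CM):
    """Interval between neighboring markers when they are evenly spaced."""
    return genome_cm / n_markers


def max_distance_to_marker_cm(n_markers, genome_cm=GENOME_LENGTH_CM):
    """Worst-case distance from a locus to the nearest marker (half an interval)."""
    return marker_spacing_cm(n_markers, genome_cm) / 2.0


def haldane_recombination_fraction(distance_cm):
    """Recombination fraction from map distance, using the Haldane map function.

    Assumption (not from the source): r = 0.5 * (1 - exp(-2d)), d in Morgans.
    """
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


if __name__ == "__main__":
    for n in (300, 600):
        spacing = marker_spacing_cm(n)
        worst = max_distance_to_marker_cm(n)
        r = haldane_recombination_fraction(worst)
        print(f"{n} markers: spacing {spacing:.1f} cM, "
              f"worst-case distance {worst:.1f} cM, recombination fraction {r:.4f}")
```

Under these assumptions, the worst-case recombination fraction for 300 markers falls just below 0.05, which is consistent with the "less than 1 in 20" figure given in the text.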
In order to generate such DNA probes for identifying polymorphic fragments by prior art techniques, many random DNA segments must be analyzed to see which ones provide polymorphic markers. Each one of these markers must be localized by the in situ hybridization technique described above, or by techniques involving hybridization and detection in a variety of somatic hybrid cell lines containing various human chromosomes or segments of chromosomes, or by hybridization to probes made from assorted chromosome libraries. The latter method is relatively inefficient due to the small amount of DNA that can be obtained in chromosomal sorting procedures. Statistical studies indicate that 900 or more probes would have to be examined in this way in order to obtain a 98% to 99% coverage of the human genome at the desired space intervals, a task that would be exceedingly difficult at best.
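Why randomly selected probes must greatly outnumber evenly spaced ones can be illustrated with a simple coverage model. The sketch below is an assumption-laden illustration, not the statistical analysis referred to above: it treats probe positions as independent and uniformly distributed over a 3,000 centimorgan genome, ignores edge effects, and counts a locus as covered if at least one probe lies within a chosen window on either side. The source does not state which model or window its statistical studies used, so the numbers from this sketch should not be expected to match the 900-probe figure.

```python
import math


def expected_coverage(n_probes, window_cm, genome_cm=3000.0):
    """Expected fraction of the genome within window_cm of at least one probe.

    Assumes independent, uniformly random probe positions and ignores edges.
    """
    # Probability that a single random probe lands within the window of a locus.
    p_hit = min(1.0, 2.0 * window_cm / genome_cm)
    return 1.0 - (1.0 - p_hit) ** n_probes


def probes_needed(target_coverage, window_cm, genome_cm=3000.0):
    """Smallest probe count whose expected coverage reaches target_coverage."""
    if not 0.0 < target_coverage < 1.0:
        raise ValueError("target_coverage must lie strictly between 0 and 1")
    p_hit = min(1.0, 2.0 * window_cm / genome_cm)
    if p_hit >= 1.0:
        return 1
    n = math.ceil(math.log(1.0 - target_coverage) / math.log(1.0 - p_hit))
    # Guard against floating-point rounding at the boundary.
    while expected_coverage(n, window_cm, genome_cm) < target_coverage:
        n += 1
    return n


if __name__ == "__main__":
    for window in (5.0, 10.0):  # hypothetical window sizes in centimorgans
        n = probes_needed(0.98, window)
        print(f"window {window} cM: about {n} random probes for 98% coverage")
```

The qualitative point is that random placement leaves gaps, so reaching near-complete coverage takes several times more probes than the 300 to 600 that ideal even spacing would require.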
Considering now investigations of gene organization in multi-gene arrangements on chromosomes, the best-studied example is the human MHC, which appears to contain at least 40 to 50 class I-like genes, and at least 12 to 20 class II-like genes or pseudo-genes. It is known that the MHC system is highly polymorphic from individual to individual, and that particular alleles of class I or class II genes are associated with a predisposition towards a wide variety of diseases (references 16, 17). The association of polymorphisms with particular disease states may be due to polymorphisms within the known genes of the MHC, or, alternatively, to polymorphisms in presently unidentified class I or class II-type genes, or possibly unrelated genes interspersed within the class I or class II system. Therefore, a complete characterization of all the genes contained within this cluster, and their linear relationship with one another, would make it possible to predict which genes are most likely to be closely associated with particular diseases.
A study of the relationship among genes in a gene cluster or family can lead to greater understanding of gene diversity, gene interaction, and even the identification of previously unrecognized gene products. It is known, for example, that at least two pituitary hormones are encoded by genes contained in a gene cluster. Mapping the genes in this cluster could uncover DNA sequences that are candidate genes for other known pituitary hormones, and also genes for hormone-like substances that have not been previously recognized, but which arose during evolution by tandem duplication of preexisting genes for hormones.
As another example, it is known that there are many interferon-like genes in a cluster for one of the interferon types; similar clusters for interleukin-2 and other lymphokine genes, as well as for colony stimulating factor and nerve growth factors, may be identified. Growth factors specific for several different cell types have been reported, and it is possible that by mapping genes clustered about the growth factor genes, genes encoding other colony-stimulating factors or the like can be identified.
Similarly, genes for additional coagulation factors, serum proteins, protease inhibitors, transcription or replication factors, cell membrane receptors, immunoglobulin variable or constant regions, and other cell type-specific surface antigens could well be identified by a practical method for surveying gene clusters.
The organization of genes within a gene family has been approachable, heretofore, generally at two levels of resolution. One is the resolution which can be obtained by classical studies of gene linkage during inheritance. As noted above, classical genetic techniques are unable to distinguish phenotypic markers located closer than about 5 to 10 million base pairs apart. The second level of resolution is that accessible by more recently developed recombinant DNA techniques. In a typical procedure, a genomic DNA insert which has been identified, for example, by hybridization with a selected gene probe, is characterized as to restriction sites and/or base sequence. Currently, the largest block of DNA that can be cloned intact is about 40 kilobases. The only method available in the prior art for extending the cloned sequence beyond this 40 kilobase limit is a technique known as chromosomal walking, in which the ends of the cloned insert are identified, radiolabeled, and used as probes to isolate, from a library of cloned DNA inserts, one or more inserts having a region of overlap with the end region(s) of the original insert. On average, the radiolabeled end probes will identify inserts whose regions of overlap lie near the midpoint of the overlapping inserts. This means that, for inserts of 40 kilobases, each additional insert isolated will extend the map region by only about 20 kilobases.
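The step-size arithmetic for chromosomal walking can be sketched as follows. This is an idealized illustration under the figures given above (40 kilobase inserts, about 20 kilobases gained per additional insert), assuming a walk in one direction from a single starting insert; the function names are hypothetical. Real walks will generally need more inserts than this idealized count, since overlaps vary and libraries are not ideal.

```python
import math

INSERT_KB = 40.0  # largest insert that can be cloned intact, per the text
STEP_KB = 20.0    # average map extension per additional insert, per the text


def walk_span_kb(n_inserts, insert_kb=INSERT_KB, step_kb=STEP_KB):
    """Idealized map length covered by n overlapping inserts in a one-way walk."""
    if n_inserts <= 0:
        return 0.0
    # The first insert contributes its full length; each later one adds a step.
    return insert_kb + (n_inserts - 1) * step_kb


def inserts_needed(target_kb, insert_kb=INSERT_KB, step_kb=STEP_KB):
    """Minimum number of inserts needed to span target_kb under the same model."""
    if target_kb <= 0:
        return 0
    if target_kb <= insert_kb:
        return 1
    return 1 + math.ceil((target_kb - insert_kb) / step_kb)


if __name__ == "__main__":
    for target in (200.0, 1000.0):
        n = inserts_needed(target)
        print(f"{target:.0f} kb: at least {n} inserts, "
              f"i.e. {n - 1} rounds of library screening after the first insert")
```

Under these assumptions, a 1,000 kilobase region would take at least 49 inserts, each extension requiring its own round of library screening and characterization.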
The chromosomal walking technique is obviously quite tedious, in that each extension of the map requires screening a genomic DNA library, characterizing the restriction endonuclease sites and/or sequence of the probe-identified insert to locate the new insert in the map, and may require producing new end probes. Further, if one or more of the probes which are used in the procedure are non-unique sequences, these in turn will select for more than one site and cause apparent branching in the map. The maximum map distance that has been achieved to date by this method is about 200 kilobases, in a molecular map of an immune response region of MHC, in which 18 overlapping inserts were identified (reference 12). This was a particularly favorable system, since several probes scattered through the cluster were available.
It is thus apparent that examining gene relationships in a gene region of up to 200 kilobases is generally difficult and uncertain by prior art methods; and neither classical-genetic nor prior art cloning techniques are suited to resolving gene relationships in the range between about 200 kilobases and up to several thousand kilobases.