Linkage mapping of genes involved in disease susceptibility and other traits in humans, animals and plants has in recent years become one of the most important engines of progress in biology and medicine. The development of polymorphic DNA markers as landmarks for linkage mapping has been a major factor in this advance. However, current methods that rely on these markers for linkage mapping in humans are laborious, allowing only a few markers to be screened at a time. Furthermore, their power is limited by the sparsity of highly informative markers in many parts of the human genome.
Genomic mismatch scanning (GMS) is a positional cloning strategy that has no requirement for conventional polymorphic markers or gel electrophoresis. It isolates fragments of identity-by-descent (IBD) between two related individuals based on the formation of extensive mismatch-free hybrid molecules. The GMS technique is described in U.S. Pat. No. 5,376,526, and is illustrated in FIG. 1 of the drawings accompanying this specification.
Dam methylation of one sample prior to hybridisation permits discrimination of homohybrid duplexes by virtue of methylation-sensitive restriction endonucleases that cleave only fully methylated or fully unmethylated DNA. The MutHLS methyl-directed mismatch repair proteins cleave mismatched heteroduplexes on the unmethylated strand. Except for mismatch-free heterohybrid molecules, all DNA is eliminated by a combination of Exonuclease III digestion and physical separation of single-stranded DNA using binding columns. The selected molecules are amplified by inter-Alu PCR using combinations of generic primers, and subsequently identified by hybridisation to an ordered array of DNA samples representing intervals of the genome.
Because natural polymorphisms occur on average once every several hundred bp i.e. at least once every 1000 bp, heterohybrids that are several kilobases in length and mismatch-free are likely to be IBD. Similarly, non-IBD alleles in sufficiently large heteroduplexes are likely to contain one or more mismatches and will be cleaved by the mismatch repair proteins.
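The reasoning above can be made concrete with a rough calculation. The following is a minimal sketch, assuming (hypothetically) that sequence differences between two non-IBD alleles occur independently at a constant average spacing, so that their number within a fragment follows a Poisson distribution. The function name and the 500 bp spacing are illustrative choices and are not taken from the GMS method itself.

```python
import math


def prob_mismatch_free(fragment_len_bp: float, mean_spacing_bp: float) -> float:
    """Chance that two non-IBD alleles show no difference over a fragment.

    Assumes differences are Poisson-distributed with an average of one
    difference every `mean_spacing_bp` base pairs (an illustrative model).
    """
    return math.exp(-fragment_len_bp / mean_spacing_bp)


# Illustrative fragment lengths; 500 bp spacing stands in for
# "once every several hundred bp".
for length in (500, 2000, 5000):
    p = prob_mismatch_free(length, mean_spacing_bp=500)
    print(f"{length} bp: P(mismatch-free by chance) = {p:.5f}")
```

Under this assumption the chance that two non-IBD alleles are mismatch-free falls off exponentially with fragment length, which is consistent with the statement that mismatch-free heterohybrids several kilobases in length are likely to be IBD.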
The IBD maps from multiple pairs of affected relatives are combined and the resulting composite map searched for loci where genotypic concordance occurs more frequently than would be expected by chance. These loci represent candidate regions that may harbour the target mutation(s).
The reliability of the technique is dictated by the recovery of DNA from a locus when the two genomes share an allele IBD, relative to the recovery from that locus when the two genomes are not IBD. For analyses involving human genomic DNA, enrichment by a factor of 1-2 for 50% of IBD fragments and by a factor of 2-5 for 35% of IBD fragments has been reported. Only 15% of IBD fragments are reported to be enriched by a factor of greater than 5. Furthermore, the yield of DNA after GMS selection is very poor, such that amplification of the selected fragments prior to hybridisation to the array is required.
It is one object of the present invention to provide novel methods of performing genetic analysis to obtain enrichment of fragments of IBD. Thus in one aspect this invention provides a method of performing genomic analysis by:
a) digesting genomic DNA to be compared from two different sources to provide genomic fragments whose average length is greater than the average spacing between natural polymorphisms;
b) combining under hybridisation conditions single strands of the genomic fragments from the two sources;
c) separating heterohybrids from homohybrids; and
d) separating mismatch-free heterohybrids from hybrids with mismatches;
which method comprises ligating an adapter to each end of each genomic fragment produced in step a), said adapter being, in double-stranded mismatch-free form, resistant to nuclease digestion.
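The selection logic of steps c) and d) can be sketched as a simple filter. This is a minimal illustration only, assuming each annealed duplex can be described by the source of each strand and by whether it contains a mismatch; the `Duplex` record, the locus labels and the function name are hypothetical and do not model the enzymatic chemistry.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Duplex:
    locus: str                       # hypothetical label for a genomic interval
    strand_sources: Tuple[str, str]  # which individual each strand came from
    has_mismatch: bool               # True if the duplex contains any mismatch


def select_mismatch_free_heterohybrids(duplexes: List[Duplex]) -> List[Duplex]:
    # Step c): keep heterohybrids (the two strands come from different sources).
    heterohybrids = [
        d for d in duplexes if d.strand_sources[0] != d.strand_sources[1]
    ]
    # Step d): of those, keep only the duplexes without mismatches.
    return [d for d in heterohybrids if not d.has_mismatch]


annealed = [
    Duplex("locus_1", ("A", "A"), False),  # homohybrid: removed at step c)
    Duplex("locus_1", ("A", "B"), False),  # mismatch-free heterohybrid: kept
    Duplex("locus_2", ("A", "B"), True),   # mismatched heterohybrid: removed at step d)
    Duplex("locus_2", ("B", "B"), False),  # homohybrid: removed at step c)
]
selected = select_mismatch_free_heterohybrids(annealed)
print([d.locus for d in selected])
```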
The method involves comparing genomic DNA from two different sources, generally two different viral or prokaryote or eukaryote (e.g. human, animal or plant) individuals who share a particular phenotype which may have been acquired from a common ancestor. Phenotypes are observable or measurable characteristics displayed by an organism under a particular set of environmental and/or genetic influences. Hybridisation conditions may depend on the genomic fragments being analysed and will be well known to the skilled reader. As noted, natural polymorphisms occur in human genomic DNA on average once every several hundred bp i.e. at least once every 1000 bp. The genomic DNA of the two individuals to be compared is cut into fragments that are in general longer than this. Thus each genomic fragment contains on average one or more polymorphisms. This may be effected by use of a restriction enzyme (or two or more restriction enzymes) that cuts relatively infrequently. Suitable restriction enzymes include those of type II and also those of type IIS. It would alternatively be possible to effect restriction of the genomic DNA by physical or chemical as opposed to enzymatic means.
An adapter is ligated to each end of each fragment. An adapter is an at least partly double-stranded polynucleotide, generally oligonucleotide, having if required an overhang complementary to the overhang generated by the restriction enzyme. Alternatively, both the fragments and the adapter may have blunt ends for ligation. The adapters may comprise oligonucleotides of an arbitrary sequence that does not render them prone to secondary structure, liable to hinder efficient ligation, amplification, or selection on the basis of mismatch discrimination. Primers, comprising all or part of an adapter sequence, used for amplifying DNA under analysis, are further examined to ensure non-specific amplification is avoided. When in double-stranded mismatch-free form, the adapter is resistant to nuclease digestion, that is to say more resistant than is ordinary DNA. Such resistance can be conferred by providing modified internucleotide linkages e.g. phosphorothioate or methylphosphonate linkages, or by the use of nucleotide analogues that confer nuclease resistance. Preferably however a first adapter ligated to fragments of genomic DNA from the first source contains a mismatch; and a second adapter ligated to fragments of genomic DNA from the second source also contains a mismatch; the two adapters being so designed that the forward strand of one adapter will hybridise to the backward strand of the other adapter to form a mismatch-free heterohybrid. A heterohybrid comprises two strands from different individuals and is contrasted with a homohybrid which comprises two strands from the same individual. The two systems are described in more detail below with reference to FIGS. 2 and 3 of the accompanying drawings, in which:
FIG. 2 shows the use of two adapters each having a mismatch within a section comprising phosphorothioate linkages; and
FIG. 3 shows the use of two different adapters each having a mismatch outside a section having phosphorothioate linkages.
The modified method for affected-pair analyses involves restriction digestion of both genomic DNA samples and ligation of adapter sequences to each. These adapters contain mismatched regions that persist after hybridisation in the homoduplex molecules. By contrast, the adapter sequences are fully complementary in heteroduplexes. Subsequent use of a mismatch recognition protein, e.g. T4 endonuclease VII, and nuclease digestion results in the elimination of all molecules possessing mismatches. Mismatch-free heteroduplex molecules are resistant to digestion, e.g. due to the inclusion in the adapter sequences of phosphorothioate or methylphosphonate linkages that confer protection. These molecules can be amplified efficiently and conveniently with a single primer pair prior to analysis as discussed below.
The ligation of adapters to all fragment ends provides a convenient opportunity to selectively digest homohybrid molecules that are produced by hybridisation of the two DNA samples, and to amplify efficiently the enriched fragments with an appropriate adapter primer. The presence of phosphorothioate or methylphosphonate linkages, or other inhibitory features, at the adapter's ends provides protection against nuclease digestion. The adapter sequences are designed judiciously to be fully complementary on formation of heterohybrid molecules. In homohybrid molecules, however, the mismatch persists. Strand cleavage of the mismatch at a position proximal to the phosphorothioate or methylphosphonate linkages creates vulnerability to subsequent nuclease digestion and culminates in the elimination of the homohybrid molecules. Phosphorothioate or methylphosphonate protection in heterohybrid molecules persists, however, since strand cleavage does not occur in the absence of a mismatch.
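The adapter design principle described above (a mismatch within each adapter as ligated, but full complementarity between the forward strand of one adapter and the backward strand of the other) can be checked with a short script. This is a minimal sketch under stated assumptions: the 12-nucleotide sequences are hypothetical, the strands are treated as equal-length strings, and overhangs and nuclease-resistant linkages are not modelled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(forward: str, backward: str) -> int:
    """Mismatched positions when `backward` anneals antiparallel to `forward`."""
    if len(forward) != len(backward):
        raise ValueError("strands must be the same length in this simple model")
    return sum(a != b for a, b in zip(forward, reverse_complement(backward)))


# Hypothetical forward strands (5'->3') that differ at a single position.
forward_1 = "GTACCTGAGTCA"
forward_2 = "GTACCTCAGTCA"
# Each backward strand is fully complementary to the OTHER adapter's forward strand.
backward_1 = reverse_complement(forward_2)
backward_2 = reverse_complement(forward_1)

pairings = {
    "adapter 1 as ligated (homohybrid end)": (forward_1, backward_1),
    "adapter 2 as ligated (homohybrid end)": (forward_2, backward_2),
    "forward 1 with backward 2 (heterohybrid end)": (forward_1, backward_2),
    "forward 2 with backward 1 (heterohybrid end)": (forward_2, backward_1),
}
for name, (fwd, bwd) in pairings.items():
    print(f"{name}: {count_mismatches(fwd, bwd)} mismatch(es)")
```

In this toy design the adapter ends of homohybrids retain one mismatch, whereas the adapter ends of heterohybrids are mismatch-free, which is the property the selection relies on.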
A number of types of mismatched adapter would be appropriate for this purpose and include ‘Y’ shaped adapters with non-complementary ends (FIG. 2), and adapters with one or more mismatched nucleotides at a position along the adapter's length (FIG. 3). In the former case, a single strand specific endonuclease may be used to achieve strand cleavage, while T4 endonuclease VII would cleave the mismatch in the latter case. If a 3′-5′ exonuclease is used subsequently to digest the cleaved molecules, oligonucleotide phosphorylation is necessary to ensure that both adapter strands form covalent bonds with each genomic fragment. However, if a 5′ to 3′ exonuclease is employed this may not be necessary. The use of mismatched adapters for selective elimination of homohybrid duplexes as an inherent feature of the mismatch discrimination procedure obviates the need for dam methylation of one genomic sample and subsequent digestion of the hybrid molecules by methylation sensitive restriction enzymes.
Strand scission by the MutHLS mismatch recognition proteins (as used in U.S. Pat. No. 5,376,526) has an absolute requirement for at least one (GATC) site within the mismatched duplex that should be at least 150 base pairs from the fragment end to achieve maximal activity. Only the unmethylated strand is cleaved in a hemimethylated duplex, and the efficiency of this depends on the nature of the mismatch and the context of the surrounding sequence. The enzyme system fails to recognise C•C mismatches and insertion/deletion loops of more than four nucleotides. By contrast, T4 endonuclease VII is a mismatch recognition protein that is capable of discriminating all single base mismatches as well as insertion/deletion loops of all sizes. Fragments up to 4 kbp have been digested successfully and maximal efficiency of cleavage is achieved when the mismatch is separated from a fragment end by at least nine nucleotides. Suitable buffers include Tris, pH 8, and more preferably phosphate buffers. Although sequence context and the nature of the mismatch also affect the efficiency of T4 endonuclease VII digestion, significant benefits may be achieved by replacement of the MutHLS proteins with this enzyme. Other mismatch recognition/repair proteins may be suitable including Cel1 and T7 endonuclease I. The choice of methods for separation of mismatched fragments from matched fragments is not limited to the use of enzymes, but may also be accomplished by chemical or physical means.
It is likely that elimination of cleaved duplexes by nuclease digestion will be more efficient than relying on their physical separation with single stranded DNA binding columns. One or more enzymes that provide single-strand specific endonuclease activity and either 5′-3′ or 3′-5′ exonuclease activity may be appropriate. In addition, since T4 endonuclease VII may in some circumstances create single strand scission, it is important that the exonuclease is active at a nick. Furthermore, in order to preserve the heteroduplex molecules, the exonuclease must be inhibited by phosphorothioate or other modified linkages. Suitable candidates for use either singly or in combination include, but are not limited to, Bal31 nuclease, S1 nuclease, Mung bean nuclease, T7 gene 6 exonuclease, Exonuclease III and the 3′-5′ exonuclease activity of polymerases, such as T4 DNA polymerase.
Identification of candidate disease loci using the existing GMS method typically requires the analysis of more than 200 affected pairs and the hybridisation of the enriched fragments to an array of genomic clones. The candidate region is determined by scrutiny of the composite map of enriched fragments, constructed from the cumulative data of all affected-pair analyses, and identification of regions where genotypic concordance occurs more frequently than would be expected by chance.
The need for the numerous separate pair-wise analyses and subsequent hybridisation steps could be avoided if a large number of affected individuals was analysed en masse. Accurate diagnosis of phenotype would be an important preliminary step. However, provided that the same sequence variant was common to all, e.g. because all, or the majority, of the affected individuals had acquired their phenotype through common ancestry, a candidate region could be identified in a single analysis.
It is another object of this invention to meet this need. In this aspect the invention provides a method of performing genomic analysis by:
i) providing genomic DNA, pooled from a plurality of individuals that share a phenotype;
ii) digesting the genomic DNA to provide genomic fragments whose average length is greater than the average spacing between natural polymorphisms;
iii) ligating an adapter to each end of each genomic fragment produced in step ii), said adapter being, when in double-stranded mismatch-free form, resistant to nuclease digestion;
iv) denaturing and re-annealing the mixture of adapter-terminated genomic fragments produced in step iii);
v) removing from the mixture produced in step iv) hybrids containing mismatches and if required amplifying mismatch-free hybrids;
vi) and repeating steps iv) and v) to recover one or a few mismatch-free hybrids associated with the phenotype.
Reference is directed to the accompanying FIG. 4 which is a diagram showing this technique.
A suitable protocol involves the pooling of genomic DNA samples of affected individuals, e.g. of presumed common ancestry, and restriction digestion of the genome pool. A single adapter, comprising complementary oligonucleotides that confer phosphorothioate or methylphosphonate or other protection, is ligated to all fragments prior to denaturation and re-annealing of the pool. Provided that a large number of individuals contributed to the pool, most fragments will form heteroduplexes on hybridisation. Mismatched molecules are eliminated by use of a mismatch repair protein, e.g. T4 endonuclease VII, and nuclease digestion. The remaining molecules are amplified using a single primer appropriately designed to complement the adapter sequence. The amplified products are subjected to reiterated rounds of mismatch discrimination, resulting in depletion of mismatched heteroduplex molecules and enhanced enrichment of IBD fragments. The number of cycles may depend on the number and similarity (or relatedness) of the individuals involved. Finally, the selected fragments may be analysed further, e.g. by hybridisation to reference sequences of nucleic acid. Alternatively, if the enrichment of IBD fragments by reiterated mismatch discrimination is sufficient to effectively exclude all non-informative fragments, the selected molecules may be directly cloned and sequenced. In addition to eliminating the need for multiple affected-pair analyses, therefore, the requirement for an array of genomic clones would be abolished.
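The effect of reiterated rounds of mismatch discrimination can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that every round enriches IBD fragments over non-IBD fragments by the same constant factor and that amplification between rounds introduces no bias; the function name, the starting fraction of 1% and the factor of 5 are hypothetical values, not measured results.

```python
def ibd_fraction_after_rounds(
    initial_ibd_fraction: float, enrichment_per_round: float, rounds: int
) -> float:
    """Fraction of surviving fragments that are IBD after repeated selection.

    Works on the odds (IBD : non-IBD), which are multiplied by the assumed
    enrichment factor once per round.
    """
    if not 0.0 < initial_ibd_fraction < 1.0:
        raise ValueError("initial_ibd_fraction must lie strictly between 0 and 1")
    if enrichment_per_round <= 0 or rounds < 0:
        raise ValueError("enrichment must be positive and rounds non-negative")
    odds = initial_ibd_fraction / (1.0 - initial_ibd_fraction)
    odds *= enrichment_per_round ** rounds
    return odds / (1.0 + odds)


# Hypothetical example: 1% of fragments IBD at the start, 5-fold enrichment per round.
for n in range(5):
    print(n, round(ibd_fraction_after_rounds(0.01, 5.0, n), 4))
```

Under these assumptions the IBD fraction rises with each cycle, which is why the number of cycles needed may depend on how modest the per-round enrichment is for a given pool.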
In another aspect, this invention provides a set of four oligonucleotides, wherein each oligonucleotide of the set: is complementary to a first other oligonucleotide of the set and forms therewith a hybrid that is resistant to nuclease digestion; and is substantially complementary to a second other oligonucleotide of the set. Preferably each oligonucleotide comprises one or more modified internucleotide linkages selected from phosphorothioate and methylphosphonate.
In another aspect, this invention provides a kit for performing a method as defined, which kit comprises this set of four oligonucleotides together with a ligase and a nuclease.
Using the original GMS method, large tracts of identical-by-descent DNA can be enriched. Considerable effort is required subsequently to analyse these candidate sequences and identify any sequence variants that they may contain. The larger the candidate sequences, the greater is the effort required to scrutinise them for sequence variants. A method that generates very short candidate sequences, therefore, will provide considerable advantage. Moreover, the method would be especially suited to the analysis of all sequence differences in both DNA and RNA.
It is another object of this invention to meet this need. In this aspect the invention provides a method of performing genomic analysis by:
i) providing first nucleic acid, pooled from a plurality of individuals that share a phenotype;
ii) digesting the said first nucleic acid to provide fragments whose average length is about equal to or less than the average spacing between natural polymorphisms;
iii) ligating an adapter to each end of each fragment produced in step ii) to form adapter-terminated nucleic acid fragments which are, when in double-stranded mismatch-free form, resistant to nuclease digestion;
iv) denaturing and re-annealing the mixture of adapter-terminated nucleic acid fragments produced in step iii);
v) removing from the mixture produced in step iv) hybrids containing mismatches and if required amplifying mismatch-free hybrids;
vi) repeating steps iv) and v) to recover a first mixture of mismatch-free hybrids;
vii) providing second nucleic acid pooled from a plurality of individuals that do not share the same phenotype;
viii) subjecting the nucleic acid of vii) to the said steps ii) to vi) to recover a second mixture of mismatch-free hybrids;
ix) combining under hybridisation conditions single strands of the said first mixture of mismatch-free hybrids and the said second mixture of mismatch-free hybrids;
x) and recovering nucleic acid fragments that do not form mismatch-free hybrids and are associated with the phenotype.
Step ii) may be effected by the use of at least one restriction enzyme that cuts relatively frequently. Thus the majority of fragments will not contain any natural polymorphism.
Reference is directed to FIG. 5 which is a diagram showing this technique.
If the genomes of affected individuals are restricted with one or more enzymes that cleave nucleic acid frequently, a pool of very short fragments will result. The number of fragments generated in this way will exceed the total number of polymorphic sequences within the genome. As such, when dissociated and allowed to re-anneal, most fragments will form perfectly matched heteroduplex molecules. It is preferred to have as close to one polymorphism per restriction fragment as achievable, but preferably no more. With smaller fragments, the proportion of identical fragments increases; these contribute ‘noise’ from which the informative fragments mismatched between the two pools must be differentiated. With larger fragments, the proportion of fragments with more than one polymorphism increases, and hence so does the likelihood of losing fragments that contain the informative sequence change, because the neighbouring polymorphism(s) in the same fragments may not be identical in the pool of individuals.
A single adapter, containing phosphorothioate or methylphosphonate linkages to provide protection against nuclease digestion, is ligated to all nucleic acid fragments. These fragments are dissociated and re-annealed, and mismatched molecules are cleaved by a mismatch repair protein, e.g. T4 endonuclease VII. The cleaved molecules are eliminated by one or more nucleases that provide endonuclease and 5′ to 3′ or 3′ to 5′ exonuclease activities. This process of strand dissociation and re-annealing, followed by mismatch discrimination using T4 endonuclease VII and appropriate nucleases, is reiterated.
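The trade-off in fragment size can be illustrated numerically. The following is a minimal sketch, assuming (hypothetically) that polymorphisms are scattered independently along the genome so that the number per fragment is Poisson-distributed; the function name and the 500 bp spacing are illustrative assumptions.

```python
import math
from typing import Tuple


def polymorphism_classes(
    fragment_len_bp: float, mean_spacing_bp: float
) -> Tuple[float, float, float]:
    """Return (P(no polymorphism), P(exactly one), P(more than one)).

    Assumes a Poisson model with mean fragment_len_bp / mean_spacing_bp.
    """
    mean_count = fragment_len_bp / mean_spacing_bp
    p_none = math.exp(-mean_count)
    p_one = mean_count * math.exp(-mean_count)
    return p_none, p_one, 1.0 - p_none - p_one


# Illustrative comparison around an assumed 500 bp polymorphism spacing.
for length in (125, 500, 2000):
    p_none, p_one, p_many = polymorphism_classes(length, mean_spacing_bp=500)
    print(f"{length} bp: none={p_none:.3f} one={p_one:.3f} more={p_many:.3f}")
```

In this model short fragments are dominated by the polymorphism-free class (the ‘noise’), long fragments by the class with more than one polymorphism, and the share with exactly one polymorphism is greatest when the fragment length is close to the average spacing.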
Nucleic acid of wild type individuals is pooled, restricted, ligated to adapters and subjected to reiterated mismatch discrimination, in a similar manner to that of the affected individuals. In each separate pool, therefore, only fragments that contain sequences common to all individuals in the pool should persist.
The enriched fragments of the affected pool are hybridised to an excess of the enriched fragments of the wild type pool. Provided that the individuals contributing nucleic acid to each pool were taken from the same population e.g. who share the same ethnic origin, the vast majority of fragments should form perfectly matched duplexes. Only the fragment that harbours the causative mutation distinguishing the phenotypes should form a mismatched duplex on hybridisation. These mismatched molecules are selected. Completion of the protocol, therefore, culminates in very short genomic fragments potentially containing the sequence variant of interest. These selected fragments can then be analysed with relative ease, e.g. by hybridisation to reference sequences of nucleic acid, to identify the informative sequence change.
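The final comparison between the two enriched pools can be sketched as follows. This is an illustration only, assuming each enriched pool can be summarised as one surviving sequence per fragment and that hybridisation between pools can be approximated by a direct string comparison; the fragment labels, sequences and function name are hypothetical.

```python
from typing import Dict, Tuple


def select_candidate_fragments(
    affected: Dict[str, str], wild_type: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return fragments whose affected-pool sequence differs from wild type.

    Fragments that match would form perfectly matched duplexes and are
    discarded; differing fragments model the mismatched duplexes selected.
    """
    candidates = {}
    for fragment, affected_seq in affected.items():
        wild_seq = wild_type.get(fragment)
        if wild_seq is not None and affected_seq != wild_seq:
            candidates[fragment] = (affected_seq, wild_seq)
    return candidates


# Hypothetical enriched pools: only frag_2 differs between the phenotypes.
affected_pool = {"frag_1": "ACGTTGCA", "frag_2": "GGATCCTA", "frag_3": "TTAGGCAT"}
wild_type_pool = {"frag_1": "ACGTTGCA", "frag_2": "GGATCGTA", "frag_3": "TTAGGCAT"}
print(select_candidate_fragments(affected_pool, wild_type_pool))
```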
The methods described above are preferably carried out with genomic DNA that represents part or all of a genome. A genomic subset may be generated for analysis by one of a number of approaches known to the skilled individual including, but not limited to, selective amplification by techniques such as interAlu-PCR, or confining an analysis to fragments, produced by restriction enzyme digestion, that lie within a predefined size range. A fraction of the genome to be analysed may be selected on the basis of expression in tissues of interest. In this instance mRNA may first be converted to cDNA using conventional methods prior to analysis as described above. Alternatively, RNA may be subjected to analysis as described above.
Cloning and sequencing is the preferred method for analysing sequences that remain at the end of the methods. It is, however, also possible to perform this analysis by hybridisation to reference sequences of nucleic acid including genomic DNA, cDNA or oligonucleotide representations thereof. Examples include hybridisation to arrays of nucleic acid sequences comprising BAC or cDNA clones, oligonucleotides or chromosomes (see Boyle, et al. (1990) Genomics 7:127-130; Lichter, et al. (1990) Proc. Natl. Acad. Sci. USA 87:6634-6638; Schena, et al. (1995) Science 270:467-470; Lockhart, et al. (1996) Nature Biotechnology 14:1675-1680).