Nucleic acid sequencing technology has evolved to the point where sequencing large blocks of DNA, whole chromosomes, and even entire complex genomes has become a reasonable objective. Such ambitious sequencing projects involve strategies which require, first, that the subject DNA be cleaved into smaller, more manageable fragments and, second, that these fragments be used to construct a map in which the positions of fragments in the parent molecule are identified.
The map of ordered fragments provides useful information even before any nucleotide sequence is obtained, because the position of genes along the map can be determined. In addition, areas of particular interest in the map can be identified and given priority for sequencing.
The mapping process begins by cutting the parent DNA into fragments and then creating "libraries" of cloned DNA fragments contained in replicable vector molecules. Suitable vectors include yeast artificial chromosomes ("YACs"), cosmids, bacteriophage, and plasmids.
Next, to obtain a map of the location of fragments in the parent DNA molecule, the fragments must be assembled in their proper order, a task which increases in complexity as the size of the parent DNA molecule grows larger.
The ordering of cloned fragments is rendered feasible by the fact that most libraries are "redundant", in that instead of only one parent DNA molecule being cut into pieces, several parent molecules are each cut into fragments at different points. As a result, a library contains overlapping fragments. By identifying fragments that overlap, the parent molecule can be pieced together.
Although reconstruction of the parent molecule by joining consecutive fragments may seem conceptually simple, in practice, particularly for large parent DNA molecules, it is a formidable process. There frequently are hundreds or thousands of cloned fragments which must be placed in order, and, for a number of reasons, false overlaps may be detected and true overlaps may be missed. A number of ordering methods have been designed, each bearing its own particular advantages and disadvantages.
One approach (Olson et al., 1986, Proc. Natl. Acad. Sci. U.S.A. 83:7826-7830) involves cutting the DNA of a cloned fragment with one or more enzymes, termed "restriction endonucleases", which cleave DNA at specific short subsequences. Each cloned fragment will give rise to a characteristic set of so-called "restriction fragments" of particular sizes; this set of restriction fragments is sometimes referred to as a "DNA fingerprint". The cloned fragments are ordered by comparing DNA fingerprints. If certain sizes of restriction fragments are shared among two or more clones, those clones are deemed likely to contain overlapping DNA sequence. Important shortcomings of this method lie, first, in the potential for error associated with determining the sizes of restriction fragments and, second, in the fact that identically sized fragments need not contain the same nucleotide sequence.
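The fingerprint comparison described above can be illustrated with a minimal sketch. It is not the procedure of Olson et al.; the function name `shared_fragments`, the 2% relative size tolerance, and the fragment sizes are all hypothetical values chosen for illustration. The tolerance stands in for the sizing error mentioned above.

```python
def shared_fragments(fp_a, fp_b, tolerance=0.02):
    """Count restriction fragment sizes in fp_a that match a size in fp_b.

    Two sizes match when they differ by no more than `tolerance`
    (a hypothetical relative measurement error) times the larger size.
    Each fragment in fp_b can be matched at most once.
    """
    remaining = sorted(fp_b)
    shared = 0
    for size in sorted(fp_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[i]  # do not reuse this fragment
                break
    return shared


# Hypothetical fingerprints: restriction fragment sizes in base pairs.
clone_a = [1200, 3400, 5100, 800]
clone_b = [3410, 5080, 800, 2300]
clone_c = [950, 4100, 6200]

print(shared_fragments(clone_a, clone_b))  # many shared sizes: likely overlap
print(shared_fragments(clone_a, clone_c))  # no shared sizes: no evidence of overlap
```

A matching size does not prove shared sequence, which is the second shortcoming noted above; the count is only evidence of a possible overlap.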
Another ordering strategy (Cuticchia et al., 1992, Genetics 132:591-601) orders cloned DNA fragments not by comparing restriction fragment sizes, but by comparing the ability of cloned DNA fragments to hybridize to a series of detectably labelled nucleic acid probes consisting of short pieces of DNA sequence. The ability or inability of a particular clone to hybridize to each member of a panel of different oligonucleotide probes is determined and digitally recorded. The results of hybridization between a single cloned fragment and a multitude of oligonucleotide probes give rise to a digital signature. Because the degree of overlap between two clones should be reflected by a similarity in their signatures, a number of clones may be ordered by quantitating, by statistical mechanics, and then minimizing the calculated differences in signatures across a reconstructed chromosome. This method, which requires an abundance of data and complicated calculations, is referred to as "simulated annealing".
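The idea of scoring a clone order by the differences between neighboring signatures can be sketched as follows. This is only an illustration of the quantity that a search such as simulated annealing would try to minimize, not the method of Cuticchia et al.; the use of Hamming distance, the function names, and the four signatures are assumptions.

```python
def hamming(sig_a, sig_b):
    """Number of probes on which two binary hybridization signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def ordering_cost(order, signatures):
    """Sum of signature differences between consecutive clones in `order`."""
    return sum(
        hamming(signatures[x], signatures[y])
        for x, y in zip(order, order[1:])
    )


# Hypothetical signatures: 1 = clone hybridizes to the probe, 0 = it does not.
signatures = {
    "c1": (1, 1, 0, 0, 0),
    "c2": (0, 1, 1, 0, 0),
    "c3": (0, 0, 1, 1, 0),
    "c4": (0, 0, 0, 1, 1),
}

good = ordering_cost(["c1", "c2", "c3", "c4"], signatures)
bad = ordering_cost(["c1", "c3", "c2", "c4"], signatures)
print(good, bad)  # the order with the lower cost is preferred
```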
Mott et al., 1993, Nucleic Acids Research 21:1965-1974 ("Mott") describes a heuristic filtering procedure which calculates the distance between probes (by statistical mechanics) to create a minimum-spanning subset of the probes. Mott's heuristic ordering procedure begins by identifying, for each probe, all neighboring probes (i.e., probes linked by jointly positive clones) and then orders the probes using the following algorithm. Starting with a randomly chosen probe that has not yet been fit into a set of ordered probes (which is to say it is "unordered"), either the most or, alternatively, the least distant neighbor of that probe is identified. Choosing the least distant neighbor builds a more detailed clone order, whereas choosing the most distant neighbor tends to identify a minimal set of probes connected by clones spanning large regions of DNA. If such a neighbor is found, all probes common to both neighborhoods are designated as "ordered", and the "neighbor" probe is then used as a new starting point to look for further most or least distant neighbors. Mott recognizes that false positives (such as those caused by repeated sequences) and false negatives can create problems in determining probe order. As a practical matter, when such inconsistent pieces of data are identified, they are simply ignored. Furthermore, a set of heuristic rules is applied to identify "suspect" data.
U.S. Pat. No. 5,219,726, entitled "Physical Mapping of Complex Genomes" by Evans, filed Jun. 2, 1989 and issued Jun. 15, 1993, discloses a method of chromosome mapping which involves the simultaneous analysis of multiple cosmid clones for the detection of overlaps. Overlaps between two clones are identified by cross-hybridization rather than by comparing patterns of restriction fragments or hybridization to a panel of probes. Because this strategy does not depend on pattern recognition for detecting overlaps, analysis may be carried out on cosmid clones grouped together. No particular algorithm for ordering is provided.
3. SUMMARY OF THE INVENTION
The present invention relates to a method of ordering DNA fragments in which binary overlap information, obtained by determining whether pairs of fragments hybridize to each other, is used to build a "spanning path" across a parent DNA molecule by successively attaching cloned fragments which overlap. The relative position of the many other, non-path fragments is not essential information, but can be determined. Means for eliminating false positives and false negatives are provided. The overlap relationships between fragments may be depicted as an interval graph. The method of the present invention is time efficient and is more successful at determining the correct order of fragments than prior art methods.
3.1. Definitions
Definition 1: Let $V$ be a finite set of intervals of a real line. Then $G(V)$ denotes the graph whose set of vertices is $V$, where $v_1, v_2 \in V$ are joined by an edge if, and only if, $v_1 \cap v_2 \neq \emptyset$. The graph $G(V)$ is called the interval graph ("IG") associated with $V$.
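As an illustration of this definition, the following minimal sketch builds the edge set of an interval graph from closed intervals given as (low, high) pairs. The function name, the representation of an edge as a `frozenset` of two vertex labels, and the example coordinates are assumptions made for the sketch.

```python
from itertools import combinations


def interval_graph(intervals):
    """Return the edges of G(V) for a dict mapping vertex label -> (low, high).

    Two closed intervals intersect when the larger of the two low ends
    does not exceed the smaller of the two high ends.
    """
    edges = set()
    for (a, (a_lo, a_hi)), (b, (b_lo, b_hi)) in combinations(intervals.items(), 2):
        if max(a_lo, b_lo) <= min(a_hi, b_hi):
            edges.add(frozenset((a, b)))
    return edges


# Hypothetical intervals on the real line.
intervals = {"a": (0, 4), "b": (3, 7), "c": (6, 10), "d": (12, 15)}
edges = interval_graph(intervals)
print(sorted(sorted(e) for e in edges))  # a-b and b-c overlap; d is isolated
```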
Definition 2: Let $P$ be a subset of $V$. Then $G(V,P)$ denotes the graph whose set of vertices is $V$, where $v_1, v_2 \in V$ are joined by an edge if, and only if, the following two conditions are satisfied:
(i) $v_1 \cap v_2 \neq \emptyset$
(ii) $v_1 \in P$ or $v_2 \in P$
The elements of P are called p-vertices. The graph G(V,P) is called the probe interval graph ("PIG") associated with the pair (V,P). Note that G(V,V)=G(V), hence an interval graph may be regarded as a special kind of probe interval graph.
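The two conditions of Definition 2 can be sketched in a few lines. The function name, the edge representation, and the example intervals are hypothetical; the final check illustrates the note that G(V,V) equals G(V).

```python
from itertools import combinations


def probe_interval_graph(intervals, probes):
    """Return the edges of G(V,P).

    `intervals` maps vertex label -> (low, high); `probes` is the set P.
    An edge requires (i) intersecting intervals and (ii) at least one
    endpoint being a p-vertex.
    """
    edges = set()
    for (a, (a_lo, a_hi)), (b, (b_lo, b_hi)) in combinations(intervals.items(), 2):
        intersects = max(a_lo, b_lo) <= min(a_hi, b_hi)   # condition (i)
        has_probe = a in probes or b in probes            # condition (ii)
        if intersects and has_probe:
            edges.add(frozenset((a, b)))
    return edges


# Hypothetical intervals: a-b, b-c and c-d intersect.
intervals = {"a": (0, 4), "b": (3, 7), "c": (6, 10), "d": (9, 12)}

pig = probe_interval_graph(intervals, {"a", "d"})
ig = probe_interval_graph(intervals, set(intervals))  # P = V gives G(V)
print(len(pig), len(ig))  # the b-c overlap is invisible when neither is a probe
```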
Definition 3: A path $\sigma$ along either an interval or probe interval graph is a finite sequence of vertices, $v_1, v_2, \ldots, v_n$, such that $v_i$ is connected to $v_{i+1}$ by an edge for all $i$ such that $1 \leq i \leq n-1$. The path $\sigma$ is said to be a simple path if $v_i \neq v_j$ for $i \neq j$. That is, in a simple path a vertex cannot be repeated.
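A short sketch of the path and simple path tests follows. The function names and the small edge set are assumptions made for illustration; edges are represented as `frozenset` pairs of vertex labels.

```python
def is_path(vertices, edges):
    """True if every consecutive pair of vertices is joined by an edge."""
    return all(
        frozenset((u, v)) in edges
        for u, v in zip(vertices, vertices[1:])
    )


def is_simple_path(vertices, edges):
    """True if `vertices` is a path in which no vertex is repeated."""
    return is_path(vertices, edges) and len(set(vertices)) == len(vertices)


# Hypothetical graph with edges a-b and b-c.
edges = {frozenset(("a", "b")), frozenset(("b", "c"))}

print(is_simple_path(["a", "b", "c"], edges))  # a path with no repeats
print(is_simple_path(["a", "b", "a"], edges))  # a path, but vertex a repeats
print(is_path(["a", "c"], edges))              # a and c are not joined
```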
Definition 4: Given the sets of intervals V and V*, then V* is a cover set of V if the union of the intervals in set V is a subset of the union of the intervals in set V*. For example, the set of vertices {36, 19, 28} is a cover set of the vertex {21} (FIG. 1).
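The cover set test can be sketched by merging each set of closed intervals into disjoint pieces and checking containment. The coordinates below are hypothetical and are not taken from FIG. 1; the function names are likewise assumptions.

```python
def merge(intervals):
    """Merge closed (low, high) intervals into sorted, disjoint pieces."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)  # extend the current piece
        else:
            merged.append([lo, hi])                 # start a new piece
    return merged


def is_cover_set(v_star, v):
    """True if the union of `v` is a subset of the union of `v_star`."""
    cover = merge(v_star)
    return all(
        any(c_lo <= lo and hi <= c_hi for c_lo, c_hi in cover)
        for lo, hi in merge(v)
    )


# Hypothetical intervals: three overlapping intervals cover a fourth.
v_star = [(0, 5), (4, 9), (8, 14)]
print(is_cover_set(v_star, [(3, 10)]))          # (3, 10) lies inside (0, 14)
print(is_cover_set(v_star, [(3, 16)]))          # (3, 16) extends past 14
print(is_cover_set([(0, 5), (7, 9)], [(4, 8)])) # the gap between 5 and 7 is uncovered
```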
Definition 5: Let $\sigma = v_1, v_2, \ldots, v_n$ be a simple path on a probe interval graph, $G(V,P)$. Let $W = \{w_1, w_2, \ldots, w_m\} \subset V$ be the set of all vertices which have a non-empty intersection with either $v_1$ or $v_n$ (the boundaries). The expanded union of $\sigma$ is defined to be the union of all intervals represented by the vertices contained in $W \cup \sigma$, that is, the expanded union of $\sigma = w_1 \cup w_2 \cup \ldots \cup w_m \cup v_1 \cup v_2 \cup \ldots \cup v_n$. Thus, in FIG. 1, $\sigma$ could be the path {19, 27, 30}, and therefore the expanded union of $\sigma$ includes the overlaps with the vertices {21, 23, 28, 32}. Thus the expanded union is {19, 21, 23, 27, 28, 30, 32}.
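The set of vertices contributing to an expanded union can be sketched as follows. The intervals are hypothetical and do not reproduce FIG. 1; the function names are assumptions. The sketch tests intersection on the intervals themselves, as the definition states.

```python
def overlaps(a, b):
    """True if two closed (low, high) intervals intersect."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def expanded_union_vertices(path, intervals):
    """Return the vertex labels whose intervals make up the expanded union.

    These are the path vertices plus every vertex that intersects the
    first or the last vertex of the path (the boundaries).
    """
    first, last = intervals[path[0]], intervals[path[-1]]
    boundary = {
        name
        for name, iv in intervals.items()
        if overlaps(iv, first) or overlaps(iv, last)
    }
    return boundary | set(path)


# Hypothetical intervals; f lies far from the path and is not included.
intervals = {
    "a": (0, 3), "b": (2, 6), "c": (5, 9),
    "d": (8, 12), "e": (11, 14), "f": (20, 22),
}
result = expanded_union_vertices(["b", "c", "d"], intervals)
print(sorted(result))
```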
Definition 6: Given a probe interval graph G(V,P), a simple path from one p-vertex to another p-vertex is a spanning path ("SP"), that is, a path spanning the line as described by G, if the expanded union of this path is a cover set of V, that is, if it includes the boundary vertices.
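Definitions 2 through 6 can be combined into a single check, sketched below under stated assumptions: intervals are closed (low, high) pairs with known coordinates, which would not be available in a real mapping experiment, and all names and values are hypothetical. The sketch only verifies whether a given path is a spanning path; it does not construct one.

```python
def overlaps(a, b):
    """True if two closed (low, high) intervals intersect."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def merge(intervals):
    """Merge closed intervals into sorted, disjoint pieces."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged


def covers(cover, target):
    """True if the union of `target` is a subset of the union of `cover`."""
    pieces = merge(cover)
    return all(
        any(c_lo <= lo and hi <= c_hi for c_lo, c_hi in pieces)
        for lo, hi in merge(target)
    )


def is_spanning_path(path, intervals, probes):
    """Check Definition 6 for `path` on G(V,P)."""
    if not path or path[0] not in probes or path[-1] not in probes:
        return False                      # must run from p-vertex to p-vertex
    if len(set(path)) != len(path):
        return False                      # must be a simple path
    for u, v in zip(path, path[1:]):      # consecutive vertices joined in G(V,P)
        if not (overlaps(intervals[u], intervals[v]) and (u in probes or v in probes)):
            return False
    first, last = intervals[path[0]], intervals[path[-1]]
    expanded = [
        iv for name, iv in intervals.items()
        if name in path or overlaps(iv, first) or overlaps(iv, last)
    ]
    return covers(expanded, list(intervals.values()))


# Hypothetical clones laid end to end with overlaps.
intervals = {"a": (0, 3), "b": (2, 6), "c": (5, 9), "d": (8, 12), "e": (11, 14)}

print(is_spanning_path(["b", "c", "d"], intervals, {"b", "d"}))   # spans a through e
print(is_spanning_path(["b", "c"], intervals, {"b", "c", "d"}))   # e is left uncovered
```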