This invention relates to the construction of ordered restriction maps and, more particularly, to a method capable of producing high-resolution, high-accuracy maps rapidly and in a scalable manner.
Optical mapping is a single molecule methodology for the rapid production of ordered restriction maps from individual DNA molecules. Ordered restriction maps are constructed by using fluorescence microscopy to visualize restriction endonuclease cutting events on individual fluorochrome-stained DNA molecules. Restriction enzyme cleavage sites are visible as gaps that appear flanking the relaxed DNA fragments (pieces of molecules between two consecutive cleavages). Relative fluorescence intensity (measuring the amount of fluorochrome binding to the restriction fragment) or apparent length measurements (along a well-defined "backbone" spanning the restriction fragment) have proven to provide accurate size-estimates of the restriction fragment and have been used to construct the final restriction map.
Such a restriction map created from one individual DNA molecule is limited in its accuracy by the resolution of the microscopy, the imaging system (CCD camera, quantization level, etc.), illumination and surface conditions. Furthermore, depending on the digestion rate and the noise inherent in the intensity distribution along the DNA molecule, one is likely to miss a small fraction of the restriction sites or to introduce spurious sites. Additionally, investigators may sometimes (rather infrequently) lack the exact orientation information (whether the left-most restriction site is the first or the last). Thus, given two arbitrary single molecule restriction maps for the same DNA clone obtained this way, the maps are expected to be roughly the same in the following sense: if the maps are "aligned" by first choosing the orientation and then identifying the restriction sites that differ by a small amount, then most of the restriction sites will appear at roughly the same place in both maps.
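The alignment notion described above can be sketched in code. The following is a minimal, hypothetical illustration (not the invention's actual procedure): a single-molecule map is represented as a list of cut positions normalized to [0, 1], both orientations of the second map are tried, and sites are considered matched when they differ by a small tolerance.

```python
# Hypothetical sketch of aligning two single-molecule restriction maps.
# A map is a list of cut positions normalized to [0, 1]; the second map
# may have been read in reverse orientation, so both orientations are
# tried and sites are matched within a small tolerance `tol`.

def match_count(map_a, map_b, tol):
    """Count sites of map_a lying within tol of some site of map_b."""
    return sum(1 for a in map_a if any(abs(a - b) <= tol for b in map_b))

def align_maps(map_a, map_b, tol=0.02):
    """Try both orientations of map_b; return (orientation, matches)."""
    forward = match_count(map_a, map_b, tol)
    reverse = match_count(map_a, [1.0 - b for b in map_b], tol)
    if forward >= reverse:
        return "forward", forward
    return "reversed", reverse
```

The tolerance `tol` and the normalization are assumptions made for illustration; in practice they would be derived from the sizing-error model.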
There are two approaches to further improve the accuracy and resolution of the maps: first, improve the chemical and optical processes to minimize the effect of each error source and second, use statistical approaches where the restriction maps of a large number of identical clones are combined to create a high-accuracy restriction map. These two approaches are not mutually exclusive and interesting trade-offs exist that can be exploited fruitfully.
For instance, in the original method, fluorescently-labeled DNA molecules were elongated in a flow of molten agarose containing restriction endonucleases, generated between a cover-slip and a microscope slide, and the resulting cleavage events were recorded by fluorescence microscopy as time-lapse digitized images. The second generation optical mapping approach, which dispensed with agarose and time-lapsed imaging, involves fixing elongated DNA molecules onto positively-charged glass surfaces, thus improving sizing precision as well as throughput for a wide range of cloning vectors (cosmid, bacteriophage, and yeast or bacterial artificial chromosomes (YAC or BAC)). Further improvements have recently come from many sources: development of a simple and reliable procedure to mount large DNA molecules with good molecular extension and minimal breakage; optimization of the surface derivatization, maximizing the range of usable restriction enzymes and retention of small fragments; and development of an open surface digestion format, facilitating access to samples and laying the foundations for automated approaches to mapping large insert clones.
Improvements have also come from powerful statistical tools that process a preliminary collection of single-molecule restriction maps, each one created from an image of a DNA molecule belonging to a pool of identical clones. Such restriction maps are almost identical, with small variations resulting from sizing errors, partially digested restriction sites and "false" restriction sites. Such maps can be combined easily in most cases. However, the underlying statistical problem poses many fundamental challenges; for example, the presence of some uncertainty in the alignment of a molecule (both orientation and/or matching in the sites) in conjunction with either false cuts or sizing error is sufficient to make the problem infeasible (NP-complete). M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness (W. H. Freeman and Co., San Francisco 1979).
In spite of the pessimistic results mentioned above, the problem admits efficient algorithms once the structure in the input is exploited. For instance, if the digestion rate is quite high, then by looking at the distribution of the cuts a good guess can be made about the number of cuts, and then only the data sets with large numbers of cuts can be combined to create the final map. J. Reed, Optical Mapping, Ph.D. Thesis, NYU, June 1997 (Expected). Other approaches have used formulations in which one optimizes a cost function and provides heuristics (as the exact optimization problems are often infeasible). In one approach, the optimization problem corresponds to finding weighted cliques; in another, the formulation corresponds to a 0-1 quadratic programming problem. S. Muthukrishnan and L. Parida, "Towards Constructing Physical Maps by Optical Mapping: An Effective Simple Combinatorial Approach," in Proceedings First Annual Conference on Computational Molecular Biology (RECOMB97), ACM Press, 209-215, 1997. However, these heuristics have only worked on limited sets of data and their efficacy (or approximability) remains unproven.
In accordance with our invention, we use a carefully constructed prior model of the cuts to infer the best hypothetical model by using Bayes' formula. The solution requires searching over a high-dimensional hypothesis space and is complicated by the fact that the underlying distributions are multimodal. Nevertheless, the search over this space can be accomplished without sacrificing efficiency. Furthermore, one can speed up the algorithm by suitably constraining various parameters in the implementation (but at the loss of accuracy or an increased probability of missing the correct map).
The main ingredients of this method are the following:
A model or hypothesis ℋ of the map of restriction sites.
A prior distribution of the data in the form of single molecule restriction map (SMRM) vectors:
Pr[Dj | ℋ].
Assuming pair-wise conditional independence of the SMRM vectors Dj,
Pr[Dj | Dj1, . . . , Djm, ℋ] = Pr[Dj | ℋ].
Thus, the prior distribution of the entire data set of SMRM vectors becomes

Pr[𝒟 | ℋ] = ∏j Pr[Dj | ℋ],
where index j ranges over the data set.
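In log space, the product above becomes a sum of per-molecule log-likelihoods. The following sketch illustrates this conditional-independence structure only; the Gaussian model of each observed cut as a perturbation of its nearest true cut is an assumption made for illustration, not the invention's actual conditional distribution.

```python
import math

# Hypothetical sketch: with the SMRM vectors Dj conditionally
# independent given hypothesis H, log Pr[D | H] is the sum over
# molecules of log Pr[Dj | H]. Each observed cut is modeled here, for
# illustration only, as a Gaussian perturbation (std. dev. sigma) of
# its nearest true cut in the hypothesis.

def molecule_log_likelihood(observed_cuts, true_cuts, sigma=0.01):
    """log Pr[Dj | H] under a simple Gaussian cut-position model."""
    ll = 0.0
    for c in observed_cuts:
        nearest = min(true_cuts, key=lambda t: abs(t - c))
        ll += (-0.5 * ((c - nearest) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi)))
    return ll

def dataset_log_likelihood(dataset, true_cuts, sigma=0.01):
    """log Pr[D | H] = sum over j of log Pr[Dj | H]."""
    return sum(molecule_log_likelihood(d, true_cuts, sigma) for d in dataset)
```

As expected under this model, data agreeing exactly with the hypothesized cut sites scores higher than perturbed data.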
A posterior distribution via Bayes' rule

Pr[ℋ | 𝒟] = Pr[𝒟 | ℋ] Pr[ℋ] / Pr[𝒟].
Here Pr[ℋ] is the unconditional distribution of the hypothesis and Pr[𝒟] is the unconditional distribution of the data.
Using this formulation, we search over the space of all hypotheses to find the most "plausible" hypothesis ℋ that maximizes the posterior probability.
The hypotheses ℋ are modeled by a small number of parameters Φ(ℋ) (e.g., number of cuts, distributions of the cuts, distributions of the false cuts, etc.). We have prior models for only a few of these parameters (number of cuts), and the other parameters are implicitly assumed to be equi-probable. Thus the model of Pr[ℋ] is rather simplistic. The unconditional distribution for the data Pr[𝒟] does not have to be computed at all, since it does not affect the choice of ℋ. In contrast, we use a very detailed model for the conditional distribution for the data given the chosen parameter values for the hypothesis. One can write the expression for the posterior distribution as:
log(Pr[Φ(ℋ) | 𝒟]) = L + Penalty + Bias,

where L ≡ Σj log(Pr[Dj | Φ(ℋ)]) is the likelihood function, Penalty = log Pr[Φ̂(ℋ)] and Bias = −log(Pr[𝒟]) = a constant. In these equations, Φ(ℋ) corresponds to the parameter set describing the hypothesis and Φ̂(ℋ) ⊂ Φ(ℋ) is a subset of parameters that have a nontrivial prior model. In the following, we shall often write ℋ for Φ(ℋ) when the context creates no ambiguity.
Also, note that the bias term has no effect, as it is a constant (independent of the hypothesis), and the penalty term has a discernible effect only when the data set is small. Thus, our focus is often on the term L, which dominates all other terms on the right-hand side.
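The decomposition above can be sketched numerically: when comparing two hypotheses against the same data, the bias term cancels, so only the likelihood term and the penalty term need be evaluated. The Poisson prior on the number of cuts used below is an assumption made for illustration, not the invention's actual prior model.

```python
import math

# Hypothetical sketch of scoring a hypothesis by L + Penalty (the Bias
# term, -log Pr[D], is constant across hypotheses and can be dropped
# when comparing them). As an illustrative prior on the number of cuts,
# a Poisson distribution with mean `expected_cuts` is assumed.

def log_posterior_score(log_likelihood, n_cuts, expected_cuts=5.0):
    """L + Penalty, with Penalty = log of a Poisson prior on cut count."""
    penalty = (n_cuts * math.log(expected_cuts) - expected_cuts
               - math.lgamma(n_cuts + 1))
    return log_likelihood + penalty
```

With equal likelihoods, a hypothesis whose cut count matches the prior expectation scores higher than one with many spurious cuts, which is the regularizing role the penalty term plays when the data set is small.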
As we will see, the posterior distribution Pr[ℋ | 𝒟] is multimodal, and the prior distribution Pr[Dj | ℋ] does not admit a closed-form evaluation (as it is dependent on the orientation and alignment with ℋ). Thus, we need to rely on iterative sampling techniques.
Thus the search method to find the most "plausible" hypothesis has two parts: we take a sample hypothesis and locally search for the most plausible hypothesis in its neighborhood using gradient search techniques, and we use a global search to generate a set of sample hypotheses and filter out all but the ones that are likely to be near plausible hypotheses.
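The two-part search can be sketched as follows. This is a deliberately simplified, hypothetical illustration: a hypothesis is reduced to a single scalar, the objective is an arbitrary smooth score standing in for the posterior, and the local gradient search is replaced by simple hill climbing.

```python
import random

# Hypothetical sketch of the two-part search: a global pass samples
# candidate hypotheses (here, scalars in [0, 1]) and filters out all
# but the highest-scoring ones; a local pass then refines each
# survivor in its neighborhood (hill climbing stands in for the
# gradient search used on the real posterior).

def local_refine(score, x, step=0.01, iters=100):
    """Hill-climb from x, halving the step when no move improves."""
    for _ in range(iters):
        best = max((x, x - step, x + step), key=score)
        if best == x:
            step /= 2
        x = best
    return x

def two_phase_search(score, n_samples=200, keep=5, seed=0):
    """Global sampling, filtering, then local refinement."""
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n_samples)]
    survivors = sorted(samples, key=score, reverse=True)[:keep]
    refined = [local_refine(score, x) for x in survivors]
    return max(refined, key=score)
```

Keeping several survivors before refining guards against the multimodality noted above: a purely local search from a single starting point could converge to a secondary mode of the posterior.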
Our approach using this Bayesian scheme enjoys many advantages:
One obtains the best possible estimate of the map given the data, subject only to the comprehensiveness of the model Φ(ℋ) used.
For a comprehensive model ℋ, estimates of Φ(ℋ) are unbiased and errors converge asymptotically to zero as data size increases.
Additional sources of error can be modeled simply by adding parameters to Φ(ℋ).
Estimates of the errors in the result can be computed in a straightforward manner.
The algorithm provides an easy way to compute a quality measure.
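One standard way such error estimates can be obtained, sketched below as an illustration rather than as the invention's actual computation, is from the curvature of the log-likelihood at its maximum (the observed Fisher information): a sharply peaked likelihood implies a small standard error on the fitted parameter.

```python
# Hypothetical sketch of an error estimate of the kind listed above:
# the standard error of a fitted parameter (e.g., a cut location) is
# approximated as sqrt(1 / -L''(theta_hat)), where L'' is the second
# derivative of the log-likelihood at its maximum, computed here by a
# central finite difference with spacing h.

def standard_error(log_likelihood, theta_hat, h=1e-4):
    """Approximate standard error from log-likelihood curvature."""
    second = (log_likelihood(theta_hat + h)
              - 2.0 * log_likelihood(theta_hat)
              + log_likelihood(theta_hat - h)) / (h * h)
    return (-1.0 / second) ** 0.5
```

For a Gaussian log-likelihood of n observations with known sigma, this recovers the familiar sigma/sqrt(n), consistent with the claim that errors converge to zero as the data size increases.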