Modern biology, particularly molecular biology, has focused in large part on understanding the structure, function, and interactions of essential macromolecules, such as nucleic acids and proteins, in living organisms. For decades, researchers have developed effective techniques, experimental protocols, and in vitro, in vivo, or in situ models to study these molecules. Knowledge has been accumulating relating to the physical and chemical traits of proteins and nucleic acids, their primary, secondary, and tertiary structures, their roles in various biochemical reactions or metabolic and regulatory pathways, the antagonistic or synergistic interactions among them, and the on and off controls as well as up and down regulations placed upon them in the intercellular environment. Advances in new technologies and the emergence of interdisciplinary sciences in recent years offer new approaches and additional tools for researchers to uncover unknowns in the mechanisms of nucleic acid and protein functions.
The evolving fields of genomics and proteomics are only two examples of such new fields that provide insight into the studies of biomolecules such as DNA, RNA, and protein. New technology platforms such as DNA microarrays and protein chips and new modeling paradigms such as computer simulations also promise to be effective in elucidating protein, DNA and RNA characteristics and functions. Single molecule optical mapping is another such effective approach for close and direct analysis of single molecules. See U.S. Pat. No. 6,294,136, the disclosure of which is fully incorporated herein by reference. The data generated from these studies (e.g., by manipulating and observing single molecules) constitute single molecule data. The single molecule data thus comprise, among other things, single molecule images, physical characteristics such as the length, shape and sequence, and restriction maps of single molecules. Single molecule data provide new insights into the structure and function of genomes and their constitutive functional units.
Images of single molecules represent a primary part of single molecule datasets. These images are rich with information regarding the identity and structure of biological matter at the single molecule level. It is, however, a challenge to devise practical ways to extract meaningful data from large datasets of molecular images. Bulk samples have conventionally been analyzed by simple averaging, dispensing with rigorous statistical analysis. However, proper statistical analysis, necessary for the accurate assessment of physical, chemical and biochemical quantities, requires larger datasets, and it has remained intrinsically difficult to generate these datasets in single molecule studies due to image analysis and file management issues. To benefit fully from single molecule data in studying nucleic acids and proteins, it is essential to process these images meaningfully and derive quality image data.
Effective methods and systems are thus needed to accurately extract information from molecules and their structures using image data. For example, a large number of images may be acquired in the course of a typical optical mapping experiment. To extract useful knowledge from these images, effective systems are needed for researchers to evaluate the images, to characterize DNA molecules of interest, and to assemble, where appropriate, the selected fragments thereby generating longer fragments or intact DNA molecules. This is particularly relevant in the context of building genome-wide maps by optical mapping, as demonstrated with the ~25 Mb P. falciparum genome (Lai et al., Nature Genetics 23:309-313, 1999).
The P. falciparum DNA, consisting of 14 chromosomes ranging in size from 0.6-3.5 Mb, was treated with either NheI or BamHI and mounted on optical mapping surfaces. Lambda bacteriophage DNA was co-mounted and digested in parallel to serve as a sizing standard and to estimate enzyme cutting efficiencies. Images of molecules were collected and restriction fragments marked, and maps of fragments were assembled or “contiged” into a map of the entire genome. Using NheI, 944 molecules were mapped with an average molecule length of 588 kb, corresponding to 23-fold coverage; 1116 molecules were mapped using BamHI with an average molecule length of 666 kb, corresponding to 31-fold coverage (Id. at FIG. 3). Thus, each single-enzyme optical map was derived from many overlapping fragments from single molecules. Data were assembled into 14 contigs, each one corresponding to a chromosome; the chromosomes were tentatively numbered 1, the smallest, through 14, the largest.
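The coverage figures above can be checked with a short calculation. The following is a minimal sketch, not the method used in the cited study: it assumes the average molecule lengths are expressed in kilobases (the only unit under which the stated coverage is consistent with a genome of roughly 25 Mb) and takes the genome size as a round 25 Mb. The function name fold_coverage and its parameters are hypothetical.

```python
def fold_coverage(n_molecules, mean_length_kb, genome_size_mb):
    """Estimate fold coverage as total mapped length divided by genome size.

    Assumptions (not from the source): lengths are in kb, and the genome
    size is supplied by the caller in Mb.
    """
    total_mapped_mb = n_molecules * mean_length_kb / 1000.0  # kb -> Mb
    return total_mapped_mb / genome_size_mb


# Molecule counts and average lengths as stated in the text;
# 25.0 Mb is an assumed round value for the "~25 Mb" genome.
nhei_coverage = fold_coverage(944, 588, 25.0)
bamhi_coverage = fold_coverage(1116, 666, 25.0)

print(f"NheI:  {nhei_coverage:.1f}-fold")
print(f"BamHI: {bamhi_coverage:.1f}-fold")
```

With a 25 Mb genome the estimates come out slightly below the reported 23-fold and 31-fold values; a somewhat smaller assumed genome size would close the gap, so the sketch should be read as an order-of-magnitude check only.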
Various strategies were applied to determine the chromosome identity of each contig. Restriction maps of chromosomes 2 and 3 were generated in silico and compared to the optical map; the remaining chromosomes lacked significant sequence information. Chromosomes 1, 4 and 14 were identified based on size. Pulsed field gel-purified chromosomes were used as a substrate for optical mapping, and their maps were aligned with a specific contig in the consensus map. Finally, for some chromosomes, chromosome-specific YAC clones were used. The resulting maps were aligned with specific contigs in the consensus map (Id. at FIG. 4). Thus, in this experiment multi-enzyme maps were generated by first constructing single enzyme maps which were then oriented and linked with one another. Such maps may be linked together by a series of double digestions, by the use of available sequence information, by mapping of YACs which are located at one end of the chromosome, or by Southern blotting.
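To illustrate what generating a restriction map in silico involves, the following minimal sketch digests a linear DNA sequence with NheI or BamHI and returns the ordered fragment sizes, which is the form of data an optical map would be compared against. It is an illustrative assumption rather than the procedure used in the cited study. The recognition sites (GCTAGC for NheI, GGATCC for BamHI, each cut after the first G) are standard; the names SITES and in_silico_digest and the toy sequence are hypothetical.

```python
import re

# Recognition site and cut offset (bases from the start of the site) for
# each enzyme. Both sites are palindromic, so scanning one strand suffices.
SITES = {
    "NheI": ("GCTAGC", 1),   # G^CTAGC
    "BamHI": ("GGATCC", 1),  # G^GATCC
}


def in_silico_digest(sequence, enzyme):
    """Return ordered fragment lengths (bp) for a linear molecule."""
    site, offset = SITES[enzyme]
    seq = sequence.upper()
    # A lookahead finds every occurrence, including overlapping ones.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy sequence (hypothetical) containing one NheI site and one BamHI site.
toy = "AAAA" + "GCTAGC" + "TTTTT" + "GGATCC" + "CC"
nhei_fragments = in_silico_digest(toy, "NheI")
bamhi_fragments = in_silico_digest(toy, "BamHI")
print(nhei_fragments, bamhi_fragments)
```

An ordered list of fragment sizes is chosen as the output because optical maps are themselves ordered restriction maps, so the two can be compared fragment by fragment. A real comparison would also need to tolerate sizing error and missed or spurious cuts, which this sketch does not model.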
In short, optical mapping is a powerful tool used to construct genome-wide maps. The data generated by optical mapping may be used subsequently in other analyses related to the molecules of interest, for example, the construction of restriction maps and the validation of DNA sequence data. There is accordingly a need for systems for visualizing, annotating, aligning and assembling single molecule fragments. Such systems should enable a user to effectively process single molecule images thereby generating useful single molecule data; such systems should also enable the user to validate the resulting data in light of the established knowledge related to molecules of interest. Robustness in handling large image datasets is desired, as is rapid user response.