The sequence of nucleotide bases present in strands of nucleotides, such as DNA and RNA, carries the genetic information encoding proteins and RNAs. The ability to accurately determine a nucleotide sequence is crucial to many areas in molecular biology. For example, the study of genetics relies on complete nucleotide sequences of the organism. Many efforts have been made to generate complete nucleotide sequences for various organisms, including humans, mice, worms, flies and microbes.
There are a variety of well-known methods to sequence nucleotides, including the Sanger dideoxy chain termination sequencing technique and the Maxam-Gilbert chemical sequencing technique. However, the current technology limits the length of a nucleotide sequence that may be sequenced. Techniques have been developed to sequence larger nucleotide sequences. In general, these methods involve fragmenting the large sequence into fragments, cloning the fragments, and sequencing the cloned fragments. The sequences can be fragmented through the use of restriction enzymes or mechanical shearing. Cloning techniques include the use of cloning vectors such as cosmids, bacteriophage, and yeast or bacterial artificial chromosomes (YAC or BAC). The nucleotide sequence of the fragments can then be compared, overlapping regions identified, and the sequences assembled to form “contigs,” which are sets of overlapping clones. By assembling the overlapping clones, it is possible to determine the sequence of nucleotide bases of the full length sequence. These methods are well known to those having ordinary skill in the art.
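The overlap-and-assemble step described above can be illustrated with a minimal sketch. It assumes error-free reads on a single strand and uses a simple greedy merge. The function names `overlap_length` and `greedy_assemble`, the `min_overlap` parameter, and the toy reads are hypothetical and are not taken from any particular assembler.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of reads with the longest exact overlap.

    Illustrative only: assumes no sequencing errors, no reverse
    complements, and no repeats longer than the overlaps.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps remain; the reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


contigs = greedy_assemble(["ATGCGT", "CGTACC", "ACCTTA"])
print(contigs)
```

With repeated sequences, a greedy merge of this kind can join the wrong pair of reads, which is one way the misassembly errors discussed in this background arise.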
The accuracy of nucleotide sequence data is limited by numerous factors. For example, there may be missing sections due to incomplete representation of the genomic DNA. There may also be spurious DNA sequences intermixed with the desired genomic DNA. Common sources of contamination are vector-derived DNA and host cell DNA. Also, the accuracy of the identification of bases tends to degrade toward the end of long sequence reads. Additionally, repeated sequences can create errors in the re-assembly and/or the mismatching of contigs.
In order to reduce the sequence data errors, sequencing of the fragments is generally performed multiple times. To help reduce errors such as mismatching or misassembly resulting from repeated sequences, the “hierarchical shotgun sequencing” approach (also referred to as “map-based,” “BAC-based” or “clone by clone”) can be used. This approach involves generating and organizing a set of large insert clones covering the genome and separately performing shotgun sequencing on appropriately selected clones. Because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced.
Other known sequencing and characterization techniques involve generating restriction fragment fingerprints to determine whether close overlaps are present, thereby assembling the BACs into fingerprint clone contigs. Fingerprint clone contigs can be positioned along the chromosome by anchoring them with sequence-tagged site (STS) markers from existing genetic and physical maps. These fingerprint clone contigs can be associated with specific STSs by probe hybridization or direct search of the sequenced clones. Clones can also be positioned by fluorescence in situ hybridization. Each of these known techniques is costly and time-consuming.
Another approach for characterizing nucleotide sequences involves the use of ordered restriction maps of single molecules. One specific technique used to produce single molecule ordered restriction maps is “Optical Mapping”. Optical mapping is a single molecule methodology for the rapid production of ordered restriction maps from individual DNA molecules. Ordered restriction maps are preferably constructed using fluorescence microscopy to visualize restriction endonuclease cutting events on individual fluorochrome-stained DNA molecules. Restriction enzyme cleavage sites are visible as gaps that appear flanking the relaxed DNA fragments (pieces of molecules between two consecutive cleavages). Relative fluorescence intensity (measuring the amount of fluorochrome binding to the restriction fragment) or apparent length measurements (along a well-defined “backbone” spanning the restriction fragment) have proven to provide accurate size-estimates of the restriction fragment and have been used to construct the final restriction map.
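As a rough illustration of sizing by relative fluorescence intensity, the sketch below apportions a molecule's total size among its fragments in proportion to their integrated intensities. It assumes that the total size is known independently (for example, from a sizing standard) and that fluorochrome binding is uniform along the molecule. The function name `sizes_from_intensity` and the numbers are hypothetical.

```python
from typing import List


def sizes_from_intensity(intensities: List[float], total_kb: float) -> List[float]:
    """Estimate fragment sizes (kb) as shares of a known total size.

    Each fragment's share equals its fraction of the summed integrated
    fluorescence intensity (assumes uniform staining).
    """
    total_intensity = sum(intensities)
    if total_intensity <= 0:
        raise ValueError("intensities must sum to a positive value")
    return [total_kb * x / total_intensity for x in intensities]


# Three fragments, in the order observed along the molecule.
ordered_map = sizes_from_intensity([200.0, 500.0, 300.0], total_kb=50.0)
print(ordered_map)
```

Because the fragments are listed in the order in which they appear along the molecule, the resulting vector of sizes is itself an ordered restriction map.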
Such a restriction map created from one individual DNA molecule is limited in its accuracy by the resolution of the microscopy, the imaging system (CCD camera, quantization level, etc.), illumination and surface conditions. Furthermore, depending on the digestion rate and the noise inherent to the intensity distribution along the DNA molecule, there is some probability of missing a small fraction of the restriction sites or of introducing spurious sites. Additionally, investigators may occasionally lack the exact orientation information (whether the left-most restriction site is the first or the last). Thus, given two arbitrary single molecule restriction maps for the same DNA clone obtained this way, the maps are expected to be roughly the same in the following sense: if the maps are “aligned” by first choosing the orientation and then identifying the restriction sites that differ by a small amount, then most of the restriction sites will appear at roughly the same place in both maps.
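The notion of two maps being "roughly the same" can be sketched as follows. The example represents each map as a list of restriction site positions, tries both orientations of the second map, and counts the sites that agree within a tolerance. This is only an illustration of the idea stated above, not the alignment method of any cited publication. The function names, the tolerance, and the site positions are hypothetical.

```python
from typing import List, Tuple


def count_matched_sites(a: List[float], b: List[float], tol: float) -> int:
    """Count sites in sorted lists `a` and `b` that lie within `tol` of each other."""
    i = j = matched = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol:
            matched += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # unmatched site in a: missed in b or spurious in a
        else:
            j += 1  # unmatched site in b: missed in a or spurious in b
    return matched


def best_orientation(a: List[float], b: List[float],
                     length_b: float, tol: float) -> Tuple[bool, int]:
    """Return (flip_b, matched_sites) for the better of the two orientations."""
    forward = count_matched_sites(sorted(a), sorted(b), tol)
    flipped_b = sorted(length_b - x for x in b)
    reverse = count_matched_sites(sorted(a), flipped_b, tol)
    return (reverse > forward, max(forward, reverse))


# Map b is the same 100 kb molecule imaged in the opposite orientation,
# with one missed site, one spurious site, and small sizing noise.
map_a = [10.0, 25.0, 40.0, 70.0]
map_b = [30.8, 50.0, 74.5, 89.2]
print(best_orientation(map_a, map_b, length_b=100.0, tol=1.5))
```

In this toy case the flipped orientation matches three of the four sites in the first map, while the unflipped orientation matches none.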
For instance, in the original method, fluorescently-labeled DNA molecules were elongated in a flow of molten agarose containing restriction endonucleases, generated between a cover-slip and a microscope slide, and the resulting cleavage events were recorded by fluorescence microscopy as time-lapse digitized images. The second generation optical mapping approach, which dispensed with agarose and time-lapse imaging, involves fixing elongated DNA molecules onto positively-charged glass surfaces, thus improving sizing precision as well as throughput for a wide range of cloning vectors (cosmid, bacteriophage, and yeast or bacterial artificial chromosomes (YAC or BAC)).
A DNA sequence map is an “in silico” ordered restriction map that is obtained for a nucleotide sequence by simulating a restriction enzyme digestion process. The sequence data is analyzed and restriction sites are identified in a predetermined manner. The resulting sequence map consists of a piece of identification data plus a vector of fragments, whose elements encode the fragment sizes in base pairs.
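A simulated digestion of this kind can be sketched in a few lines. The example assumes a linear sequence, a single enzyme, and complete digestion. The `SequenceMap` class, the `in_silico_digest` function, and the toy sequence are hypothetical names chosen for illustration. The recognition site GAATTC with a cut after the first base corresponds to EcoRI.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SequenceMap:
    """Identification data plus an ordered vector of fragment sizes (bp)."""
    name: str
    fragments: List[int]


def in_silico_digest(name: str, sequence: str, site: str,
                     cut_offset: int) -> SequenceMap:
    """Simulate a complete digest of a linear sequence by one enzyme.

    `site` is the recognition sequence; `cut_offset` is the number of
    bases from the start of the site to the cut position.
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = 0
    while True:
        pos = sequence.find(site, start)
        if pos == -1:
            break
        cuts.append(pos + cut_offset)
        start = pos + 1  # also find overlapping occurrences of the site
    # Fragment sizes are the distances between consecutive cut positions,
    # with the two ends of the molecule as the outer boundaries.
    bounds = [0] + cuts + [len(sequence)]
    fragments = [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
    return SequenceMap(name, fragments)


example = in_silico_digest("toy_clone", "AAGAATTCGGGGGAATTCTT", "GAATTC", 1)
print(example)
```

The fragment sizes always sum to the length of the input sequence, which gives a simple sanity check on any such simulated map.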
Sequenced clones can be associated with fingerprint clone contigs in the physical map by using the sequence data to calculate a partial list of restriction fragments in silico and comparing that list with the experimental database of BAC fingerprints. Genomic consensus maps are generated from optical maps using, e.g., “Gentig” software, a conventional software package that generates optical ordered restriction maps.
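The comparison of an in silico fragment list with a fingerprint database can be sketched as a simple ranking by shared fragment sizes. The example assumes that each fingerprint is an unordered list of band sizes in base pairs and that two sizes match when they differ by no more than 5% of the band size. The tolerance, the function names, and the clone names `BAC_A` and `BAC_B` are all hypothetical.

```python
from typing import Dict, List, Tuple


def shared_fragments(query: List[int], fingerprint: List[int],
                     rel_tol: float = 0.05) -> int:
    """Count query fragments that match a fingerprint band within rel_tol."""
    q = sorted(query)
    f = sorted(fingerprint)
    i = j = shared = 0
    while i < len(q) and j < len(f):
        if abs(q[i] - f[j]) <= rel_tol * f[j]:
            shared += 1
            i += 1
            j += 1
        elif q[i] < f[j]:
            i += 1
        else:
            j += 1
    return shared


def rank_fingerprints(query: List[int], database: Dict[str, List[int]],
                      rel_tol: float = 0.05) -> List[Tuple[str, int]]:
    """Rank database clones by the number of fragments shared with the query."""
    scores = {name: shared_fragments(query, bands, rel_tol)
              for name, bands in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Partial in silico fragment list for a sequenced clone (bp).
query = [1200, 3400, 5100]
database = {
    "BAC_A": [800, 1250, 3350, 5000, 9000],
    "BAC_B": [600, 2100, 4400, 7700],
}
print(rank_fingerprints(query, database))
```

Because fingerprints carry no fragment order, this comparison uses sizes alone, unlike a comparison between ordered restriction maps.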
It was previously unknown how to determine the accuracy of the DNA sequence maps. Indeed, such a determination was either impossible or provided only a low level of certainty. It is one of the objects of the present invention to enable a validation of the DNA ordered sequence maps against the optical maps. Another object of the present invention is to enable an alignment and reordering of the DNA sequence maps based on the optical mapping.
Approaches to aligning or reconstructing restriction maps have been described in E. W. Myers et al., “An O(N² lg N) Restriction Map Comparison and Search Algorithm”, Bulletin of Mathematical Biology, 54(4):599-618, 1992; R. M. Karp et al., “Algorithms for Optical Mapping”, RECOMB 98, 1998; Parida, L., A Uniform Framework for Ordered Restriction Map Problems, Journal of Computational Biology, Vol 5, No 4, Mary Ann Liebert Inc. Publishers, pp 725-739, 1998; Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997; and Lee, J. K., Dancik, V., and M. S. Waterman, “Estimation for restriction sites observed by optical mapping using reversible-jump Markov Chain Monte Carlo”, J. Comp. Biol., 5, 505-516, 1997. However, none of these publications discloses the novel processes and systems described herein below.