Determination of the sequence of DNA and other nucleic acids is currently performed mostly by the classical dideoxy sequencing method. Dideoxy sequencing reactions provide readouts of short (100-900 bp) sequence fragments per experiment (ref. 20,21).
If the sequence of the original nucleic acid that is targeted for sequencing is longer than the relatively short readout length of dideoxy sequencing reactions, a “shotgun” method is employed. In the first step of the shotgun method, the original nucleic acid is chemically broken in a random fashion into a set of fragments of shorter length. In practice, a large number of identical copies of the original nucleic acid molecule are broken simultaneously and independently in the course of a single chemical reaction, thus resulting in a multitude of overlapping fragments. In the second step of the shotgun method, a number of individual fragments are sampled by well-established methods, sequenced and assembled, revealing the sequence of the original nucleic acid (ref. 20,21).
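The two steps of the shotgun method can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: breakage is modeled as cutting each copy at uniformly random intervals, and the function names, parameters, and the example target sequence are hypothetical and do not come from the cited references.

```python
import random


def shotgun_fragments(target, n_copies, min_len, max_len, rng):
    """Break n_copies identical copies of `target` independently at random points.

    Returns a list of (start, sequence) tuples. Names and parameters are
    hypothetical and chosen only for illustration.
    """
    fragments = []
    for _ in range(n_copies):
        pos = 0
        while pos < len(target):
            length = rng.randint(min_len, max_len)  # inclusive bounds
            fragments.append((pos, target[pos:pos + length]))
            pos += length
    return fragments


def coverage_redundancy(fragments, target_len):
    """Average number of fragments covering each base of the target."""
    return sum(len(seq) for _, seq in fragments) / target_len


rng = random.Random(0)  # fixed seed so the sketch is reproducible
target = "ATGGCGTGCATTAGCCGATACGGATCCTAGGCTAACG"  # made-up example sequence

# Step one: many identical copies are broken independently.
pool = shotgun_fragments(target, n_copies=5, min_len=6, max_len=12, rng=rng)

# Step two: a number of individual fragments are sampled for sequencing.
sample = rng.sample(pool, k=10)
```

Because each copy is cut at different positions, fragments from different copies overlap one another, which is what later makes assembly possible.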
The assembly of sequenced fragments proceeds in a domino-like fashion. Sequence similarity between the end of one fragment and the beginning of the subsequent fragment guides the assembly process. Fragment assembly is recognized as a major bottleneck in many ongoing large-scale DNA sequencing efforts (ref. 20,21).
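The domino-like chaining described above can be sketched as follows. This is a minimal illustration, assuming error-free fragments, exact suffix-prefix matches, and extension in one direction only; the function names and the `min_len` threshold are hypothetical, and real assemblers use tolerant alignment rather than exact matching.

```python
def overlap_length(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no match of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_chain(fragments, min_len=3):
    """Start from the first fragment and repeatedly append the fragment
    whose beginning best matches the current end of the growing sequence."""
    contig = fragments[0]
    remaining = list(fragments[1:])
    while remaining:
        best = max(remaining, key=lambda f: overlap_length(contig, f, min_len))
        k = overlap_length(contig, best, min_len)
        if k == 0:
            break  # no fragment continues the chain
        contig += best[k:]
        remaining.remove(best)
    return contig


reads = ["ATGCGT", "CGTACC", "ACCGGA"]  # made-up fragments
assembled = greedy_chain(reads)
```

In this toy input each fragment shares exactly three bases with its neighbor, so the chain is unambiguous; the difficulties discussed below arise when it is not.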
Most currently available programs for sequence assembly such as Phrap, CAP3, TIGR, and Celera assembler use the classical “Overlap-Layout-Consensus” (“OLC”) fragment assembly method (ref 1, 3, 5, 13, 16, 24). The main problem occurs in the “overlap” step: the overlap comparisons typically result in a large number of either spurious matches or undetected matches between fragments due to the uneven coverage of the target sequence, short overlaps between fragments, dideoxy sequencing errors, and, most importantly, due to the presence of repetitive elements.
To see how the presence of repetitive elements causes an assembly problem, consider a situation where the sequence at the end of one fragment matches the sequence at the beginning of another fragment. This situation may arise either because the two fragments overlap in the original sequence or because each fragment overlaps with a distinct occurrence of the same repeated sequence while the fragments themselves do not overlap in the original sequence. In the former case, the fragments should be assembled together, while in the latter they should not. The problem is that the currently available methods do not correctly discriminate between the two situations. Thus, a large number of erroneous overlaps induced by the presence of repeats results in an erroneous assembly. The problem is further exacerbated by other factors such as low redundancy of coverage of the region of interest by sequence fragments, uneven coverage by fragments, chimerism of fragments, contamination of fragments by vector and other foreign DNA sequence, and the presence of dideoxy sequencing errors.
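The two situations can be made concrete with a toy example. The sketch below assumes a made-up 60-base sequence containing two copies of a 14-base repeat; all sequences, coordinates, and function names are hypothetical. It shows that a sequence-only overlap test reports a match in both cases, even though only one pair of fragments actually overlaps in the original sequence.

```python
def suffix_prefix_match(left, right, min_len):
    """Longest suffix of `left` equal to a prefix of `right` (0 if below min_len)."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def truly_overlap(a, b):
    """True if two (start, end, sequence) fragments share original coordinates."""
    return a[0] < b[1] and b[0] < a[1]


REPEAT = "ACGTTGCAAGGCTC"  # hypothetical repeated element, 14 bases
genome = "CCGTAAGCTT" + REPEAT + "TTGGCCAATCGG" + REPEAT + "AGCTCGATAC"


def fragment(start, end):
    return (start, end, genome[start:end])


first = fragment(2, 24)    # ends with the first copy of the repeat
second = fragment(36, 58)  # begins with the second copy of the repeat
third = fragment(16, 38)   # genuinely overlaps `first` by 8 bases

false_match = suffix_prefix_match(first[2], second[2], min_len=8)
true_match = suffix_prefix_match(first[2], third[2], min_len=8)
```

Here `false_match` is 14 although `first` and `second` lie 12 bases apart, while `true_match` is 8 for a real overlap; judged by sequence alone, the repeat-induced match even looks like the stronger of the two.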
A number of methods are used in practice in order to reduce the problem posed by the presence of repetitive elements: (1) databases containing sequences of repetitive elements are used to screen repeat-containing fragments prior to the “overlap” stage, thus preventing erroneous assembly due to known repeats (ref. 20,21); (2) repeat-induced false overlaps between fragments are partially eliminated in the “layout” step of the OLC method by heuristic procedures (ref. 16); (3) specialized repeat-finding methods are developed to recognize repeats in the overlap stage of the OLC method and then feed the information about repeat-induced overlaps back to the assembly method (ref. 17); (4) repeat finding is accomplished through detection of “tangle” structures in the Eulerian Path assembly method (ref. 19).
None of these methods has been very successful in dealing with repeats on a large scale. In fact, it has been recognized that the problem posed by repetitive DNA cannot even in principle be solved by methods that assemble fragments shorter than the length of the longest repeated sequence (ref. 19). Thus, a need has been recognized to obtain and utilize higher-level information about relative positions of sequence fragments in order to enable correct sequence assembly (ref. 20,21).
One approach to solving the assembly problem is the so-called double-barrelled DNA sequencing method (ref. 22, 12, 15). The method uses the information about approximate distances of fragments that are sequenced from opposing ends of a clone insert of known approximate size. Such fragments have been referred to as “mate pairs”. The distances between mate pairs have been applied in the framework of the same OLC method and have led to some improvement (ref. 8).
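One way mate-pair distances can be used is to check whether a candidate layout places the two mates at roughly the expected separation. The following is a minimal sketch, assuming each mate pair carries an expected insert size and a tolerance; the `MatePair` class, its fields, and the numeric values are hypothetical and are not taken from any of the cited assemblers.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    left_pos: int     # layout coordinate where the first mate begins
    right_pos: int    # layout coordinate where the second mate ends
    insert_size: int  # approximate known size of the clone insert
    tolerance: int    # allowed deviation from the insert size


def is_consistent(pair):
    """True if the layout separates the mates by about the insert size."""
    observed = pair.right_pos - pair.left_pos
    return abs(observed - pair.insert_size) <= pair.tolerance


def fraction_consistent(pairs):
    """Share of mate pairs whose placement agrees with the clone size."""
    return sum(is_consistent(p) for p in pairs) / len(pairs)


pairs = [
    MatePair(left_pos=100, right_pos=2100, insert_size=2000, tolerance=200),
    MatePair(left_pos=500, right_pos=5600, insert_size=2000, tolerance=200),
]
score = fraction_consistent(pairs)
```

A layout produced by a repeat-induced false overlap tends to place some mates far too close together or too far apart, so a low fraction of consistent pairs is a hint that the layout is wrong.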
While it provides useful information and helps bridge repeat-induced gaps in the assembly process, the double-barrelled method has a number of disadvantages. First, it is more costly to sequence and correctly track mate pairs than individual random fragments. In order to obtain the distance between mate pairs, clone sizes may need to be measured and controlled. Many clones may be sequenced from one end only, thus resulting in mispaired fragments. Chimeric fragments often result in mispaired mates, thus introducing error in the assembly process.
While the double-barrelled DNA method can recognize and eliminate a fraction of repeat-induced overlaps, it cannot help completely close the gaps in the assembly of repeat-rich regions. However, the information about the distances between mate pairs can at least indicate the approximate size of gaps between assembled “islands”, thus guiding further “gap filling” efforts.
Instead of producing a completely assembled sequence, double-barrelled DNA sequencing results in a “scaffold” (ref. 18), which is defined as a partially assembled sequence consisting of numerous assembled islands that are separated by gaps of known approximate size. The recently published initial draft of the human genome sequence is in fact a scaffold consisting of a huge number of assembled islands (ref. 21).
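A scaffold as defined above can be represented as an ordered list of islands together with the estimated gap size between each consecutive pair. The sketch below is an illustration only; the function names are hypothetical, and writing each gap as a run of `N` characters is a common display convention assumed here rather than something specified in the cited references.

```python
def scaffold_span(islands, gap_sizes):
    """Total length covered by the scaffold, gaps included."""
    if len(gap_sizes) != len(islands) - 1:
        raise ValueError("need exactly one gap between each pair of islands")
    return sum(len(island) for island in islands) + sum(gap_sizes)


def render_scaffold(islands, gap_sizes):
    """Join islands in order, writing each gap as a run of 'N' placeholders."""
    if len(gap_sizes) != len(islands) - 1:
        raise ValueError("need exactly one gap between each pair of islands")
    parts = [islands[0]]
    for gap, island in zip(gap_sizes, islands[1:]):
        parts.append("N" * gap)
        parts.append(island)
    return "".join(parts)


islands = ["ACGT", "GG", "TTA"]  # made-up assembled islands
gaps = [3, 2]                    # estimated gap sizes from mate-pair distances
rendered = render_scaffold(islands, gaps)
```

The rendered string keeps the order and spacing of the islands while making explicit which positions remain unsequenced.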
A second approach to solving the assembly problem is the so-called “hierarchical shotgun sequencing” (ref. 20), also referred to as “map-based”, “BAC-based”, or “clone-by-clone”. This approach involves generating and organizing a set of large-insert clones (typically 100-200 kb each) covering the target nucleic acid and separately performing shotgun sequencing on appropriately chosen clones. The reasoning behind the method is that, because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced.
One problem with the hierarchical shotgun approach is that the shotgun step needs to be repeated for every clone. In order to sequence the genome of a human or a rodent, tens of thousands of clones need to be subjected to the shotgun step. A shotgun at the level of the whole genome is less costly and easier from an operational point of view.
The double-barrelled and hierarchical shotgun approaches have been combined in practice. As part of the human genome sequencing effort (ref. 20,21), a certain number of fragments are obtained by performing shotgun sequencing of individual clones, another number of fragments are obtained by performing shotgun sequencing at the level of the complete original nucleic acid (e.g., whole genome), and some of the fragments from either set are mate pairs.
The two methods for producing fragment assembly (ref. 16,24) that were used in the actual assembly of the initial draft sequence of the human genome (ref. 20,21) implement the OLC method. They utilize the following pieces of information in the assembly process: (1) sequences of fragments of nucleic acids (for example, sequences of fragments obtained by the shotgun method); (2) sequences of repetitive elements that were identified prior to the current assembly (in order to avoid false fragment overlaps in the layout stage); (3) information about false fragment overlaps (for example, about those overlaps induced by the presence of as yet unknown repetitive elements); (4) information about the approximate distance between pairs of fragments (for example, the distance between mate pairs resulting from the sequencing of the ends of clones of known size as part of the double-barrelled shotgun method); (5) information about localization of fragments within the same subregion (for example, within the same BAC clone in the case of human genome sequencing); and (6) information about overlaps and the relative order of subregions (for example, overlaps and relative order of BACs).
In addition to the classical OLC method, Idury and Waterman (ref. 6) and Pevzner (ref. 19) proposed the Eulerian Path method for sequence assembly. The Eulerian Path method is based on ideas that came from the method of Sequencing By Hybridization (“SBH”), as outlined in U.S. Pat. No. 5,202,231. On one hand, the SBH problem is computationally similar to the problem of dideoxy fragment assembly. On the other hand, the SBH fragments are obtained by probe-specific hybridization and are thus much shorter (typically 5-30 bp long) than the fragments obtained by dideoxy sequencing (typically 100-900 bp long) (ref. 10).
The similarity between the two fragment assembly problems is such that the OLC method was also initially used in an attempt to solve the SBH problem (ref. 2, 7). The OLC method did not scale up computationally. In 1989, Pevzner (ref. 11) proposed the Eulerian Path method for the SBH problem, which overcame the scale-up problem, at least in the ideal case of error-free data.
Idury and Waterman (ref. 6) applied the Eulerian Path method to the problem of assembling longer fragments obtained by dideoxy sequencing. The first step in the Idury-Waterman method is to break the fragments into yet smaller overlapping subfragments and then, in the second step, to assemble the subfragments by applying the Eulerian Path method proposed by Pevzner. Unfortunately, the Idury-Waterman approach did not scale up well due to the sequencing errors that occur in practical dideoxy sequencing. An error-correction step and a repeat-detection step have recently been proposed that significantly improve the performance of the Eulerian Path method (ref. 19). Despite the improvements, the Eulerian Path method still cannot accommodate information about fragment distances (for example, mate pair information) and cannot calculate scaffolds, and thus it still cannot compete with other assemblers (ref. 16, 24) in the assembly of genomes of higher organisms. Nevertheless, the method appears to perform well on the assembly of small genomes (for example, bacterial genomes around 2 Mb in size) when the redundancy of fragment coverage is high.
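The two steps of the Idury-Waterman approach described above can be sketched in a few lines. This is a simplified illustration and not the published implementation: it assumes error-free fragments, a single strand, and a target in which no subsequence of length k-1 occurs twice, so that the Eulerian path is unique. The function names and the choice of k are hypothetical.

```python
from collections import defaultdict


def subfragments(read, k):
    """Step one: break a fragment into overlapping subfragments of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Each distinct subfragment becomes an edge from its first k-1 bases
    to its last k-1 bases."""
    distinct = set()
    for read in reads:
        distinct.update(subfragments(read, k))
    graph = defaultdict(list)
    for sub in sorted(distinct):
        graph[sub[:-1]].append(sub[1:])
    return graph


def eulerian_path(graph):
    """Step two: visit every edge exactly once (Hierholzer's algorithm)."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for node in targets:
            in_degree[node] += 1
    nodes = set(graph) | set(in_degree)
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_degree[n] == 1),
        min(nodes),
    )
    unused = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if unused.get(node):
            stack.append(unused[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the path: first node, then one base per step."""
    path = eulerian_path(build_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]  # made-up error-free fragments
result = assemble(reads, k=4)
```

A single sequencing error would introduce spurious subfragments and therefore spurious edges, which illustrates why the approach is sensitive to error in practical dideoxy data.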
Significantly, none of the assembly methods discussed above utilize the information about known sequences of different but similar nucleic acids. On the other hand, U.S. Pat. No. 6,001,562 discloses a method where sequence similarity between two nucleic acids is determined by aligning fragments from one nucleic acid against the second nucleic acid. However, U.S. Pat. No. 6,001,562 does not teach the method of using the relative position information between distant fragments; nor does it teach the method of inferring repetitive subsequences; nor does it teach localizing fragments to specific cloned DNA fragments; nor the method of inferring relative position of cloned fragments, as described in the present invention.
It follows from the foregoing that none of the prior art references teaches a means of using a known sequence of a nucleic acid to drive the assembly of the sequence of another related nucleic acid by inferring distance and orientation of subsequences of the related nucleic acid, with the partial exception of U.S. Pat. No. 6,001,562, which teaches a method for using such information when the subsequences overlap. However, U.S. Pat. No. 6,001,562 does not teach a method for using such information when the subsequences do not overlap but instead occur at a distance from each other. Thus, even U.S. Pat. No. 6,001,562 teaches a method of utilizing only a small fraction of the information that is useful in overcoming the critical problem of sequence assembly in the presence of repetitive nucleic acid sequences.