1. Field of the Invention
The present invention relates to a sequence calibration method, and in particular to a seed-based sequence calibration method.
2. Description of the Related Art
A. The Development of Single-Molecule Sequencing Technology
Recently, there have been many breakthroughs in methods for genomic sequencing. Traditional sequencing methods use a group of amplified DNA molecules as a replicating template (Nucleic Acids Research 2000, v28, No. 20, e87; Nature 2005, v437, 376-380). Sequencing a group of template molecules means synthesizing more than approximately one thousand copies of a DNA template simultaneously in one reaction. Because a multi-molecule enzymatic reaction cannot be 100 percent synchronized, various errors may occur in individual molecules during a bulk nucleic acid polymerization reaction. As more nucleotides are added to a strand, these errors accumulate and produce increasingly mixed signals, so interpretation of the signals becomes progressively more difficult. Therefore, the read length and accuracy achievable with the traditional sequencing methods are limited. Also, the traditional sequencing methods are complex, which makes subsequent sequence assembly more difficult.
By contrast, methods for single molecule sequencing use one nucleic acid molecule as the template for a sequencing reaction (Proc. Natl. Acad. Sci., 100: 3960-64, 2003). Because only one molecule is read, the mixed signals caused by loss of synchronization among many template copies may be mitigated. Also, the length of recognizable sequences may be increased.
B. Using Repeated Sequencing and Sequence Calibration to Improve the Accuracy of Single Molecule DNA Sequencing
Despite the advantages of single molecule sequencing, the error rate of its raw data is much higher than that of traditional sequencing methods. Because the signal of a single fluorescent molecule is very weak, random errors produced in a single molecule sequencing reaction appear directly in the raw data. Unlike multi-molecule sequencing, single molecule sequencing cannot rely on ensemble averaging to suppress such errors. Thus, a low-cost, fast and accurate single molecule sequencing method is required. For example, if the sequencing reaction on a circular DNA molecule is performed repeatedly by rolling-circle amplification, the probability of random errors may decrease: repeated readings of the same DNA segment may be compared with one another and calibrated for error correction (US2006/0024711 A1, WO2009/017678 A2).
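As a minimal illustration of why repeated readings help, the following Python sketch takes a column-wise majority vote over several reads of the same segment. It is not the calibration method of the cited references or of this invention. It assumes, for simplicity, that the reads are already aligned, have equal length, and contain only substitution errors; the function name `consensus` and the example reads are hypothetical.

```python
from collections import Counter


def consensus(reads):
    """Column-wise majority vote over pre-aligned, equal-length repeated reads.

    Assumption: only substitution errors are present; insertions and
    deletions would require an alignment step that is omitted here.
    """
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch assumes pre-aligned reads of equal length")
    result = []
    for column in zip(*reads):
        # Pick the most frequent base observed at this position.
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Four hypothetical reads of one segment; two reads carry one random error each.
reads = ["ACGTTGCA", "ACGATGCA", "ACGTTGGA", "ACGTTGCA"]
print(consensus(reads))
```

Because the two errors fall at different positions, each is outvoted by the other three reads, and the vote recovers the sequence shared by the error-free reads.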
C. Analysis of the Prior Art
Traditional methods for sequence comparison, such as the Smith-Waterman, Needleman-Wunsch, FASTA, BLAST and FLAG methods, use dynamic programming algorithms and their derivatives as kernels. Most of these methods show a computing complexity higher than O(N^2) when multiple sequences need to be repeatedly compared. Moreover, these methods are based on models of sequence diversity arising from biological evolution, and may therefore introduce bias when used to compare sequences resulting from multiple reads of one replication template.
Comparative analysis of traditional algorithms for sequence comparison
Algorithm     Needleman-Wunsch (a)   FASTA (b)   Smith-Waterman (c)   BLAST2 (d)   FLAG (e)
Features      Global                 Global      Local                Local        Local
Complexity    >O(N^2)                >O(N^2)     >O(N^2)              >O(N^2)      O(N*logN)
Speed         Slow                   Slow        Slow                 Fast         Fast
Resource      High                   High        High                 High         Medium
(a) The Needleman-Wunsch algorithm
(b) FASTA software
(c) The Smith-Waterman algorithm
(d) The BLAST algorithm
(e) FLAG algorithm
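To make the quadratic cost of a dynamic programming kernel concrete, the following Python sketch computes a Needleman-Wunsch global alignment score. It fills one table cell for every pair of positions, so a single comparison of two sequences of length N takes on the order of N^2 steps, and comparing many repeated reads pairwise multiplies that cost further. The scoring values (match, mismatch, gap) are illustrative assumptions and are not taken from this document.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming.

    Work is proportional to len(a) * len(b); only two rows of the
    table are kept in memory. Scoring values are assumed for illustration.
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two bases, gap in `b`, gap in `a`.
            curr[j] = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


print(needleman_wunsch_score("ACGT", "AGT"))
```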
D. Comparison of Method/Algorithm for Calibrating Sequences with Repeated Formats with Other Related Inventions
In order to overcome the difficulties of traditional methods, which include complexity, slow speed and comparative bias, this invention discloses a seed-based, multi-layer calibration method.
The method uses a seed-based, multi-layer calibration algorithm as its kernel. First, sequence seed sets of various lengths are constructed on multiple process layers. Then, sequences are calibrated progressively, layer by layer, starting from the set of the longest seeds and moving down to shorter seeds. Because neither extensions nor best-path calculations are needed, the novel method reduces computing complexity and achieves high speed.
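The construction of seed sets on several layers might be sketched in Python as follows. This is only an illustration under stated assumptions: a seed is taken to be every contiguous substring of a given length, each layer is stored as a table mapping a seed to its start positions, and the names `build_seed_table` and `build_seed_layers` as well as the seed lengths are hypothetical. The calibration step itself is not shown.

```python
def build_seed_table(sequence, seed_length):
    """Map each seed (substring of seed_length) to the positions where it starts."""
    table = {}
    for start in range(len(sequence) - seed_length + 1):
        seed = sequence[start:start + seed_length]
        table.setdefault(seed, []).append(start)
    return table


def build_seed_layers(sequence, seed_lengths):
    """Build one seed table per layer, ordered from the longest seeds to the shortest."""
    return {
        length: build_seed_table(sequence, length)
        for length in sorted(seed_lengths, reverse=True)
    }


# Hypothetical sequence and seed lengths, chosen only for illustration.
layers = build_seed_layers("ACGTACGT", [2, 4])
for length, table in layers.items():
    print(length, table)
```

Ordering the layers from longest to shortest mirrors the described procedure, in which calibration starts from the set of the longest seeds and proceeds downward.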
The method of this invention can be applied to a fluorescence detection module output device of a nucleic acid single molecule sequencer based on rolling-circle replication. The resulting sequence read consists of repeated primer parts, which have a known sequence, and target DNA parts. Both parts can be identified by comparing the raw read to the known primer sequence. Next, the identified parts are extracted as sequences in repeated format. The extracted parts are then subjected to a sequence calibration process, which includes building seed tables (seed sets) and comparing the sequences by means of those tables. Because the probability of reading the same wrong base at the same position is much lower than that of reading the same correct base at the same position, sequences common to two repeats are more likely to represent the original sequence of the template.
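The extraction of repeats and the search for common seeds can be illustrated with the following Python sketch. It makes simplifying assumptions that do not hold for real raw data: the primer copies in the read are error-free, so they can be located by exact matching, and the read begins and ends with a complete primer copy. The primer, the read and the function names are hypothetical.

```python
def extract_repeats(read, primer):
    """Split a read into the target segments lying between primer copies.

    Assumption: primer copies are error-free, so exact matching suffices.
    """
    return [segment for segment in read.split(primer) if segment]


def common_seeds(first, second, seed_length):
    """Return the seeds of seed_length that occur in both repeats."""
    def seeds(sequence):
        return {
            sequence[i:i + seed_length]
            for i in range(len(sequence) - seed_length + 1)
        }
    return seeds(first) & seeds(second)


primer = "TTAA"                      # hypothetical known primer sequence
read = "TTAAACGTGCTTAAACGAGCTTAA"    # hypothetical read: primer, target, primer, target, primer
repeats = extract_repeats(read, primer)
print(repeats)
print(common_seeds(repeats[0], repeats[1], 3))
```

In this example the two extracted repeats differ at one position, so only the seed that avoids that position is shared by both. Such shared seeds are the candidates that are more likely to reflect the template.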
Whereas traditional methods require a large number of sequence comparisons, the novel method of this invention employs minimal process steps. Additionally, the novel method uses only seed-set comparisons, which allows it to achieve high speed.