1. Field of the Invention
The invention relates generally to computational genomics and, more particularly, to DNA sequence alignment.
2. Background Art
DNA sequence alignment is one of the most important data processing tasks in computational genomics. The task appears in several applications that require efficient manipulation of large data records arranged in several different ways. In DNA fingerprinting, for example, an unknown collection of DNA fragments is acquired, each fragment typically a few tens to a few thousand bases long. This unknown collection is then compared with one of several known collections of DNA fragments contained in a library. Either or both of these collections might be incomplete, unordered, or contain errors, including symbol insertions and symbol deletions. Finding a match between collections establishes genome identity.
A different challenge is posed by the problems of pathogen detection and gene finding. In these cases, instead of a library of known DNA fragments, a specific DNA pattern is often given. The pattern may be part of a pathogen signature or may indicate the start of a coding region. This relatively short sequence is then compared with a sequence that can be several million bases long. A match of the pattern with a specific region of the analyzed sequence confirms previous pathogen exposure or identifies an exon.
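The comparison of a short pattern against a long sequence can be illustrated with a minimal sketch. It assumes matching by substitution count only (Hamming distance) and does not handle symbol insertions or deletions. The function name `find_pattern` and the parameter `max_mismatches` are hypothetical and are introduced here only for illustration.

```python
def find_pattern(sequence, pattern, max_mismatches=0):
    """Return start offsets where `pattern` matches `sequence`
    with at most `max_mismatches` substituted bases.

    Illustrative brute-force scan; insertions and deletions
    are not considered.
    """
    hits = []
    m = len(pattern)
    if m == 0:
        return hits  # an empty pattern is treated as having no matches
    for start in range(len(sequence) - m + 1):
        mismatches = 0
        for a, b in zip(sequence[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many substitutions at this offset
        else:
            # inner loop finished without exceeding the mismatch budget
            hits.append(start)
    return hits


exact_hits = find_pattern("ACGTTACGTA", "ACGT")
near_hits = find_pattern("ACGTTACGTA", "ACGA", max_mismatches=1)
```

A scan of this kind takes time proportional to the product of the sequence length and the pattern length, which is one reason faster methods are sought for sequences that are millions of bases long.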
Related problems occur in comparative genomics and evolutionary tree reconstruction. The goal in these applications is to identify and align islands of similarity in two or more long DNA sequences. Consensus between significant parts of the sequences, including coding and conserved noncoding regions, indicates a functional relationship or evolutionary proximity.
Many approaches to DNA sequence alignment have been proposed over the last two decades. They include the Needleman-Wunsch (NW) algorithm, the Smith-Waterman (SW) algorithm, FASTA, BLAST, MUMmer, REPuter, and MAFFT. NW, SW, and FASTA are based on dynamic programming; BLAST utilizes a heuristic search; MUMmer and REPuter rely on suffix trees; and MAFFT performs an FFT-based cross correlation. Many other methods belong to one of these four groups. The methods vary in the length of the query sequence allowed, the degree of sequence similarity required, the treatment of gaps, the type of alignment (global or local), and speed and accuracy. Each of these methods has shortcomings, and because the amount of genomic data is growing much faster than computing technology is improving, new techniques that deliver both computational efficiency and alignment accuracy are desired.
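As an illustration of the dynamic programming group, the following is a minimal sketch of global alignment scoring in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name `needleman_wunsch_score` are assumptions chosen for the example. The sketch returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic programming table to limit memory use.
    """
    # Row 0: aligning the empty prefix of `a` with prefixes of `b`
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of `b`
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in `b`
            left = curr[j - 1] + gap  # gap inserted in `a`
            curr.append(max(diag, up, left))
        prev = curr
    return prev[len(b)]


identical_score = needleman_wunsch_score("GATTACA", "GATTACA")
one_gap_score = needleman_wunsch_score("ACGT", "ACT")
```

The table has one cell per pair of positions, so the running time grows with the product of the two sequence lengths. This cost illustrates why computational efficiency is a concern for long sequences.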