Biological sequences such as DNA, RNA, and amino acid sequences did not arise ab initio; they share a common ancestry and similar selection constraints. As a result, elements may be conserved, motifs can be repeated, regions can be hyper-mutated or deleted, and segments can be inserted and reinserted over and over. A key focus in bioinformatics has been to improve the ability to compare large numbers of these sequences against each other. This process can start with aligning two or more sequences using an algorithm that optimizes an alignment score, and often ends with organizing a set of sequences in a global tree structure whose tree distances roughly correspond to the evolutionary distances. Both the score and the distance functions can be determined by the underlying stochastic processes that model genome evolution, and should be represented flexibly in order to remain faithful to the biology. However, this kind of generality often comes at the cost of computational efficiency. Such a dilemma can be resolved by relying on simple algorithms and quasi-local cost functions (e.g., a linear gap penalty), and by applying these algorithms preferably only to short subsequences after the most unlikely candidates have been discarded.
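To make the notion of a score-optimizing alignment with a linear gap penalty concrete, the following is a minimal sketch of global alignment scoring by dynamic programming (the classic Needleman-Wunsch recurrence). The function name and the scoring values (match = 1, mismatch = -1, gap = -2) are illustrative assumptions, not parameters taken from this work.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score under a linear gap penalty.

    The scoring values are illustrative assumptions. Each gapped position
    costs `gap`, so a gap of length k costs k * gap.
    """
    n, m = len(a), len(b)
    # Row 0: aligning the empty prefix of `a` with b[:j] needs j gap positions.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: aligning a[:i] with the empty prefix of `b`.
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] aligned to a gap
                curr[j - 1] + gap,  # b[j-1] aligned to a gap
            )
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Because the penalty is linear, each cell depends only on its three neighbors, which is what keeps the recurrence at O(nm) time and O(m) memory. More general gap functions typically require looking further back along each row and column, which is the efficiency cost discussed above.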
To a rough approximation, the DNA sequence alignment problem differs only marginally from the protein sequence alignment problem. At a superficial level, DNA alignment is over an alphabet of 4 letters, whereas protein alignment is over an alphabet of 20 letters. However, there are two key differences: (1) three DNA bases encode each amino acid, and (2) genes in DNA sequences, which are ultimately transcribed and translated into proteins, can be separated by intergenic regions of a few thousand base pairs that are not expressed and are possibly subject to strikingly different (or no) selection constraints. These intergenic regions can therefore typically vary to a greater extent from one species to another. Consequently, gap lengths in DNA alignments can be expected to be larger, more variable, and to have species-specific distributions. Moreover, the distributions characterizing the gap lengths may not be memoryless (i.e., may not follow exponential distributions). Thus, the traditional affine (or linear) gap functions used for aligning proteins may be unsatisfactory for DNA sequences, because the resulting alignments can be biologically misleading.
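The link between memoryless gap lengths and affine gap functions can be made explicit with a short derivation. The sketch below uses the geometric distribution, the discrete analogue of the exponential, and assumes a gap opens with probability $p_o$ and extends with probability $q$; the symbols $p_o$, $q$, $a$, $b$, and $w$ are introduced here for illustration only.

```latex
% Assumed gap-length model: a gap opens with probability p_o and then
% extends with probability q at each step (geometric, hence memoryless).
\begin{align*}
  \Pr(L = k) &= (1 - q)\, q^{k-1}, \qquad k = 1, 2, \dots \\
  \Pr(L > m + n \mid L > m) &= \frac{q^{m+n}}{q^{m}} = q^{n} = \Pr(L > n) \\
  w(k) &= -\log\!\bigl[\, p_o\, (1 - q)\, q^{k-1} \bigr]
        = \underbrace{-\log\bigl[p_o (1 - q)\bigr]}_{a}
          + (k - 1)\,\underbrace{(-\log q)}_{b}
\end{align*}
```

Under this assumption the negative log-probability of a gap of length $k$ is exactly affine in $k$, with opening cost $a$ and extension cost $b$. If the gap lengths instead followed a distribution that is not memoryless, for example $\Pr(L = k) \propto k^{-\alpha}$, the corresponding penalty would grow like $\alpha \log k$, which no affine function reproduces.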