Comparing large sets of data to detect regions of similarity among them has many practical applications. For example, natural-language word sequences appearing in stories, articles, term papers, etc., may be compared to determine whether material has been copied or plagiarized from one to another. Computer code listings can be compared to assess whether code in one listing has been copied from another. In the fields of biology and bioinformatics, sequences of genes, nucleotides, amino acids, etc., that appear in different biological structures may be compared to detect homology, which can indicate similar biological functions or that the sequences derive from a common origin.
Sequence alignment tools have incorporated many innovations that have greatly advanced detailed comparative genomics studies. For example, recent segmental duplications in mammalian genomes (with sequence identity above 90%) can be detected using BLAST or other sequence alignment tools.
However, many of the tools used in genomic applications rely on exact or inexact k-mers as homology seeds for local alignment extension. As a result, when these tools are used to detect homologous segments, segmental duplications, or homology-based phylogenetic distances, sensitivity and computational efficiency may have to be traded against each other as homology levels become lower.
To improve sensitivity, many of these tools rely on exhaustive searches for exact matches with short mers or inexact matches with longer mers. This approach can yield a large number of false positives, which may then require computationally expensive post-processing to filter out. Alternatively, more stringent search criteria (e.g., longer mers with more exact matches) can be used to improve efficiency. However, algorithms using such criteria may fail to detect low-homology regions such as ancient duplication events. In order to detect less-recent duplications, orthologous genes have been used as “anchors” to map out duplication blocks. However, this approach may not be suitable for identifying duplications that are not subject to a strong selection process, e.g., sequences containing only non-coding regions.
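The trade-off between seed length and sensitivity described above can be illustrated with a minimal Python sketch of exact k-mer seeding. It is not the implementation used by BLAST or any other particular tool. The function names `kmer_index` and `exact_seeds` and the two short sequences are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def exact_seeds(query, target, k):
    """Return (query_pos, target_pos) pairs at which an exact k-mer is shared."""
    index = kmer_index(target, k)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            seeds.append((i, j))
    return seeds


# Illustrative sequences (assumed): the query is the target with one substitution.
target = "ACGTTGCAAG"
query = "ACGTAGCAAG"

for k in (2, 4, 6):
    print(k, exact_seeds(query, target, k))
```

With these toy sequences, k = 2 yields eight seeds, one of which, (4, 8), lies off the true alignment diagonal and is therefore a spurious hit. With k = 4, three seeds are found, all on the diagonal. With k = 6, no seed is found, so the homologous region would be missed entirely despite a single mismatch.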