Genomic mutational events can cause populations of individuals to exhibit large degrees of genetic variability. Tandem duplication is an example of a genomic mutational event, in which a short sequence of DNA is duplicated and inserted into the genome.
Subsequent duplication events result in a genomic sequence having a repeated pattern of one or more nucleotides, in which the repetitions are adjacent to one another. Such tandem repeats can range from simple, short sequences (e.g., a dinucleotide repeat “ATATAT” or trinucleotide repeat “CAGCAGCAG”) to more complex repetitive sequences with patterns of tens to hundreds of nucleotides. Over time, individual copies within a tandem repeat may undergo additional mutation, resulting in the presence of approximate copies. Tandem repeats have been estimated to occur frequently (e.g., up to 10%) in genomic sequences.
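As an illustration of the definition above, the following minimal Python sketch reports exact tandem repeats such as “ATATAT” or “CAGCAGCAG” using a regular-expression backreference. The function name exact_tandem_repeats and its parameters (min_unit, max_unit, min_copies) are hypothetical names chosen for this example and do not come from any existing tool. The sketch finds only perfect copies of short patterns; it does not handle the approximate copies described above.

```python
import re

def exact_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=2):
    """Return (start, unit, copies) for each exact tandem repeat in seq.

    Illustrative assumptions: the sequence uses only A, C, G, T, the
    repeated unit is min_unit..max_unit nucleotides long, and every copy
    is identical (no mismatches or indels).
    """
    # ([ACGT]{m,n}?) lazily captures the shortest candidate unit;
    # \1{c,} then requires at least c further adjacent copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits

if __name__ == "__main__":
    print(exact_tandem_repeats("GGATATATCC"))     # dinucleotide repeat
    print(exact_tandem_repeats("TTCAGCAGCAGAA"))  # trinucleotide repeat
```

Each reported tuple gives the zero-based start position, the repeated unit, and the number of adjacent copies.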
Tandem repeats have been shown to cause human diseases and may play a variety of regulatory and evolutionary roles. Once characterized, they can also be important laboratory and analytical tools. For example, trinucleotide repeats are associated with a variety of diseases, such as fragile-X mental retardation, Huntington's disease, and myotonic dystrophy. Each of these diseases can result from a dramatic increase in the copy number of a trinucleotide sequence, from the normal range (e.g., tens of copies) to hundreds or thousands of copies. Tandem repeats may also alter the structure of a DNA molecule, which can change transcription and translation and ultimately affect gene expression. Further, tandem repeats are often polymorphic across a population, and thus provide a valuable tool for linkage analysis, DNA fingerprinting, and genealogical DNA testing. Identifying tandem repeats and annotating reference genomes with them is also important for next-generation sequencing alignments, in which many short sequence reads are mapped to a reference genome. An aligner that has information about which portions of the genome include tandem repeats will be better able to map sequence reads to those regions. Further, regions of the genome having tandem repeats are often misassembled, and knowing where such regions lie provides a useful clue to an aligner or variant caller. Thus, finding, annotating, and characterizing tandem repeats is an important task.
Despite their simple nature, the detection and accurate characterization of tandem repeats can be a challenging problem. Existing tandem repeat detection techniques, such as Tandem Repeats Finder (TRF), can identify tandem repeats in a given nucleotide sequence by looking for runs of k-mer matches. Such k-mer matches can be found by sliding a window of length k along a nucleotide sequence and noting positions at which identical k-mers occur. Using TRF, whenever a new position is added to a list, an earlier occurrence of the k-mer is identified and the distance between the two occurrences is calculated. This distance can be a possible pattern size for a tandem repeat. Distances can be compared to statistical criteria to generate a set of candidate tandem repeats. Each candidate can then be selected and aligned with the surrounding sequence in the genome to determine whether at least two copies of the pattern align. If an alignment is observed, a tandem repeat is reported. Although TRF does not require prior knowledge of the pattern or the size of the repeat (i.e., k), it is computationally intensive and may require an excessive amount of processing time for large genomes. Further, TRF does not appear to be able to identify certain challenging tandem repeats that have longer pattern lengths and/or include excessive variations. For example, TRF appears to be unable to identify tandem repeats that are longer than 2000 base pairs.
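The k-mer matching step described above can be sketched in a few lines of Python. This is a simplified illustration, not TRF's actual implementation: the function name candidate_pattern_sizes is hypothetical, the sketch assumes that each k-mer is compared only with its most recent earlier occurrence, and it omits the statistical criteria and the alignment step entirely. It simply tallies how often each distance between identical k-mers is observed, so that frequently observed distances can be treated as candidate pattern sizes.

```python
from collections import Counter

def candidate_pattern_sizes(seq, k=3):
    """Tally distances between consecutive occurrences of identical k-mers.

    Assumption for this sketch: only the most recent earlier occurrence of
    each k-mer is considered when computing a distance.
    """
    last_seen = {}         # k-mer -> start position of its latest occurrence
    distances = Counter()  # distance -> number of times it was observed
    # Slide a window of length k along the sequence.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            # Distance back to the earlier occurrence: a possible pattern size.
            distances[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    return distances

if __name__ == "__main__":
    counts = candidate_pattern_sizes("CAGCAGCAGCAG", k=3)
    print(counts.most_common(1))  # the dominant distance is 3
```

For the trinucleotide repeat “CAGCAGCAGCAG”, every k-mer recurs three positions after its previous occurrence, so the only distance recorded is 3, which matches the pattern size.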