Tandem repeats are dispersed throughout the genome, in and around gene regions. They were first identified as agents of disease ~20 years ago (Mirkin et al.), and since then, several microsatellite repeats (not all of which are triplets) have been identified as the underlying basis for a wide range of neurological and morphological disorders in humans and other mammals (Lopez et al., Orr et al., Albrecht et al. 2005). In addition to causing disease, microsatellites can exert subtle effects on gene function and quantitative traits (reviewed in Gemayel et al.). Repeats are also mutational hotspots: their instability can be triggered by nearly any aspect of DNA metabolism, and even by transcription or stress (reviewed in Fonville et al.). This sensitivity to repair defects and cellular insults makes repeats useful markers of genome instability and cancer (Glen et al., Loeb et al.). Further, analyzing repeats in personal genomes promises to benefit not just medical genetics and the diagnosis of repeat-related disorders but also forensics and genealogy, where shorter and more stable tandem repeats can serve as DNA fingerprints that uniquely identify individuals (Ge et al., Gymrek et al.). The utility of accurately and globally measuring tandem repeats therefore spans medicine, genetics and biotechnology: repeats influence clinical and subclinical phenotypes, are signatures of genomic instability and cancer, and are important markers for forensics and genealogy.
Despite their utility and biological importance, some repetitive sequences (particularly microsatellites) are challenging to study with short-read sequencing technologies, such as next-generation sequencing. Genotyping microsatellite repeats from reference-mapped reads is fundamentally distinct from calling SNPs or indels in non-repetitive sequence because there is no sound basis for inferring homology between pairs of aligned repeat units.
Current approaches for identifying repeat mutations include the indel genotyping methods implemented in popular software suites, such as GATK (DePristo et al.) and ATLAS2 (Challis et al.), which can reveal indels within repeat regions, and the recently reported lobSTR method (Gymrek et al.). Indel callers are ill-suited for identifying repeat mutations. Because they do not report repeat genotypes, they can base indel identification on reads that do not fully span the repeat. They also fail to account for the different error rates of different repeat types. The mutation rate of a microsatellite repeat is influenced not only by the length of the repeat tract but also by other intrinsic properties, such as the size of the repeated unit and the purity (lack of interruptions) of the repeated sequence (Legendre et al.). A genotyping method that incorporates the mutational properties of repeat sequences will be better able to distinguish false alleles from true heterozygosity. However, the success of a genotyping approach relies on more than just the accurate identification of true alleles; the method must also be applicable to the greatest number of loci genome-wide. The lobSTR method, for example, makes microsatellite calls genome-wide (Gymrek et al.); however, it is blind to homopolymer runs (i.e. mononucleotide repeats, which are a common and important source of genetic variation).
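As an illustration only, the intrinsic repeat properties named above (unit size, tract length and purity) and the requirement that a read fully span the repeat can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of any cited tool: the function names, the half-open coordinate convention, the `min_flank` parameter, and the definition of purity as the fraction of bases matching an uninterrupted repetition of the unit are all hypothetical choices made for the example.

```python
def repeat_properties(tract: str, unit: str) -> dict:
    """Summarize intrinsic properties of a repeat tract.

    Assumption: purity is defined here as the fraction of bases that
    match a perfect, uninterrupted repetition of `unit` of equal length.
    """
    if not tract or not unit:
        raise ValueError("tract and unit must be non-empty")
    # Build an uninterrupted tract of the same length for comparison.
    ideal = (unit * (len(tract) // len(unit) + 1))[: len(tract)]
    matches = sum(a == b for a, b in zip(tract, ideal))
    return {
        "unit_size": len(unit),               # 1 for a homopolymer run
        "tract_length": len(tract),           # length in bases
        "copies": len(tract) / len(unit),     # length in repeat units
        "purity": matches / len(tract),       # 1.0 means no interruptions
    }


def read_spans_repeat(read_start: int, read_end: int,
                      repeat_start: int, repeat_end: int,
                      min_flank: int = 5) -> bool:
    """Return True if the read covers the whole repeat plus flanking bases.

    Assumption: all coordinates are half-open reference intervals, and
    `min_flank` is a hypothetical number of anchoring bases on each side.
    """
    return (read_start <= repeat_start - min_flank
            and read_end >= repeat_end + min_flank)


if __name__ == "__main__":
    print(repeat_properties("CAGCAGCAGCAG", "CAG"))
    print(repeat_properties("CAGCAACAGCAG", "CAG"))
    print(read_spans_repeat(90, 190, 100, 130))
    print(read_spans_repeat(98, 190, 100, 130))
```

A read that ends inside the tract cannot show how many units the allele contains, which is why the second check matters before any repeat length is counted.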
Systems and methods that accurately genotype microsatellite repeats are needed.