Genotyping is the process of characterizing genetic variation in an individual, population, or ecological sample. Traditionally, genotyping has been performed by running biological assays and comparing the results to a reference genome, e.g., by restriction fragment length polymorphism identification (RFLPI) of genomic DNA, polymerase chain reaction (PCR), or DNA sequencing. More recently, genotyping has been transformed by the development of next-generation sequencing (NGS) technologies. The standard approach to variant discovery and genotyping from NGS data has been to map NGS sequence reads to a linear reference genome and identify positions where the sample differs from the reference.
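The mapping-based approach described above can be illustrated with a minimal sketch. It is a toy, not a production pipeline: the function names (`best_placement`, `call_variants`), the mismatch and depth thresholds, and the example sequences are all hypothetical, and real tools use indexed, gapped alignment and statistical genotype models rather than the brute-force ungapped search shown here.

```python
from collections import Counter, defaultdict


def best_placement(read, reference, max_mismatches=1):
    """Return (position, mismatches) for the best ungapped placement of
    `read` on `reference`, or None if no placement is within the limit.
    Ties are resolved in favor of the leftmost position."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def call_variants(reads, reference, min_depth=2):
    """Pile up mapped reads and report positions where the most common
    observed base differs from the reference base."""
    pileup = defaultdict(Counter)
    for read in reads:
        hit = best_placement(read, reference)
        if hit is None:
            continue  # read could not be placed on the reference
        pos, _ = hit
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if base != reference[pos] and count >= min_depth:
            variants.append((pos, reference[pos], base))
    return variants


# Hypothetical data: the sample carries a single A->T substitution
# at 0-based position 7 relative to the reference.
reference = "ACGTTGCAAGCTTAGGCTA"
reads = ["GTTGCTAGC", "TGCTAGCTT", "GCTAGCTTAG"]

print(call_variants(reads, reference))  # [(7, 'A', 'T')]
```

Each read is placed where it disagrees least with the reference, and a variant is reported only where enough reads support a non-reference base. A read that diverges from the reference by more than the allowed number of mismatches is simply dropped in this sketch.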
Mapping NGS reads to a linear reference genome has several limitations. First, the sample may contain sequences that are absent from, or highly divergent from, the reference genome, e.g., as a result of horizontal transfer events in microbial genomes, or at highly diverse loci such as the classical HLA genes. In those cases, short reads either cannot be mapped or are unlikely to map correctly to the reference. Second, reference sequences (particularly those of higher eukaryotes) are often incomplete, notably in telomeric and pericentromeric regions. Reads from the missing regions will often map, sometimes with apparently high certainty, to paralogous regions, potentially leading to false variant calls. Third, methods for variant calling from mapped reads typically focus on a single variant type; where variants of different types cluster, this focus can lead to errors, for example, through incorrect alignment around indel polymorphisms. Fourth, although there are methods for detecting large structural variants, these cannot determine the exact location, size, or allelic sequence of the variants. Finally, NGS mapping approaches typically ignore prior information about genetic variation within the species.