The sequencing technology invented by Fred Sanger has been the mainstay of sequencing approaches since its inception in 1977, culminating in the release of the draft human genome sequence in 2000 by the Human Genome Project and Celera Genomics. Sanger sequencing remains a valuable and viable research tool, but the challenge of interpreting instrument signals to produce high-quality biological results remains.
In basic operation, Sanger sequencing equipment produces an electropherogram: a line plot with four traces running horizontally, whose vertical axis records the amplitudes that reflect the level of detection of the four measured genomic bases: G, A, T, C. The horizontal axis records regularly spaced digitized samples taken as the sample medium flows through the detection column, carrying with it the genomic (DNA) content. With high-quality traces, the central region of the electropherogram shows a sequence of well-formed (Gaussian) peaks that are evenly spaced and have consistent amplitude, standing well above a low murmur of background noise. In some cases, two peaks rise simultaneously at the same position, typically to a lower height than normal. This pattern indicates a mixed-base observation.
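As a rough illustration of how a mixed-base observation might be recognized from the four trace amplitudes at one peak position, consider the following minimal sketch. The function name `bases_at_peak` and the `secondary_ratio` and `noise_floor` thresholds are hypothetical values chosen for illustration; they are not taken from any particular base caller.

```python
def bases_at_peak(amplitudes, secondary_ratio=0.5, noise_floor=50):
    """Return the base or bases observed at one peak position.

    amplitudes: dict mapping each of 'G', 'A', 'T', 'C' to its trace
        height at this position.
    secondary_ratio: assumed threshold; a second trace at or above this
        fraction of the tallest trace is treated as a real second base.
    noise_floor: assumed threshold; if even the tallest trace is below
        this height, no call is made.
    """
    # Rank the four channels from tallest to shortest.
    ranked = sorted(amplitudes.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top_amp), (second_base, second_amp) = ranked[0], ranked[1]

    if top_amp < noise_floor:
        return ()  # nothing rises above the background noise
    if second_amp >= secondary_ratio * top_amp:
        return tuple(sorted((top_base, second_base)))  # mixed base
    return (top_base,)  # single clean peak


clean = bases_at_peak({"G": 12, "A": 900, "T": 20, "C": 15})
mixed = bases_at_peak({"G": 10, "A": 450, "T": 15, "C": 400})
```

A fixed ratio like this is exactly the kind of rule that breaks down on noisy traces, which is part of why the interpretation problem is hard.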
The ends of these traces record the initial introduction of the sample and its conclusion, and they are virtually always of lower quality: the cadence is irregular, the amplitude is irregular and frequently lowered, and the background noise is higher. In many cases the peaks are no longer Gaussian in shape and may overlap. Quality trimming is generally applied to the traces to remove these low-quality regions, but it is often a compromise between preserving sequence and rejecting noise.
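One common style of quality trimming can be sketched as a sliding window over per-base quality scores. This sketch assumes that such scores (for example, Phred-like values) are already available for each called base, which the text above does not specify; the function name `trim_by_quality`, the window size, and the threshold are likewise illustrative assumptions.

```python
def trim_by_quality(qualities, window=10, min_mean=20):
    """Return (start, end) half-open indices of the region to keep.

    The kept region runs from the start of the first window whose mean
    quality is at least min_mean to the end of the last such window.
    Returns (0, 0) if no window qualifies.
    """
    n = len(qualities)
    if n < window:
        return (0, 0)

    # Start indices of every window that meets the quality threshold.
    good = [
        i for i in range(n - window + 1)
        if sum(qualities[i:i + window]) / window >= min_mean
    ]
    if not good:
        return (0, 0)
    return (good[0], good[-1] + window)


# Toy read: poor ends around a high-quality centre.
scores = [5] * 10 + [40] * 30 + [5] * 10
kept = trim_by_quality(scores)
```

On this toy read the kept region extends a few bases into each low-quality end, which shows the compromise directly: a stricter threshold would reject more noise but also discard usable sequence.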
Low-quality traces can result from sample contamination, low-quality primers, and many other experimental conditions. These conditions can also introduce irregularities into otherwise high-quality traces. Ink blobs (also called dye blobs) are a characteristic example of such an irregularity: they produce hugely exaggerated peaks that may span multiple underlying high-quality peaks.
Many researchers continue to trust expert human inspection of Sanger electropherograms over automated processing. The many features of an electropherogram are hard to capture algorithmically without a large body of highly qualified data with which to develop and tune the algorithms.
Humans are diploid organisms, meaning that we carry two copies of each autosomal chromosome and therefore two copies of most genes. In many cases these copies are heterozygous, that is, they carry two different alleles of the same gene. Sanger sequencing reflects this heterozygous condition with a combination of base values at the positions where the alleles differ: one allele may have an “A” base while the other has a “C” base, and this is represented as the mixed base “M”.
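The mixed-base notation can be made concrete with a short sketch that combines two allele sequences into the single sequence a Sanger read would report. The two-base symbols below are the standard IUPAC ambiguity codes; the function names `mixed_base` and `consensus` are hypothetical, and the sketch assumes the two alleles are the same length and already aligned.

```python
# Standard IUPAC symbols for two-base ambiguities, keyed by the two
# bases in alphabetical order.
TWO_BASE_CODES = {
    "AG": "R", "CT": "Y", "CG": "S",
    "AT": "W", "GT": "K", "AC": "M",
}


def mixed_base(a, b):
    """Return the single symbol representing bases a and b together."""
    if a == b:
        return a  # homozygous position: the base itself
    return TWO_BASE_CODES["".join(sorted((a, b)))]


def consensus(allele_1, allele_2):
    """Combine two aligned, equal-length alleles into one mixed-base sequence."""
    if len(allele_1) != len(allele_2):
        raise ValueError("alleles must have equal length")
    return "".join(mixed_base(a, b) for a, b in zip(allele_1, allele_2))


read = consensus("GATTACA", "GCTTACA")
```

Here the alleles differ only at the second position (A versus C), so that position is reported as M and every other position is unchanged.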
A goal of genomic analysis is to recover the true sequence of both alleles in the heterozygous case from Sanger sequencing. However, the information available at secondary analysis is inadequate for accurately assigning multiple heterozygous observations to alleles. Additionally, many mixed-base observations result from instrument noise rather than true biological variation. Most algorithms provide no mechanism for accurately aligning mixed-base sequences to a reference sequence.
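The assignment problem can be illustrated by enumerating every pair of allele sequences that is consistent with a given mixed-base read. The sketch below is a toy illustration, not a phasing method: it handles only the two-base IUPAC codes, ignores noise-induced mixed bases, and uses the hypothetical function name `possible_allele_pairs`.

```python
from itertools import product

# Standard IUPAC two-base ambiguity codes and the bases they stand for.
TWO_BASE_EXPANSION = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}


def possible_allele_pairs(mixed_seq):
    """List every unordered allele pair consistent with mixed_seq."""
    het_positions = [
        i for i, symbol in enumerate(mixed_seq) if symbol in TWO_BASE_EXPANSION
    ]
    pairs = set()
    # At each heterozygous position, either base may belong to the
    # first allele; the other base then belongs to the second allele.
    for choice in product((0, 1), repeat=len(het_positions)):
        first, second = list(mixed_seq), list(mixed_seq)
        for pos, pick in zip(het_positions, choice):
            bases = TWO_BASE_EXPANSION[mixed_seq[pos]]
            first[pos], second[pos] = bases[pick], bases[1 - pick]
        # Sort within the pair so that swapped alleles count once.
        pairs.add(tuple(sorted(("".join(first), "".join(second)))))
    return sorted(pairs)


candidates = possible_allele_pairs("AMGRT")
```

With two heterozygous positions there are already two equally valid allele pairs, and the count doubles with each additional heterozygous position. The read alone carries nothing that selects among them.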