Much information about a person's health is encoded in their DNA. Next-generation sequencing (NGS) technologies rapidly translate that information from its natural, biological format into files of sequence data that can be examined for disease-associated mutations and other features. However, as DNA sequencing technologies become faster, cheaper, and more accurate, the results they produce can become difficult to analyze.
A researcher or medical professional must now often make sense of raw sequence data that is more complex than the linear sequence of “a gene” or even “a genome”. Most genomic sequencing produces millions of reads that must be assembled before the data can be interpreted. Due to heterozygosity, somatic mutations, repeated genetic elements, structural variants, sequencing errors, or other factors, sequence reads can be assembled in many ways, some of which carry little informational content or are even misleading. Moreover, genomic sequencing is often more complex than simply sequencing an individual's genome. For example, researchers will study whole populations of related subjects, or will need to compare the results of one study with those of another. Unfortunately, comparing one set of results with another often requires data-limiting simplifications. For example, reads are often assembled and then reduced to a consensus sequence for comparison to a reference, thus potentially ignoring sources of heterogeneity within those reads.
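The loss described above can be illustrated with a minimal sketch. It assumes the reads are already aligned and of equal length, and it uses a simple majority vote per position as a stand-in for consensus calling; the function names and the toy reads are hypothetical and are not taken from any particular pipeline.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over equal-length, pre-aligned reads (assumed input)."""
    columns = zip(*aligned_reads)
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

def heterogeneous_positions(aligned_reads):
    """Return 0-based positions at which the reads disagree."""
    return [i for i, col in enumerate(zip(*aligned_reads)) if len(set(col)) > 1]

# Toy data: three reads carry G at position 2, two carry C (e.g. a heterozygous site).
reads = ["ACGTA", "ACGTA", "ACGTA", "ACCTA", "ACCTA"]

print(consensus(reads))                # the minority allele C is absent from the consensus
print(heterogeneous_positions(reads))  # the disagreement is visible only in the raw reads
```

In this toy case the consensus reports a single base at the variable position, so any downstream comparison against a reference would not see that two of the five reads supported a different allele.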
Some attempts have been made to represent genetic information using a data structure known as a directed acyclic graph (DAG). However, while a DAG can potentially represent known instances of heterogeneity, simply having a DAG does not address the problem of what to do with numerous complex sets of sequence data.
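For illustration only, a DAG of this kind might be sketched as follows. The node labels, the adjacency-list layout, and the `enumerate_paths` helper are assumptions chosen for the example and do not describe any specific prior-art system. Each node holds a sequence fragment, and alternative alleles appear as parallel branches between shared nodes.

```python
def enumerate_paths(graph, labels, node, sink):
    """Return every sequence spelled by a path from node to sink in a DAG."""
    if node == sink:
        return [labels[node]]
    return [
        labels[node] + rest
        for nxt in graph[node]
        for rest in enumerate_paths(graph, labels, nxt, sink)
    ]

# Hypothetical toy graph: a shared prefix, two alternative bases, a shared suffix.
labels = {0: "AC", 1: "G", 2: "C", 3: "TA"}
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}

paths = enumerate_paths(graph, labels, 0, 3)
print(paths)  # both alleles are retained as distinct paths
```

The structure keeps both variants, but the sketch says nothing about how large numbers of reads would be mapped onto or compared through such a graph, which is the gap noted above.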