The science of bioinformatics applies sophisticated analytic techniques to biological data, such as genome sequences, to better understand the underlying biology. To benchmark and refine bioinformatic techniques, they are applied to known “reference” sequences, e.g., a particular genomic region or even a complete genome. These sequences may be completely empirical, i.e., obtained by biochemically sequencing actual genomes from a population of organisms. Although validation of a bioinformatic technique with empirical data is ultimately essential, because such data represents real organisms, there may simply not be enough of it. For example, sequence data for particular genomic regions in a specific subpopulation of interest (e.g., African American veterans with a history of heart disease) may not be widely available, at least not from many individuals, due both to resource limitations and patient-privacy concerns.
If some true sequences are known along with the statistical principles underlying intrasequence variation among individuals within a group, it is possible to generate simulated data for benchmarking and analysis purposes. A small amount of real data, in other words, can be used to generate a large amount of simulated data with reasonable fidelity to the biology of the subpopulation. Therefore, computer simulation of genetic and genomic data has become increasingly popular for assessing and validating biological models or for gaining an understanding of specific data sets.
Human genomic variants are nucleotide sequences that differ from the human reference genome, a sequence of over 3 billion nucleotides represented by the letters A, T, C, and G. Genomic variant types include single-nucleotide polymorphisms, structural variants, insertions, and deletions (see definitions below). The variants of a subpopulation define its particular genomic characteristics, and the objective of a simulation is to preserve the frequency of these variants in the simulated data. Current simulation frameworks are limited in the types of variants they incorporate, in their scalability, accuracy, and speed, and/or in their support for relatively small subpopulations within a larger group. Accordingly, there is a need for techniques and systems that generate statistically valid simulated genomic data respecting variant patterns within a subpopulation and that overcome or mitigate these limitations.
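By way of illustration only, the objective of preserving variant frequencies may be sketched in a few lines of Python. The sketch is not the technique described herein. It assumes, hypothetically, that each individual is encoded as a list of 0/1 flags (1 indicating that a variant is present at a site) and that sites vary independently of one another, which ignores correlations between nearby variants. The function names `variant_frequencies` and `simulate_haplotypes` and the sample data are invented for the example.

```python
import random

def variant_frequencies(haplotypes):
    """Return, for each site, the fraction of sequences carrying the variant."""
    n = len(haplotypes)
    return [sum(site) / n for site in zip(*haplotypes)]

def simulate_haplotypes(freqs, count, seed=0):
    """Draw `count` simulated sequences, sampling each site independently.

    Assumption: sites are independent; real genomes show correlation
    between nearby variants, which this sketch does not model.
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [[1 if rng.random() < f else 0 for f in freqs] for _ in range(count)]

# Hypothetical "real" data: 4 individuals, 4 variant sites (1 = variant present).
observed = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

freqs = variant_frequencies(observed)            # per-site frequencies in the real data
simulated = simulate_haplotypes(freqs, count=10_000)
simulated_freqs = variant_frequencies(simulated)  # should lie close to freqs
```

In this sketch, four observed sequences yield ten thousand simulated ones whose per-site variant frequencies approximate those of the observed data, which is the sense in which a small amount of real data can seed a large simulated set.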