The number of fully sequenced genomes continues to grow, and with it our understanding of human genetic variation. For example, the 1000 Genomes Project is an international collaboration that seeks to provide a comprehensive description of common human genetic variation by performing whole-genome sequencing of a diverse set of individuals from multiple populations. To that end, the 1000 Genomes Project has sequenced the genomes of over 2,500 anonymous individuals from about 25 populations around the world. See “A global reference for human genetic variation”, Nature 526, 68-74 (2015). This has led to new insights regarding the history and demography of ancestral populations, the sharing of genetic variants among populations, and the role of genetic variation in disease. Further, the sheer number of genomes has greatly increased the resolution of genome-wide association studies, which seek to link traits and diseases with specific genetic variants.
The current standard format for storing and representing human genetic variation information is Variant Call Format (VCF). VCF is a text file format that stores information about genetic variation as a list of variations from a reference, such as a human reference genome. A VCF file contains meta-information lines, a header line, and then a plurality of data lines, each data line containing information about a particular position in the reference sequence that exhibits variation. For example, a data line can include the reference nucleotide sequence at that position and a list of known alternative alleles. The data line can further include information regarding the genotypes of a plurality of individuals at that position with respect to the reference sequence and alternative alleles. For a diploid individual, a genotype is expressed as a pair of allele indices, one per haplotype: “0/0” indicates that the individual is homozygous for the reference sequence at that position; “0/1” indicates that the individual is heterozygous and has one chromosome with the alternative allele; and “1/1” indicates that the individual is homozygous for the alternative allele.
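The data-line layout and genotype notation described above can be sketched in a few lines of Python. This is a minimal illustration, not a full VCF parser: the sample data line and its values are hypothetical, the helper names `parse_data_line` and `classify_genotype` are invented for this sketch, and it assumes a tab-delimited line with the standard eight fixed columns followed by a FORMAT column and one column per sample.

```python
# Hypothetical VCF data line with three samples (values are illustrative only).
# Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 sample3
VCF_LINE = "20\t14370\t.\tG\tA\t29\tPASS\t.\tGT\t0/0\t0/1\t1/1"


def classify_genotype(gt: str) -> str:
    """Classify a GT string such as '0/1' (unphased) or '1|0' (phased)."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


def parse_data_line(line: str) -> dict:
    """Split one data line into its position, alleles, and per-sample genotypes."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # FORMAT lists the per-sample keys; find where GT sits among them.
    gt_index = fields[8].split(":").index("GT")
    genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alternative alleles
        "genotypes": genotypes,
    }


record = parse_data_line(VCF_LINE)
calls = [classify_genotype(gt) for gt in record["genotypes"]]
print(record["chrom"], record["pos"], record["ref"], record["alt"], calls)
```

Because every sample adds another text column to every data line, the cost of splitting and scanning each line grows with the number of individuals in the file.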
VCF is an expressive format that can accommodate multiple samples and is widely used in the community. However, because VCF is a text-based format, VCF files are large and slow to parse, especially as the number of genomes in a file increases. File size can be reduced via compression, but compression introduces additional overhead that makes working with VCF files even more resource-intensive. A more efficient format is BCF, the binary counterpart of VCF, which encodes VCF fields in a binary form that both reduces the storage space required and speeds up access. For example, BCF can encode a genotype for an individual using only two bytes of information (e.g., “0/1” as “0x02 0x04”). BCF files can be compressed (e.g., by BGZF compression) to reduce their size further; however, as with VCF, compression introduces overhead that can slow queries. In practice, it is often more convenient and practical to process and stream BCF files uncompressed.
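The two-byte genotype encoding mentioned above can be illustrated as follows. This is a sketch under stated assumptions rather than a reference implementation: it assumes the BCF2 convention in which each allele index is stored as one byte computed as (index + 1) shifted left by one bit, with the low bit marking an allele that is phased relative to the preceding one. The function name `encode_genotype` is hypothetical, and missing alleles and allele indices too large for a single byte are deliberately not handled.

```python
import re


def encode_genotype(gt: str) -> bytes:
    """Encode a GT string such as '0/1' into one byte per allele.

    Assumed rule (BCF2 convention): byte = ((allele_index + 1) << 1) | phased,
    where `phased` is 1 only when the allele is preceded by a '|' separator.
    """
    tokens = re.split(r"([/|])", gt)  # '0|1' -> ['0', '|', '1']
    encoded = []
    for i in range(0, len(tokens), 2):
        allele_index = int(tokens[i])
        phased = i > 0 and tokens[i - 1] == "|"
        encoded.append(((allele_index + 1) << 1) | int(phased))
    return bytes(encoded)


for gt in ("0/0", "0/1", "1/1"):
    print(gt, "->", " ".join(f"0x{b:02x}" for b in encode_genotype(gt)))
```

Under these assumptions, “0/1” maps to the bytes 0x02 0x04, matching the example in the text, and every diploid genotype occupies exactly two bytes regardless of which alleles it carries.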
BCF seeks to maximize the efficiency of storing and accessing variant information. However, the storage space required for the format scales linearly with the number of included individuals. Using two bytes per genotype, a single person requires approximately 154 megabytes of storage. One hundred people require about 15 gigabytes, and one thousand require about 150 gigabytes. As the 1000 Genomes Project has shown, increasing the number of genomes by orders of magnitude greatly improves the power of analysis. Using uncompressed BCF at two bytes per genotype without any additional metadata, ten thousand people would require about 1.47 terabytes, one million people would require about 147 terabytes, and ten million people would require about 1.44 petabytes. At this scale, computing resources can become too costly, and simply querying the data set can take an extraordinary amount of time, impeding meaningful analysis. Accordingly, there is a need for improvements in storing variant information for population-sized data sets that do not suffer from the limitations of the approaches described above.
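The linear scaling described above can be checked with a short calculation. The sketch below takes the 154-megabyte per-person figure from the text as its only input and assumes that the units are binary (1 MB = 1024 × 1024 bytes); the names `storage_bytes` and `human_readable` are hypothetical. Under that assumption the results agree with the figures in the text to within rounding.

```python
BYTES_PER_GENOTYPE = 2
# Per-person figure from the text, read as binary megabytes (an assumption).
BYTES_PER_PERSON = 154 * 1024**2

# Number of variant sites implied by two bytes per genotype.
IMPLIED_SITES = BYTES_PER_PERSON // BYTES_PER_GENOTYPE


def storage_bytes(num_people: int) -> int:
    """Uncompressed genotype storage, ignoring headers and other metadata."""
    return num_people * BYTES_PER_PERSON


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary (1024-based) units."""
    units = ["bytes", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


for people in (1, 100, 1_000, 10_000, 1_000_000, 10_000_000):
    print(f"{people:>10,} people: {human_readable(storage_bytes(people))}")
```

The implied site count of roughly 80.7 million is a consequence of the assumed unit convention, not a figure stated in the text.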