Genome-wide association studies (GWAS) are now a standard tool that is accessible, in terms of cost and laboratory effort, to research groups, and this accessibility is creating imminent demand for high-throughput sequencing data. As a result, the data-management requirements for analysis have changed fundamentally. In the days of candidate-gene and linkage analyses, “only” up to several thousand genetic loci had to be stored and loaded into analysis packages, whereas current GWAS provide genetic information on several million genetic loci. For high-throughput sequencing studies, the number of genetic loci will increase again by several orders of magnitude.
The standard data format for storing genetic information for analysis in software packages such as FBAT, PBAT, or PLINK is the pedigree file. Pedigree files contain the necessary family/pedigree information and the genetic data for the genotyped loci in the study. They are usually human-readable ASCII text files. For GWAS, however, such pedigree files are often impractically large, which not only creates storage challenges but can also become a source of data-management errors. A typical pedigree file is 1 to 30 gigabytes for common-variant data (e.g., GWAS data) and can reach several terabytes for sequence data. Merely loading such large files can take up to a few hours. The result is a great waste of disk space and computation time. Because genome-wide analysis is so popular, this problem is encountered routinely, and only a few alternatives are available. As high-throughput sequencing data becomes standard in research, the problem will grow even worse.
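The file sizes quoted above can be roughly checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name, the assumption of about four bytes per genotype (two allele characters plus separators), the 32 bytes of per-individual pedigree columns, and the study dimensions are all hypothetical and are not taken from any particular software package.

```python
def pedigree_text_size_bytes(n_individuals, n_loci,
                             bytes_per_genotype=4,
                             header_bytes_per_row=32):
    """Estimate the size of a text pedigree file with one row per individual.

    Assumptions (illustrative, not from any specific tool):
      - each genotype takes about 4 bytes, e.g. "1 2 "
      - each row starts with about 32 bytes of family/pedigree columns
    """
    return n_individuals * (header_bytes_per_row + n_loci * bytes_per_genotype)


# Hypothetical GWAS: 2,000 individuals genotyped at 1,000,000 loci.
size = pedigree_text_size_bytes(2_000, 1_000_000)
print(f"{size / 1e9:.1f} GB")
```

Under these assumptions the estimate comes to roughly 8 gigabytes, which falls within the 1 to 30 gigabyte range stated for common-variant data.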
One possible solution is to use general-purpose compression software, such as gzip. However, such software is not designed specifically for genotype data and its analysis: the data must be decompressed every time it is accessed. In addition, general-purpose compression software typically achieves a compression factor of only five to six and does not support parallel processing. Several solutions that are better tailored to genetic data compression have been proposed. For example, the freely available PLINK and PBAT software packages, which are whole-genome association analysis toolsets, have introduced binary PED formats. These formats require only two bits to store the information for one genotype. Furthermore, sophisticated compression techniques designed specifically for sequence data (e.g., DNAzip) have been proposed that can achieve excellent compression rates. However, these techniques generally require a reference human genome and a reference single-nucleotide polymorphism (SNP) map, which entails a large storage overhead.
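To make the two-bits-per-genotype idea concrete, the following minimal sketch packs four genotypes into each byte. The genotype codes used here (0 = homozygous for the first allele, 1 = heterozygous, 2 = homozygous for the second allele, 3 = missing) and the function names are assumptions chosen for illustration; they are not the actual on-disk layout of the PLINK or PBAT binary formats.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (integers 0..3) into bytes, four genotypes per byte.

    The code assignment (0, 1, 2, 3 = missing) is a hypothetical choice for
    this sketch, not the encoding used by any specific tool.
    """
    packed = bytearray((len(genotypes) + 3) // 4)  # round up to whole bytes
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        # Genotype i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover the first `count` genotype codes from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


genotypes = [0, 1, 2, 3, 1]
packed = pack_genotypes(genotypes)
assert unpack_genotypes(packed, len(genotypes)) == genotypes
print(len(packed), "bytes for", len(genotypes), "genotypes")
```

With this layout, one million genotypes for one individual occupy 250,000 bytes, compared with several megabytes when each genotype is written as text.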
Accordingly, there is a need for more efficient compression and storage techniques tailored to genetic data.