When a couple wants to have children, they may turn to genetic screening to determine whether either member is a carrier of a genetic condition. Genetic carrier screening can be done using next-generation sequencing (NGS) technology, which produces a large number of independent reads, each representing anywhere between 10 and 1000 bases of nucleic acid in a person's genome. Nucleic acids are generally sequenced redundantly so that each gene segment is covered a number of times for confidence (e.g., “10× coverage” or “100× coverage”). Thus, a multi-gene genetic screen can produce millions of reads stored in very large sequence read files.
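As a rough illustration of the scale involved, mean coverage can be approximated as the number of reads multiplied by the read length, divided by the length of the targeted region. The Python sketch below applies that relation. The function names and the example figures (a 1,000,000-base target, 100-base reads) are hypothetical and chosen only for illustration; they do not describe any particular screening assay.

```python
import math


def mean_coverage(num_reads: int, read_length: int, target_bases: int) -> float:
    """Approximate mean coverage: total sequenced bases / targeted bases."""
    if target_bases <= 0:
        raise ValueError("target_bases must be positive")
    return num_reads * read_length / target_bases


def reads_needed(target_bases: int, coverage: float, read_length: int) -> int:
    """Smallest read count whose mean coverage meets the requested depth."""
    if target_bases <= 0 or read_length <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_bases * coverage / read_length)


if __name__ == "__main__":
    # Hypothetical example: 1,000,000 targeted bases, 100-base reads.
    for depth in (10, 100):
        n = reads_needed(target_bases=1_000_000, coverage=depth, read_length=100)
        print(f"{depth}x coverage needs about {n:,} reads")
```

Under these assumed figures, moving from 10× to 100× coverage raises the read count tenfold, which is why redundant sequencing of even a modest gene panel quickly reaches millions of reads. Real runs also lose reads to off-target capture and quality filtering, so the estimate is a lower bound.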
Storing and transferring the immense amount of sequencing data generated by NGS technologies poses considerable challenges. In fact, the cost of file storage and transfer may be a bottleneck that poses a significant obstacle to personalized medicine (see, e.g., Deorowicz, 2013, Data compression for sequencing data, Algorithms for Molecular Biology 8:25). Existing methods for compressing sequencing data are not satisfactory because they create binary files that are not human-readable, are lossy, or are inextricably tied to other specialized alignment or reference-mapping programs (see Bonfield, 2013, Compression of FASTQ and SAM format sequencing data, PLoS One 8(3):e59190).