1. Technical Field
The present disclosure relates generally to methods and apparatus for compressing and decompressing genetic information and more particularly to systems and methods for compressing and decompressing sequencing information obtained using a next generation sequencing (NGS) platform or methodology.
2. Description of the Related Art
Parallel sequencing and next generation sequencing (NGS) platforms are rapidly transforming data collection and analysis in genome, epigenome, and transcriptome research fields. NGS technologies have opened fascinating opportunities in life sciences. New fields and applications in biology and medicine are becoming a reality, beyond genomic sequencing.
One application of NGS technologies is variant analysis, in which sequencing reads are aligned to a reference genome. Owing to the high coverage that NGS technologies provide, mutations such as SNPs (single-nucleotide polymorphisms), CNVs (copy-number variations), and the like can be detected with high accuracy. These variations can then be analyzed and studied for possible association with pathological conditions such as cancer and diabetes. This has brought personalized healthcare and medicine closer to reality. In a personalized medicine scenario, immediate access to genomic data in specific regions, for example, genes, exons, and the like, is of great importance because it allows fast and accurate processing of the data to detect the mutations or variations of interest.
The number of sequencing reads in NGS files can range from hundreds of millions to billions, depending on the species sequenced and the coverage, leading to file sizes on the order of megabytes (MB) to gigabytes (GB). NGS technology generates huge amounts of genomic data along with multiple annotations, for example, quality scores and other meta-information such as read identifiers, instrument names, flow cell lanes, and so on. The constantly increasing throughput poses challenges for the storage, analysis, and management of the sequencing data. NGS data formats available at present require indexing to allow immediate access to data in specific genomic regions, which adds to the existing problem of managing huge data sizes.
Several compression methods exist for NGS data. However, most of them do not provide access to the specific sequencing reads that correspond to a given position in the genome. As a result, the file must be completely decompressed before an analysis can be performed, even if the target is only a small region of the reference genome.
In view of the above discussion, it is desirable to provide a mechanism that compresses and stores the NGS reads aligned to a reference sequence and provides random access to the reads relative to the reference genome. Furthermore, it is desirable to provide a mechanism through which the reads are selectively decompressed without decompressing the entire file.
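The general idea of random access with selective decompression can be illustrated with a minimal sketch. It is not the mechanism of the present disclosure; it only assumes one common approach, in which position-sorted reads are compressed in independent blocks and a small table records the first alignment position of each block. The names compress_reads, fetch_reads, and BLOCK_SIZE are hypothetical, zlib stands in for any general-purpose compressor, and a read is simplified to a (position, sequence) pair selected by its alignment start.

```python
import bisect
import zlib

BLOCK_SIZE = 2  # reads per block; deliberately tiny for illustration (assumption)


def compress_reads(reads, block_size=BLOCK_SIZE):
    """Compress position-sorted (position, sequence) reads in independent blocks.

    Returns (index, blocks), where index[i] is the first alignment
    position stored in blocks[i].
    """
    reads = sorted(reads)
    index, blocks = [], []
    for i in range(0, len(reads), block_size):
        chunk = reads[i:i + block_size]
        payload = "\n".join(f"{pos}\t{seq}" for pos, seq in chunk).encode("ascii")
        index.append(chunk[0][0])
        blocks.append(zlib.compress(payload))
    return index, blocks


def fetch_reads(index, blocks, start, end):
    """Return reads whose alignment start lies in [start, end).

    Only blocks that may contain such reads are decompressed; the second
    return value is the number of blocks that were decompressed.
    """
    # The block just before the first block starting at or after `start`
    # may still hold reads inside the query range, so begin one block earlier.
    first = max(bisect.bisect_left(index, start) - 1, 0)
    hits, touched = [], 0
    for i in range(first, len(blocks)):
        if index[i] >= end:
            break  # every later block starts past the query range
        touched += 1
        for line in zlib.decompress(blocks[i]).decode("ascii").split("\n"):
            pos, seq = line.split("\t")
            if start <= int(pos) < end:
                hits.append((int(pos), seq))
    return hits, touched


if __name__ == "__main__":
    reads = [(100, "ACGT"), (105, "CGTA"), (230, "GGCA"),
             (231, "TTAG"), (400, "ACCA"), (512, "GATT")]
    index, blocks = compress_reads(reads)
    hits, touched = fetch_reads(index, blocks, 400, 600)
    print(hits, f"decompressed {touched} of {len(blocks)} blocks")
```

In this sketch the block size trades compression ratio against access granularity: larger blocks generally compress better but force more unrelated reads to be decompressed for each query.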