1. Field
The present disclosure relates to data compression in next generation sequencing (NGS) in general, and more particularly to a mechanism for generating a pileup file from compressed NGS genomic data.
2. Description of the Related Art
In computational biology, next generation sequencing (NGS) refers to new, high-throughput technology for sequencing DNA or RNA. NGS may be used to analyze the genome of an individual or a collection of individuals to, for example, comprehensively catalog genetic variation in population samples. The NGS-based diagnostics may have a significant impact on prescribing effective treatment to an individual. Such personalization is often based on a set of mutations obtained from analyzing an individual's DNA data through an NGS analysis pipeline. The mutations that characterize the individual's disease help clinicians tailor therapy to that individual. Typically, the NGS methods amplify the DNA molecule being sequenced and divide the replicates into smaller strands called reads (made up of few tens to few thousands of contiguous base pairs). These reads are sequenced and the output is stored in a FASTQ file (unaligned NGS sequence reads+quality data is stored in a FASTQ file).
To analyze the genomic data obtained through NGS sequencing, the reads are first aligned to a reference (indicates a reference standard of the genomic data, which may be understood as the genomic data that represents each species) and then stored in a Sequence Alignment Map (SAM) file. Corresponding to each read, the SAM file has multiple fields such as the read sequence, quality values, read-level quality value, alignment location relative to the reference and a Compact Idiosyncratic Gapped Alignment Report (CIGAR) string. The CIGAR string contains a presentation of differences between the read and the reference. The SAM file may range from several megabytes (MBs) to gigabytes (GBs) in size. Analysis of the genomic data requires steps such as variation calling, which requires a pileup file of the genomic data to be analyzed.
In general, analysis or processing of compressed genomic data such as pileup file generation and variation calling is performed on a binary SAM file called a Binary Alignment Map (BAM) file. The BAM file size may increase up to a few MBs to a few GBs. In the case that the SAM data is compressed, in order to perform variation calling and generate a pileup file, the compressed genomic data of the SAM file is first decompressed and then converted into the BAM format before invoking the pileup and variation calling. This consumes a large amount of memory space and processing power, thereby increasing time for genomic data analysis.