As DNA sequencing costs have plummeted over the last few years, the volume of raw data generated by sequencing has grown exponentially and is now measured in petabytes, making analysis and transfer of this data difficult. These data volumes create a critical bottleneck in the DNA sequencing workflow, one that has previously been addressable only by throwing increasing numbers of ever more powerful CPU cores at the problem. However, because the rate at which sequencing produces data already far outpaces Moore's Law, this solution has very limited sustainability.
The massively parallel approach of next-generation sequencing (NGS) requires a human reference genome to reconstruct the patient's genome from the raw read data. The human reference genome has become essential for clinical applications, where it is used to identify alleles associated with risk, protection, or treatment-specific response in human disease. Yet the current reference genome, GRCh38, is based on a limited number of samples, and so it neither adequately represents the full range of human diversity nor is complete. Further, the existing approach followed by the Genome Reference Consortium (GRC) and the genomics industry, which is to construct a “static” reference genome, introduces biases into the standard bioinformatic pipelines used to detect the unique complement of variants in an individual's genome. An elegant, cost-effective bioinformatics pipeline that analyzes sequenced data rapidly, accurately, and in a consistent, reproducible way, based on a truly population-wide reference, is the final frontier in commoditizing sequencing.