Genomic data is commonly stored in the .sam or .bam file format. The .sam format is a human-readable text format that stores sequenced data in tab-delimited ASCII columns. The .bam format stores the same data in a compressed, indexed, binary form. Both formats represent aligned data for a sample along with quality (QUAL) scores and metadata (tags). The many short reads produced by a sequencer are aligned and stored along with their quality data. A typical whole genome sequence of a human requires approximately 300 GB of storage. In a large-scale computing environment, 300 GB .bam files can easily consume available storage and clog networks. Existing practice relies heavily on compression using techniques such as Lempel-Ziv and gzip. More recent techniques use standard reference genomes, such as hg19, compiled from a variety of human genomes. Quality scores are calculated in the standard manner by the sequencer, such as an Illumina MiSeq or a similar instrument. Quality scores are discussed in many publications, including the online publication from Illumina called “Understanding Illumina quality scores”; see also B. Ewing and P. Green (1998), “Base-calling of automated sequencer traces using phred. II. Error probabilities,” Genome Research 8: 186-194.
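To make the tab-delimited layout and the QUAL column concrete, the following is a minimal sketch of reading one .sam alignment line. It assumes the eleven mandatory columns of the SAM specification followed by optional TAG:TYPE:VALUE fields, and the common Phred+33 ASCII encoding of quality scores. The names `SamRecord`, `parse_sam_line`, `phred_scores`, and `error_probability` are illustrative and do not belong to any particular library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM columns plus any optional tags."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read bases
    qual: str    # per-base quality, ASCII encoded
    tags: dict   # optional fields, keyed by two-character tag


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not an '@' header line) into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = (value_type, value)
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


def phred_scores(qual: str) -> list:
    """Decode a Phred+33 quality string into integer scores (assumed encoding)."""
    return [ord(ch) - 33 for ch in qual]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    example = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tII5+\tNM:i:0"
    record = parse_sam_line(example)
    for base, q in zip(record.seq, phred_scores(record.qual)):
        print(base, q, error_probability(q))
```

The example line is made up for illustration. A production reader would also need to handle header lines and a QUAL column of `*`, which the format uses when qualities are not stored.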
The .sam and .bam file formats are well known, standardized, and useful, but unfortunately require about 300-400 GB of storage per human sequence even after compression with techniques such as Lempel-Ziv or gzip. That storage size is simply too large for many purposes, especially if all cancer patient genomic data is to be gathered, stored, and/or transmitted. Because cancer genomic data is best processed in large-scale computing environments, storing 300 GB .bam files for thousands of cancer patients would easily consume all available storage and clog computer networks.
As an example of the required data sizes, consider a rather small trial to correlate genetic mutations with a specific cancer, for example breast cancer, in the hope of identifying an effective therapy. That trial may have 800 breast cancer patients with 3 to 4 sequences each and, at 300 to 400 GB per sequence, would require on the order of 1 PB of storage in total (roughly 0.7 to 1.3 PB). Genomic researchers need to reduce such capacity without loss of quality sequence data, without the increased processing time associated with decompression, and without the excessive costs and delays currently associated with moving genomic data from one location to another. Complicating the matter are issues in the alignment of the short sequence segments (snippets), which make existing methods of compression and de-duplication (removal of duplicated data) less effective.
As another example, if precision oncology becomes a reality, whole genome sequencing, particularly in the clinical treatment of cancer, would rapidly consume all available storage unless an effective way of reducing the required data size is implemented. In 2010, 13 million Americans had cancer. With existing technology, at roughly 300 GB per genome, a single whole genome sequence for each of those people would require about 3.9 exabytes (3,900 petabytes, 3.9 million terabytes, or 3.9 billion gigabytes). There simply isn't enough storage for that.
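The storage estimates in the two examples above are simple products of head count, sequences per person, and size per sequence. The following is a minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes) and the 300 to 400 GB per sequence quoted earlier. The helper name `total_bytes` is illustrative only.

```python
# Decimal storage units, in bytes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18


def total_bytes(people: int, sequences_per_person: int, gb_per_sequence: int) -> int:
    """Raw storage needed for a cohort, before any further reduction."""
    return people * sequences_per_person * gb_per_sequence * GB


# Small breast cancer trial: 800 patients, 3 to 4 sequences each.
trial_low = total_bytes(800, 3, 300)
trial_high = total_bytes(800, 4, 400)

# One whole genome sequence for each of 13 million people at 300 GB each.
national = total_bytes(13_000_000, 1, 300)

if __name__ == "__main__":
    print(f"trial:    {trial_low / PB:.2f} to {trial_high / PB:.2f} PB")
    print(f"national: {national / EB:.2f} EB")
```

Changing the per-sequence size or the number of sequences per person scales the totals linearly, which makes it easy to see how sensitive the estimates are to those assumptions.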
In view of the foregoing, improved data encoding for genomic data would be useful. Beneficially, such data encoding would be computer driven to eliminate redundant genomic data (de-duplication). Preferably, such encoded data would be compressed and searchable. In addition, it should merge re-reads of the same nucleotide into a single nucleotide having an averaged quality score. In practice, the improved data encoding should enable computer processing of the resulting encoded data without loss of information related to multiple nucleotides in sample segments. Ideally, the encoded data would be produced by a computerized DNA sequencing system and would be so efficiently packed that individual cancer patients could store their genomic data on a memory stick or other computer-readable memory. Such encoding would also enable faster transmission of data, require less data storage space, support standardized data processing, and could enable improved data processing.
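The idea of merging re-reads into a single nucleotide with an averaged quality score can be illustrated with a short sketch. The text does not say how the average is computed or how observations are keyed, so the following rests on two assumptions: each observation is a (reference position, base, Phred score) triple, and the average is the arithmetic mean of the Phred scores. Different bases observed at the same position are kept as separate entries so that no information about multiple nucleotides is discarded. The function name `merge_rereads` is hypothetical.

```python
from collections import defaultdict


def merge_rereads(observations):
    """Collapse repeated reads of the same base at the same position.

    observations: iterable of (position, base, phred) triples.
    Returns a dict mapping (position, base) to (read_count, mean_phred).
    Assumption: quality is averaged as a plain arithmetic mean of Phred scores.
    """
    grouped = defaultdict(list)
    for position, base, phred in observations:
        grouped[(position, base)].append(phred)
    return {
        key: (len(scores), sum(scores) / len(scores))
        for key, scores in grouped.items()
    }


if __name__ == "__main__":
    reads = [
        (100, "A", 40),
        (100, "A", 30),  # re-read of the same nucleotide
        (100, "G", 20),  # a different base at the same position is kept apart
        (101, "C", 35),
    ]
    for (position, base), (count, quality) in sorted(merge_rereads(reads).items()):
        print(position, base, count, quality)
```

Keeping the read count next to the averaged score is a design choice of this sketch: it lets later processing weigh a merged nucleotide by how many reads support it.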