Technical Field
The present invention relates to genetic sequencing and, more particularly, to storage techniques for large volumes of genome data.
Description of the Related Art
Historically, genetic sequencing was a slow process that generated genome data at a rate that could be easily accommodated by existing storage technologies. Next generation sequencing technologies, however, produce genome data much more rapidly, and the rate of data production is increasing at a rate that outstrips storage capacity technology advances. For example, the European Bioinformatics Institute estimates that their storage capacity needs double year-to-year. The capacity of storage technologies, meanwhile, increases roughly in accordance with Moore's law, doubling every eighteen months.
In Inc early stages UI generating genome; data, data is often produced in a human-readable (e.g., ASCII) format called SAM. However, SAM is very inefficient and produces files that are terabytes in size. These SAM files are often converted to BAM, which is a binary form and is typically compressed. A compressed BAM file is a sequence of compressed BAM blocks, with uncompressed BAM blocks each having about 64 kb of data. However, in the face of the tremendous rate at which genomic data is generated, even BAM files are growing unsuitable for storage.
Superior data management techniques are therefore needed. If disks are not large enough to store all of the available data, an additional cost is incurred by file management in determining which files should be deleted and which should be preserved.
Tape storage is frequently used for archiving genome data. Standard tape devices often implement hardware data compression and can perform real-time compression of archived data. However, data compression by tape devices is not effective for genome data that has already been compressed using, e.g., BAM compression.