Technical Field
The present invention relates to bioinformatics, and more particularly to the size reduction of a genome database.
Description of the Related Art
Genome sequencing has been greatly enhanced by the development of next-generation sequencing techniques and by advances in sequencing instruments. Sequencing a human genome, which contains approximately 3.2 billion base pairs, generates a massive amount of data that requires hundreds of gigabytes (GB) of storage space. The original plan of the 1000 Genomes project, launched in 2008, was to sequence the genomes of at least 1000 anonymous participants from different ethnic groups, using faster, less expensive technologies.
In 2012, the sequencing of 1092 genomes by the 1000 Genomes project was announced in Nature. The project administrators subsequently indicated that “as of March 2013, our ftp site is 464 TB [terabytes] and continuing to grow”.
Genome data is stored in several data formats. The FASTQ format is a text-based format for storing a biological sequence (e.g., a nucleotide sequence) and its corresponding quality scores. FASTQ encodes each sequence letter and each quality score as a single ASCII character. FASTQ is the de facto standard for storing the output of high-throughput sequencing devices, such as the Illumina® Genome Analyzer.
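By way of illustration only, the following Python sketch shows how a FASTQ record may be read and how its quality characters may be decoded into numeric scores. It assumes the common four-line record layout (header, sequence, separator, quality) and a Phred offset of 33. The offset, the record layout, and the names FastqRecord and parse_fastq are assumptions made for this example and are not taken from any particular sequencing pipeline.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # header text without the leading '@'
    sequence: str          # nucleotide letters, one ASCII character each
    qualities: List[int]   # one decoded quality score per nucleotide


def parse_fastq(lines: Iterable[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from well-formed, four-line FASTQ text.

    Assumption: each record occupies exactly four lines and the quality
    line uses the given ASCII offset (33 here; some older files use 64).
    """
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it)
        separator = next(it)
        quality = next(it)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character maps to (ASCII code - offset).
        scores = [ord(ch) - phred_offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
records = list(parse_fastq(example))
```

In this example the four nucleotides of read1 each occupy one character in the sequence line and one further character in the quality line, which is why uncompressed FASTQ files grow to roughly two bytes per sequenced base plus header overhead.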
Data stored in the FASTQ format is highly redundant, both within a sample and across samples, because many of the sequence reads consist of the same sequence. Other genomic file formats, for example the SAM, BAM and CRAM file formats, make use of the FASTQ sequencing data. These file formats require enormous amounts of computer storage space.
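The redundancy described above can be made concrete with a minimal Python sketch that measures what fraction of the reads in a collection are exact repeats of a read already seen. The function name duplicate_fraction and the sample reads are hypothetical and serve only to illustrate the notion of redundancy; the sketch is not a size-reduction method.

```python
from collections import Counter
from typing import Iterable


def duplicate_fraction(sequences: Iterable[str]) -> float:
    """Return the fraction of reads that repeat an earlier identical read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first copy of each distinct sequence is redundant.
    return 1.0 - len(counts) / total


# Hypothetical reads: three copies of one sequence and one unique read.
reads = ["ACGT", "ACGT", "TTGA", "ACGT"]
fraction = duplicate_fraction(reads)
```

Here two of the four reads repeat a sequence that has already been stored, so half of the stored sequence data carries no new information.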