The present invention relates to gene sequencing, and more specifically to parallelization of surprisal data reduction and genome construction from genetic data for transmission, storage and analysis.
DNA gene sequencing of a human, for example, generates about 3 billion (3×10⁹) nucleotide bases. Currently, all 3 billion nucleotide base pairs are transmitted, stored and analyzed. The data associated with the sequencing is very large, requiring at least 3 gigabytes of computer data storage space to store the entire genome, which includes only the sequenced nucleotide data and no other data or information such as annotations. The movement of the data between institutions, laboratories and research facilities is hindered by the large amount of data and the significant amount of storage necessary to contain it.
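The storage figure above can be checked with simple arithmetic. The following is a minimal sketch, assuming each nucleotide base is stored as one byte (for example, a single text character) and that a gigabyte is 10⁹ bytes; the function and constant names are hypothetical and chosen only for illustration.

```python
# Illustrative arithmetic only. The one-byte-per-base encoding and the
# decimal gigabyte (10**9 bytes) are assumptions, not taken from the text.
HUMAN_GENOME_BASES = 3 * 10**9  # about 3 billion nucleotide bases
BYTES_PER_GIGABYTE = 10**9


def genome_storage_bytes(num_bases: int, bytes_per_base: int = 1) -> int:
    """Return the bytes needed to store num_bases at bytes_per_base each."""
    return num_bases * bytes_per_base


def to_gigabytes(num_bytes: int) -> float:
    """Convert a byte count to decimal gigabytes."""
    return num_bytes / BYTES_PER_GIGABYTE


if __name__ == "__main__":
    total = genome_storage_bytes(HUMAN_GENOME_BASES)
    print(f"{total} bytes = {to_gigabytes(total)} GB")
```

Under these assumptions the raw sequence alone occupies 3 gigabytes, before any annotations or other information are added.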
Often during analysis, the sequence of an organism is compared to a reference genome of that organism. Depending on the number of bases and the length of the genome, the comparison can take a significant amount of time, especially when carried out by only one computer processor.
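A comparison of this kind, carried out by a single processor, might look like the sketch below. It is a simplified illustration rather than the method of the invention: it assumes the two sequences are already aligned and of equal length, and it reports only single-base substitutions. The function name `find_differences` is hypothetical.

```python
from typing import List, Tuple


def find_differences(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) wherever the bases differ.

    Assumes both sequences are pre-aligned and equal in length, so only
    substitutions are detected (an assumption made for this sketch).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    # One sequential pass over every base: the work grows linearly with
    # genome length when a single processor does all of it.
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    # Toy sequences, not real genetic data.
    print(find_differences("ACGTACGTAC", "ACGTTCGTAA"))
```

Because every base is visited in turn, the running time of such a scan scales with the length of the genome, which is the cost the paragraph above describes.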