The present invention relates to minimizing the surprisal data generated when a genome is compared to a reference genome, and more specifically to minimizing surprisal data through application of a hierarchy of reference genomes.
DNA sequencing of a human, for example, generates about 3 billion (3×10⁹) nucleotide base pairs. Currently all 3 billion base pairs are transmitted, stored and analyzed, with each base pair typically represented as two bits. The data associated with the sequencing is very large: storing the entire genome as plain sequence text, at one byte per base, requires about 3 gigabytes of computer data storage space, and that figure covers only the sequenced nucleotides and no other data or information, such as annotations. If the genome is stored together with other information, such as annotations, it may require terabytes of storage. The movement of the data between institutions, laboratories and research facilities is hindered by the large amount of data, the storage necessary to contain it, and the resources necessary to transmit it directly. For example, some research facilities can spend upwards of $2 million transmitting genetic data, particularly data sets on the order of terabytes that include annotations and specifics regarding the genetic sequence or genome. Transferring a very large genetic sequence over a network data processing system can take a significant amount of time.
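The storage and transfer figures above depend on how each base is encoded and on the speed of the link. The following is a rough sketch of that arithmetic, not part of the described invention. The function names, the one-byte-per-base text encoding, and the 100 Mbit/s link speed are illustrative assumptions; only the 3 billion base count and the two-bit representation come from the text.

```python
# Back-of-the-envelope storage and transfer estimates for a sequenced genome.
# All names and the link speed below are hypothetical, chosen for illustration.

BASES_IN_HUMAN_GENOME = 3_000_000_000  # about 3x10^9, as stated in the text


def storage_bytes(num_bases: int, bits_per_base: int) -> int:
    """Bytes needed to store num_bases at bits_per_base, rounded up."""
    total_bits = num_bases * bits_per_base
    return (total_bits + 7) // 8


def transfer_seconds(num_bytes: int, link_bits_per_second: int) -> float:
    """Idealized transfer time, ignoring protocol overhead and congestion."""
    return num_bytes * 8 / link_bits_per_second


# Packed encoding: two bits per base (A, C, G, T).
packed = storage_bytes(BASES_IN_HUMAN_GENOME, 2)

# Plain-text encoding: one 8-bit character per base (assumption).
as_text = storage_bytes(BASES_IN_HUMAN_GENOME, 8)

# Assumed link speed of 100 Mbit/s.
LINK_BPS = 100_000_000

print(f"packed (2 bits/base): {packed / 1e9:.2f} GB")
print(f"text   (8 bits/base): {as_text / 1e9:.2f} GB")
print(f"text transfer time:   {transfer_seconds(as_text, LINK_BPS):.0f} s")
```

Under these assumptions the sequence alone is small next to an annotated data set: a terabyte-scale transfer over the same assumed link would take on the order of a day, which is the kind of cost that motivates sending less data.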