I. Field of the Invention
The present invention generally relates to methods and systems for compressing genomic data. More particularly, the present invention relates to compressing genomic data using, for example, delta compression processes.
II. Background Information
Current social, medical, and scientific thinking converges on the idea of “personalized medicine”, broadly interpreted as tailoring clinical decision making to a patient's genetics. Genetic makeup can determine not only predisposition to disease, but also how the body reacts to various treatment modalities. Mutations in approximately 1,200 genes (about 3% of the estimated human gene complement) result in Mendelian genetic disease. Genetic factors involved in multifactorial diseases, where familial associations are a factor, such as Type 2 diabetes and cardiovascular disease, are far more numerous. Phenomena such as susceptibility to infection, susceptibility to fracture, and the aging process are also believed to have a genetic foundation. Accordingly, there is a need for a better understanding of human genetic variation, the influence of that variation on predisposition to disease, and its influence on response to therapy.
Progress towards these goals has been made possible by large-scale single nucleotide polymorphism (SNP) discovery projects. Linkage disequilibrium (LD) studies of SNP variation and disease phenotypes appear to hold promise for identifying mutations in genes that underlie disease. SNP genotyping approaches, however, are at best sampling methods, and efforts to optimize them are mostly driven by the need to minimize costs. With affordable on-demand complete human genome sequencing, full knowledge of genomic variation may become possible. When this occurs, the promise of medical decision-making based on individual genetic risk may be realized.
The United States federal government has acknowledged the need to reduce the cost of sequencing a human genome from a staggering $10-$50 million to below $1,000. This may bring the cost more in line with the cost of a clinical test. Cost effective deoxyribonucleic acid (DNA) sequencing technologies may require information systems capable of efficient storage and manipulation of many thousands of fully-sequenced individual genomes. Human genomes are large, occupying upwards of 6×10⁹ bytes each. Although computer storage media are relatively inexpensive, the number of people that may have their genomes sequenced may pose data storage problems. Moreover, problems may be encountered with input/output (I/O) times required to read and write genome sequences to storage devices, and in transmitting genome sequences over networks. Furthermore, the intercomparison of genome sequences poses a challenge. There are currently no methods or systems capable of dealing with the problems this data volume poses.
Accordingly, there is a significant need for methods and systems that address the above problems. In particular, there is a need for methods and systems that compress genomic data more efficiently, for example, by using delta compression processes.
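By way of illustration only, the general idea behind delta compression can be sketched as follows: a sequence is stored as its differences from a reference sequence rather than in full. The sketch is not the method of the invention. It assumes two equal-length sequences that differ only by single-base substitutions (insertions and deletions are not handled), and the names `delta_encode`, `delta_decode`, `reference`, and `sample` are hypothetical.

```python
from typing import List, Tuple

# A delta is a list of (position, substituted base) pairs.
# This representation is an assumption made for illustration.
Delta = List[Tuple[int, str]]


def delta_encode(reference: str, sample: str) -> Delta:
    """Record only the positions at which sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def delta_decode(reference: str, delta: Delta) -> str:
    """Rebuild the sample by applying the recorded substitutions to reference."""
    bases = list(reference)
    for position, base in delta:
        bases[position] = base
    return "".join(bases)


# Toy sequences, invented for this example.
reference = "ACGTACGTACGT"
sample = "ACGTACCTACGA"

delta = delta_encode(reference, sample)
restored = delta_decode(reference, delta)
```

In this toy case the twelve-base sample is represented by two substitutions. Because individual genomes of the same species are expected to differ from a common reference at a small fraction of positions, a difference list of this kind may be much smaller than the full sequence.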