1. Field of the Invention
The present invention relates to the storage of a subject genome.
2. Description of the Related Art
A genome is comprised of an organism's complete set of deoxyribonucleic acid (DNA). DNA in the human genome is organized into 22 pairs of chromosomes and 2 sex chromosomes. Each chromosome consists of many genes, which are the functional and physical units of heredity passed from parent to offspring. Researchers can determine what kinds of illnesses a person may be predisposed to by studying the genes contained within the individual's genome. For example, a number of genes have been identified and associated with breast cancer, muscle disease, deafness, and blindness. Early detection of disease can lead to an understanding of how a specific medicine will work on an individual, allow doctors to design drug treatments that are specifically customized to an individual's unique genome, assist in the development of effective new therapies, and even lead to early intervention for chronic illnesses.
In order to analyze the biological properties of a gene, however, scientists must further break the gene into its component parts, or nucleotide bases. Each nucleotide base contains one of four base chemicals, namely adenine, thymine, guanine, or cytosine, together with one molecule each of sugar and phosphoric acid. Notably, a gene is comprised of a specific sequence of nucleotide bases, or nucleotide sequence, that encodes instructions on how to make proteins. Approximately 2% of a human genome is comprised of genes, while repeated nucleotide sequences which do not code for proteins (“junk DNA”) make up at least 50% of the human genome. The remainder consists of non-coding regions whose functions are still unclear, but which may provide chromosomal structural integrity and regulate the manufacture of proteins. Researchers analyze DNA sequence patterns within the genome in order to identify human genes and interpret their functions. For example, human genes appear to be concentrated in random areas along the genome, with vast expanses of non-coding DNA in between.
The problem faced by modern genomic research is that the amount of memory space required to store a representation of an individual human genome for research purposes is daunting. For example, the human genome contains over 3 billion nucleotide bases, and approximately 30,000 genes. Current methods of compressing genomic data include representing each nucleotide base with a 2-bit code rather than a full 8-bit character, so that four bases fit in one character and the storage requirement falls from 3 billion characters to 750,000,000 characters. A further reduction of approximately 10 to 20% can be accomplished by accounting for repeated sequences of nucleotide bases, thereby reducing the storage requirement to as little as about 600,000,000 characters. However, even with these reductions, this represents a formidable amount of data to store and process per individual genome.
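By way of illustration only, the 2-bit coding of nucleotide bases described above may be sketched as follows. The particular assignment of bit patterns to bases and the function names (pack_bases, unpack_bases, packed_size) are assumptions made for this example and are not taken from any specific prior method; the sketch also assumes an 8-bit character and a sequence containing only the letters A, C, G, and T.

```python
# Hypothetical 2-bit code: the mapping of bit patterns to bases is an
# assumption for illustration, not a mapping specified by the source.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(sequence: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per 8-bit byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so that unused low bits are zero.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_bases(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from the packed representation."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_size(num_bases: int) -> int:
    """Number of 8-bit characters needed for `num_bases` at 2 bits each."""
    return (num_bases + 3) // 4
```

Under these assumptions, packed_size(3_000_000_000) evaluates to 750,000,000, matching the figure given above, and the original sequence is recovered by calling unpack_bases with the packed bytes and the original sequence length. The additional reduction obtained by accounting for repeated sequences is not modeled in this sketch.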