Field of the Invention
The disclosure relates to a method and apparatus for the storage of digital information in DNA.
Brief Description of the Related Art
DNA has the capacity to hold vast amounts of information, readily stored for long periods in a compact form. Bancroft, C., Bowler, T., Bloom, B. & Clelland, C. T. Long-term storage of information in DNA. Science 293, 1763-1765 (2001) and Cox, J. P. L. Long-term data storage in DNA. TRENDS Biotech. 19, 247-250 (2001). The idea of using DNA as a store for digital information has existed since 1995. Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583-585 (1995). Physical implementations of DNA storage have to date stored only trivial amounts of information—typically a few numbers or words of English text. Clelland, C. T., Risca, V. & Bancroft, C. Hiding messages in DNA microdots. Nature 399, 533-534 (1999); Kac, E. Genesis (1999) http://www.ekac.org/geninfo.html accessed online, 2 Apr. 2012; Wong, P. C., Wong, K.-K. & Foote, H. Organic data memory. Using the DNA approach. Comm. ACM 46, 95-98 (2003); Ailenberg, M. & Rotstein, O. D. An improved Huffman coding method for archiving text, images, and music characters in DNA. Biotechniques 47, 747-754 (2009); Gibson, D. G. et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52-56 (2010). The inventors are unaware of large-scale storage and recovery of arbitrarily sized digital information encoded in physical DNA, rather than data storage on magnetic substrates or optical substrates.
Currently the synthesis of DNA is a specialized technology focused on biomedical applications. The cost of DNA synthesis has been steadily decreasing over the past decade. It is interesting to speculate on the timescale at which data storage in a DNA molecule, as disclosed herein, would become more cost effective than the current long-term archiving process of data storage on tape with rare but regular transfer to new media every 3 to 5 years. Current “off the shelf” technology for DNA synthesis equates to a price of around 100 bytes per U.S. dollar. Newer technology commercially available from Agilent Technologies (Santa Clara, Calif.) may substantially decrease this cost. However, the regular transfer of data between tape media must also be taken into account. The questions are how much this transfer of data costs and whether this cost is fixed or diminishes over time. If a substantial part of the cost is assumed to be fixed, then there is a time horizon beyond which the use of DNA molecules for data storage is more cost effective than regular data storage on tape media. After 400 years (at least 80 media transfers), it is possible that data storage using DNA molecules is already cost effective.
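The time-horizon argument above can be made concrete with a short calculation. The following Python sketch is illustrative only: it assumes a one-off DNA synthesis cost at the stated rate of around 100 bytes per U.S. dollar, a fixed cost for each tape transfer, and no other costs (for example, the initial tape write or DNA storage). The function names and the per-transfer cost are hypothetical and do not come from the disclosure.

```python
import math

def dna_synthesis_cost_usd(num_bytes, bytes_per_dollar=100):
    """One-off synthesis cost, using the ~100 bytes per U.S. dollar figure from the text."""
    return num_bytes / bytes_per_dollar

def media_transfers(years, interval_years):
    """Number of tape-to-tape transfers completed within `years` at a fixed interval."""
    return years // interval_years

def break_even_years(dna_cost_usd, tape_cost_per_transfer_usd, interval_years):
    """Years until cumulative tape transfer costs reach the one-off DNA cost.

    Assumes (hypothetically) that each transfer has the same fixed cost.
    """
    transfers = math.ceil(dna_cost_usd / tape_cost_per_transfer_usd)
    return transfers * interval_years
```

With a transfer every 5 years, `media_transfers(400, 5)` gives 80, which matches the "at least 80 media transfers" stated above; a 3-year interval gives 133. As a purely hypothetical example, storing 1,000,000 bytes would cost about 10,000 U.S. dollars to synthesize, and with an assumed fixed cost of 125 U.S. dollars per transfer, `break_even_years(10000, 125, 5)` returns 400. If the transfer cost diminishes over time rather than staying fixed, the horizon moves further out.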
The high capacity of DNA to store information stably under easily achieved conditions has made DNA an attractive target for information storage since 1995. In addition to information density, DNA molecules have a proven track record as an information carrier, and the longevity of the DNA molecule is known. Moreover, because DNA is a basis of life on Earth, methods for manipulating, storing and reading the DNA molecule will remain the subject of continual technological innovation while there remains DNA-based intelligent life. Data storage systems based on both living vector DNA (in vivo DNA molecules) and on synthesized DNA (in vitro DNA) have been proposed. The in vivo data storage systems have several disadvantages. Such disadvantages include constraints on the quantity, genomic elements and locations that can be manipulated without affecting viability of the DNA molecules in the living vector organisms. Examples of such living vector organisms include but are not limited to bacteria. These constraints decrease capacity and increase the complexity of information encoding schemes. Furthermore, germline and somatic mutation will cause the fidelity of the stored and decoded information to be reduced over time and may require the storage conditions of the living DNA to be carefully regulated.
In contrast, “isolated DNA” (i.e., in vitro DNA) is more easily “written”. Routine recovery of non-living DNA from samples that are tens of thousands of years old indicates that a well-prepared non-living DNA sample should have an exceptionally long lifespan in easily achieved, low-maintenance environments (i.e., cold, dry and dark environments). See, Shapiro, B. et al. Rise and fall of the Beringian steppe bison. Science 306, 1561-1565 (2004); Poinar, H. K. et al. Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science 311, 392-394 (2005); Willerslev, E. et al. Ancient biomolecules from deep ice cores reveal a forested southern Greenland. Science 317, 111-114 (2007); Green, R. E. et al. A draft sequence of the Neanderthal genome. Science 328, 710-722 (2010); Anchordoquy, T. J. & Molina, M. C. Preservation of DNA. Cell Preservation Tech. 5, 180-188 (2007); Bonnet, J. et al. Chain and conformation stability of solid-state DNA: implications for room temperature storage. Nucl. Acids Res. 38, 1531-1546 (2010); Lee, S. B., Crouse, C. A. & Kline, M. C. Optimizing storage and handling of DNA extracts. Forensic Sci. Rev. 22, 131-144 (2010).
Previous work on the storage of information (also termed data) in DNA has typically focused on “writing” a human-readable message into the DNA in encoded form, and then “reading” the encoded human-readable message by determining the sequence of the DNA and decoding the sequence. Work in the field of DNA computing has given rise to schemes that in principle permit large-scale associative (content-addressed) memory, but there have been no attempts to develop this work into practical DNA-storage schemes. Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583-585 (1995); Tsaftaris, S. A. & Katsaggelos, A. K. On designing DNA databases for the storage and retrieval of digital signals. Lecture Notes Comp. Sci. 3611, 1192-1201 (2005); Yamamoto, M., Kashiwamura, S., Ohuchi, A. & Furukawa, M. Large-scale DNA memory based on the nested PCR. Natural Computing 7, 335-346 (2008); Kari, L. & Mahalingam, K. DNA computing: a research snapshot. In Atallah, M. J. & Blanton, M. (eds.) Algorithms and Theory of Computation Handbook, vol. 2. 2nd ed. pp. 31-1-31-24 (Chapman & Hall, 2009). FIG. 1 shows the amounts of information successfully encoded and recovered in 14 previous studies (note the logarithmic scale on the y-axis). Points are shown for the 14 previous experiments (open circles) and for the present disclosure (solid circle). The largest human-readable message stored this way is 1280 characters of English language text, equivalent to approximately 6500 bits of Shannon information. Gibson, D. G. et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52-56 (2010); MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. (Cambridge University Press, 2003).
The Indian Council of Scientific and Industrial Research has filed a U.S. patent application, published as U.S. Patent Application Publication No. 2005/0053968 (Bharadwaj et al), that teaches a method for storing information in DNA. The method of U.S. '968 comprises an encoding method in which four DNA bases represent each character of an extended ASCII character set. A synthetic DNA molecule is then produced, which includes the digital information and an encryption key, and is flanked on each side by a primer sequence. Finally, the synthesized DNA is incorporated in a storage DNA. If the amount of DNA is too large, the information can be fragmented into a number of segments. The method disclosed in U.S. '968 is able to reconstruct the fragmented DNA segments by matching up the header primer of one of the segments with the tail primer of the subsequent one of the segments.
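Four DNA bases suffice for each extended ASCII character because four positions over a four-letter alphabet give 4^4 = 256 combinations, one for each 8-bit value. The following Python sketch illustrates an encoding of this general kind. It is not the mapping of U.S. '968: the base ordering (A=0, C=1, G=2, T=3) and the function names are assumptions made for illustration, and the encryption key and primer sequences are not modeled.

```python
BASES = "ACGT"  # assumed ordering: A=0, C=1, G=2, T=3 (hypothetical, not from U.S. '968)

def encode_bytes(data: bytes) -> str:
    """Map each 8-bit character to four bases, two bits per base, high bits first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def decode_bases(seq: str) -> bytes:
    """Invert encode_bytes; the sequence length must be a multiple of four."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for base in seq[i:i + 4]:
            value = (value << 2) | BASES.index(base)  # raises ValueError on a non-ACGT symbol
        out.append(value)
    return bytes(out)
```

Under this assumed mapping, the character "A" (value 65, binary 01000001) encodes as CAAC, and `decode_bases(encode_bytes(data))` returns the original bytes.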
Other patent publications are known which describe techniques for storing information in DNA. For example, U.S. Pat. No. 6,312,911 teaches a steganographic method for concealing coded messages in DNA. The method comprises concealing a DNA-encoded message within a genomic DNA sample, followed by further concealment of the DNA sample within a microdot. The U.S. '911 patent is directed in particular to the concealment of confidential information. Such information is generally of limited length, and thus the document does not discuss how to store longer items of information. The same inventors have filed an International Patent Application published as International Publication No. WO 03/025123.