A nucleic acid is a polymeric macromolecule and consists of a sequence of monomers known as nucleotides. Each nucleotide consists of a sugar component, a phosphate group and a nitrogenous base or nucleobase. Nucleic acid molecules where the sugar component of the nucleotides is deoxyribose are DNA (deoxyribonucleic acid) molecules, whereas nucleic acid molecules where the sugar component of the nucleotides is ribose are referred to as RNA (ribonucleic acid) molecules. DNA and RNA are biopolymers appearing in living organisms.
Nucleic acid molecules are assembled as chains or strands of nucleotides. Nucleic acid molecules can be generated artificially and their chain structure can be used for encoding any kind of user data. For storing data in synthesized, i.e. artificially created, DNA or RNA, usually short DNA or RNA fragments (oligonucleotides, short: oligos) are generated. With these nucleic acid fragments, a data storage system can be realized wherein data are stored in nucleic acid molecules. The synthesized nucleic acid molecules carry the information encoded by the succession of the different nucleotides forming the nucleic acid molecules. Each of the synthesized nucleic acid molecules consists of a sequence or chain of nucleotides generated by a bio-chemical process using a synthesizer and represents an oligo or nucleic acid fragment wherein the sequence or cascade of the nucleotides encodes a code word sequence corresponding to a set of information units, e.g., sets of information bits of user data. For example, in a DNA storage system, short DNA fragments are generated. These molecules can be stored and the information can be retrieved from the stored molecules by reading the sequence of nucleotides using a sequencer.
Sequencing is a process of determining the order of nucleotides within the particular nucleic acid fragment. Sequencing can be interpreted as a read process. The read out order of nucleotides is processed or decoded to recover the original information stored in the nucleic acid fragment.
In this context, the terms “nucleic acid fragment”, “oligonucleotide” and “oligo” are used interchangeably and refer to a short nucleic acid strand. The term “short” in this context is to be understood as short in comparison to the length of natural DNA, which encodes the genetic instructions used by living organisms and which may consist of millions of nucleotides. Synthesized oligos may contain more than one nucleotide, for example more than one hundred, e.g. between 100 and 300, or several thousand nucleotides.
This technology enables a provision of data storage systems wherein a write process is based on the creation of nucleic acid fragments as sequences of nucleotides which encode information to be stored.
The generated nucleic acid fragments are stored, for example as solid matter or dissolved in a liquid, in a nucleic acid storage container. The characteristics of the nucleic acid storage may depend on the amount of stored data and an expected time before a readout of the data will take place.
Digital information storage in synthesized DNA or RNA may provide a high-capacity, low-maintenance information storage.
DNA storage has been investigated in “Next-generation digital information storage in DNA”, Church et al., Science 337, 1628, 2012, and in “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA”, Goldman et al., Nature, vol. 494, 2013.
The data can be any kind of sequential digital source data to be stored, e.g., sequences of binary or quaternary code symbols, corresponding to digitally, for example binary, encoded information, such as textual, image, audio or video data. Due to the limited oligo length, the data is usually distributed to a plurality of oligos.
In such a nucleic acid storage system the oligos are subject to several processing stages: the oligos are synthesized, i.e., the nucleic acid strands to be stored are created; amplified, i.e., the number of copies of each single oligo is increased, e.g., to several hundreds or thousands; and sequenced, i.e., the sequence of nucleotides of each oligo is determined. Each of these processing stages can introduce errors, resulting in non-decodable or incorrectly decoded information.
DNA strands consist of four different nucleotides identified by their respective nucleobases or nitrogenous bases, namely, Adenine, Thymine, Cytosine and Guanine, which are denoted shortly as A, T, C and G, respectively. RNA strands also consist of four different nucleotides identified by their respective nucleobases, namely, Adenine, Uracil, Cytosine and Guanine, which are denoted shortly as A, U, C and G, respectively.
The information is stored in sequences of the nucleotides. Regarded as an information transmission system, the mapping from information bits to the different nucleotides can be interpreted as modulation with A, T, C and G (or A, U, C and G, respectively) as modulation symbols, where the symbol alphabet size is 4. Conversely, the decision rule that maps a given symbol tuple or target code word back to an information bit tuple or source code word can be referred to as demodulation.
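As a minimal sketch of this modulation view, assume a purely illustrative mapping of two information bits per nucleotide (00 to A, 01 to C, 10 to G, 11 to T). This assignment and the function names modulate and demodulate are hypothetical and are not prescribed by the text; they only show what mapping bit tuples to an alphabet of size 4, and back, could look like in Python:

```python
# Hypothetical 2-bit-per-symbol assignment; the text does not prescribe one.
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}


def modulate(bits: str) -> str:
    """Map a bit string of even length to a sequence of DNA symbols."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be a multiple of 2")
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def demodulate(symbols: str) -> str:
    """Map a sequence of DNA symbols back to the bit string it encodes."""
    return "".join(NT_TO_BITS[nt] for nt in symbols)
```

Under this assumed mapping, demodulate(modulate(bits)) returns the original bit string. Note that such a direct mapping places no constraint on the resulting symbol sequence: an all-zero input, for instance, yields a run of identical A symbols.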
Nucleobases tend to connect to their complementary counterparts via hydrogen bonds. For example, natural DNA usually shows a double helix structure, where A of one strand is connected to T of the other strand, and, similarly, C tends to connect to G. In this context, A and T, as well as C and G, are called complementary. Correspondingly, A with U and G with C form pairs of complementary RNA bases.
Two sequences of nucleotides are considered “reverse complementary” to each other if an antiparallel alignment of the nucleotide sequences results in the nucleobases at each position being complementary to their counterparts. Reverse complementarity does not only occur between separate strands of DNA or RNA. It is also possible for a sequence of nucleotides to have internal or self-reverse complementarity. A DNA fragment is considered self-reverse complementary if it is identical to itself after the complementing and order-reversing steps have been applied. For example, the DNA fragment AATCTAGATT is self-reverse complementary: original DNA fragment: AATCTAGATT; complement: TTAGATCTAA; complement in reversed order: AATCTAGATT.
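The complementing and order-reversing steps described above can be sketched in Python as follows. The function names are illustrative assumptions, and only the four DNA symbols A, T, C and G are handled:

```python
# Complementary DNA base pairs as stated in the text: A with T, C with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(fragment: str) -> str:
    """Complement every nucleotide and reverse the order of the result."""
    return "".join(DNA_COMPLEMENT[nt] for nt in reversed(fragment))


def is_self_reverse_complementary(fragment: str) -> bool:
    """True if the fragment equals its own reverse complement."""
    return fragment == reverse_complement(fragment)
```

For the fragment AATCTAGATT from the example, reverse_complement returns AATCTAGATT again, so is_self_reverse_complementary evaluates to True.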
Long self-reverse complementary fragments may not be readily sequenced, which hinders correct decoding of the information encoded in the strand.
Further, tests have shown that runs of identical nucleotides, i.e. cascades or sequences of the same nucleotide, may reduce sequencing accuracy if the run length exceeds a certain value.
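A run-length check of this kind can be sketched in Python as follows. The text does not state a concrete threshold, so the limit parameter is an assumption to be chosen by the user, and the function names are illustrative only:

```python
from itertools import groupby


def max_run_length(fragment: str) -> int:
    """Length of the longest run of identical consecutive nucleotides."""
    return max((sum(1 for _ in run) for _, run in groupby(fragment)), default=0)


def exceeds_run_limit(fragment: str, limit: int) -> bool:
    """True if any run is longer than `limit` (an assumed, user-chosen value)."""
    return max_run_length(fragment) > limit
```

For example, the fragment AATTTTCG contains a run of four T symbols and would therefore be flagged for an assumed limit of 3.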
Furthermore, as the amplification process and the sequencing introduce errors in the oligos at different locations, many sequenced oligos may not contain the correct information.
Therefore, a specific modulation coding should be used that allows encoding of information or source data at a high coding efficiency while having a reduced probability of incorrect decoding.