A nucleic acid is a polymeric macromolecule and consists of a sequence of monomers known as nucleotides. Each nucleotide consists of a sugar component, a phosphate group and a nitrogenous base or nucleobase. Nucleic acid molecules where the sugar component of the nucleotides is deoxyribose are DNA (deoxyribonucleic acid) molecules, whereas nucleic acid molecules where the sugar component of the nucleotides is ribose are referred to as RNA (ribonucleic acid) molecules. DNA and RNA are biopolymers appearing in living organisms.
Nucleic acid molecules are assembled as chains or strands of nucleotides. Nucleic acid molecules can be generated artificially and their chain structure can be used for encoding any kind of user data. For storing data in synthesized, i.e. artificially created, DNA or RNA, usually short DNA or RNA fragments (oligonucleotides, short: oligos) are generated. With these nucleic acid fragments, a data storage system can be realized wherein data are stored in nucleic acid molecules. The synthesized nucleic acid molecules carry the information encoded by the succession of the nucleotides forming the nucleic acid molecules. Each of the synthesized nucleic acid molecules consists of a sequence or chain of nucleotides generated by a bio-chemical process and represents an oligo or nucleic acid fragment wherein the sequence or cascade of the nucleotides encodes a code word sequence corresponding to a set of information units, e.g., sets of information bits of user data. For example, in a DNA storage system, short DNA fragments are generated. These molecules can be stored and the information can be retrieved from the stored molecules by reading the sequence of nucleotides using a sequencer.
In this context, the terms “nucleic acid fragment”, “oligonucleotide” and “oligo” are used interchangeably and refer to a short nucleic acid strand. The term “short” in this context is to be understood as short in comparison to the length of natural DNA, which encodes genetic instructions used by living organisms and which may consist of millions of nucleotides. Synthesized oligos may contain more than one nucleotide, for example more than one hundred, e.g. between 100 and 300, or several thousand nucleotides.
Oligonucleotide synthesis or nucleic acid synthesis is the chemical synthesis of oligos with a defined chemical structure, i.e., with a defined sequence of nucleotides, which can be generated by a nucleic acid synthesizer. In other words, a synthesizer can be used to generate artificial, synthetic fragments of nucleic acid molecules, for example DNA fragments, i.e. DNA oligos. This technology makes it possible to provide data storage systems in which the write process is based on the creation of nucleic acid fragments as sequences of nucleotides which encode the information to be stored.
The generated nucleic acid fragments are stored, for example as solid matter or dissolved in a liquid, in a nucleic acid storage container. The characteristics of the nucleic acid storage may depend on the amount of stored data and an expected time before a readout of the data will take place.
Digital information storage in synthesized DNA or RNA may provide a high-capacity, low-maintenance information storage.
DNA storage has been investigated in “Next-generation digital information storage”, Church et al., Science 337, 1628, 2012, and in “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA”, Goldman et al., Nature, vol. 494, 2013.
In a data storage system, for example a DNA storage system, the synthesized nucleic acid fragments can be stored. The information can be retrieved from the stored nucleic acid fragments by sequencing the nucleic acid fragments using a sequencer. Sequencing is a process of determining the order of nucleotides within a particular nucleic acid fragment. Sequencing can be interpreted as a read process. The read-out order of nucleotides is processed or decoded to recover the original information stored in the nucleic acid fragment. In a nucleic acid storage system, the oligos pass through three steps. They are synthesized, i.e., the nucleic acid strands to be stored are created. They are amplified, i.e., the number of copies of each single oligo is increased, e.g., to several hundreds or thousands. After storage, they are sequenced, i.e., the sequence of nucleotides of each oligo is determined.
The data can be any kind of sequential digital source data to be stored, e.g., sequences of binary or quaternary code symbols, corresponding to digitally, for example binary, encoded information, such as textual, image, audio or video data. Due to the limited oligo length, the data is usually distributed over a plurality of oligos.
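The distribution of source data over a plurality of oligos can be illustrated by the following minimal sketch. It is not a scheme described in this document: the function name `split_into_oligo_payloads`, the parameter `payload_bytes` and the use of a running index to restore the original order are assumptions made purely for illustration.

```python
from typing import List, Tuple


def split_into_oligo_payloads(data: bytes, payload_bytes: int) -> List[Tuple[int, bytes]]:
    """Split source data into indexed chunks, one chunk per oligo.

    The index is a hypothetical addressing aid so that the chunks can be
    put back in order after sequencing; it is an assumption of this sketch.
    """
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return [
        (offset // payload_bytes, data[offset:offset + payload_bytes])
        for offset in range(0, len(data), payload_bytes)
    ]


def join_oligo_payloads(chunks: List[Tuple[int, bytes]]) -> bytes:
    """Reassemble the source data from indexed chunks in any order."""
    return b"".join(payload for _, payload in sorted(chunks))
```

Because each chunk carries its index, the chunks may be read back in arbitrary order, which matches the fact that stored oligos have no inherent ordering in a storage container.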
DNA strands consist of four different nucleotides identified by their respective nucleobases or nitrogenous bases, namely, Adenine, Thymine, Cytosine and Guanine, which are denoted shortly as A, T, C and G, respectively. RNA strands also consist of four different nucleotides identified by their respective nucleobases, namely, Adenine, Uracil, Cytosine and Guanine, which are denoted shortly as A, U, C and G, respectively.
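Since DNA offers four different nucleotides, each nucleotide can in principle represent one quaternary symbol, i.e. two bits. The following sketch shows such a direct mapping for illustration only. The assignment of bit pairs to bases and the function names are assumptions, and such a naive mapping takes no account of reverse complementarity.

```python
# Hypothetical assignment of 2-bit values to DNA nucleobases (assumption).
BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_nucleotides(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def nucleotides_to_bytes(sequence: str) -> bytes:
    """Decode a nucleotide string produced by bytes_to_nucleotides."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for start in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[start:start + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)
```

For example, the byte `0x1B` (binary `00 01 10 11`) is mapped to the nucleotide sequence `ACGT` under this assumed assignment.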
Nucleobases tend to connect to their complementary counterparts via hydrogen bonds. For example, natural DNA usually shows a double helix structure, where A of one strand is connected to T of the other strand, and, similarly, C tends to connect to G. In this context, A and T, as well as C and G, are called complementary. Correspondingly, A with U and G with C form pairs of complementary RNA bases.
Two sequences of nucleotides are considered “reverse complementary” to each other, if an antiparallel alignment of the nucleotide sequences results in the nucleobases at each position being complementary to their counterparts.
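The definition above can be sketched for DNA sequences as follows. The function names are chosen for illustration and are not taken from this document.

```python
# Complementary DNA base pairs: A with T, C with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence: str) -> str:
    """Return the sequence read in antiparallel direction with each base complemented."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(sequence))


def are_reverse_complementary(first: str, second: str) -> bool:
    """True if an antiparallel alignment pairs every base with its complement."""
    return len(first) == len(second) and reverse_complement(first) == second
```

For example, `AACG` and `CGTT` are reverse complementary to each other, whereas `AACG` and `TTGC` are complementary only in a parallel alignment and therefore do not satisfy the definition.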
Reverse complementarity does not only occur between separate strands of DNA or RNA. It is also possible for a sequence of nucleotides to have internal or self-reverse complementarity. This may result in a folded configuration where the strand or sequence binds to itself and may, for example, form “hairpins” or loops. The self-reverse complementary areas of one strand may interconnect and form such “hairpins”, for example during the amplification process, with the effect that the affected oligo may be missing entirely from the amplification result.
While a single short self-reverse complementary segment may be acceptable, long self-reverse complementary DNA fragments, for example, may not be readily sequenced, and the corresponding DNA strands may not appear in readouts, which prevents or at least hinders later decoding of the information encoded in the strand.
To avoid the self-complementarity problem, each oligo could be checked for self-reverse complementarity after creation, which requires high computational effort.
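A brute-force form of such a check is sketched below to illustrate the effort involved. It is an assumption-laden example rather than a method described in this document: the function name, the `min_length` threshold and the restriction to non-overlapping segments are illustrative choices, and actual hairpin formation also depends on effects that are not modeled here.

```python
from typing import List, Tuple

DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(sequence))


def find_self_reverse_complementary(sequence: str, min_length: int) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where the segment of min_length starting at i has its
    reverse complement starting at a later, non-overlapping position j."""
    hits = []
    for i in range(len(sequence) - min_length + 1):
        target = reverse_complement(sequence[i:i + min_length])
        # Search only behind the current segment so the two segments do not overlap.
        j = sequence.find(target, i + min_length)
        while j != -1:
            hits.append((i, j))
            j = sequence.find(target, j + 1)
    return hits
```

Every segment of the oligo is compared against all later positions, so the effort grows roughly quadratically with the oligo length and must be spent for every oligo, which is the cost that a suitable coding aims to avoid.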
Therefore, a specific coding should be used that defines how the source information (e.g., provided in bytes) is represented by a plurality of nucleotides, so as to avoid the synthesis of long reverse complementary nucleotide sequences when storing data in nucleic acid strands.