Different approaches, devices, and collections of devices may be used to protect files, information about files, or other electronic data. For example, a tiered archive system may use tape drives, disk drives, solid state drives (SSD), and an object store to store a file, to store information about a file, to store redundant copies of files, or to store other electronic data.
To ensure data protection, different approaches for storing redundant copies of items have been employed. Erasure codes are one such approach. An erasure code is a forward error correction (FEC) code for the binary erasure channel. The FEC facilitates transforming a message of k symbols into a longer message with n symbols such that the original message can be recovered from a subset of the n symbols, k and n being integers, n>k. The original message may be, for example, a file. The fraction r=k/n is called the code rate, and the fraction k′/k, where k′ denotes the number of symbols required for recovery, is called the reception efficiency. Optimal erasure codes have the property that any k out of the n code word symbols are sufficient to recover the original message. Optimal codes may require extensive memory usage, CPU time, or other resources when n is large.
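The definitions above may be illustrated with a minimal sketch. It assumes the simplest optimal erasure code, a single XOR parity symbol (n = k + 1), which is chosen here only for illustration and is not a code named in this description. The function names encode, recover, code_rate, and reception_efficiency are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from operator import xor


def encode(message):
    """Append one XOR parity symbol: k symbols in, n = k + 1 symbols out."""
    return list(message) + [reduce(xor, message, 0)]


def recover(codeword, erased_index):
    """Rebuild the k original symbols when the symbol at erased_index is lost.

    The XOR of the k surviving symbols equals the erased symbol, so any
    k of the n code word symbols are sufficient (an optimal code).
    """
    survivors = [s for i, s in enumerate(codeword) if i != erased_index]
    repaired = list(codeword)
    repaired[erased_index] = reduce(xor, survivors, 0)
    return repaired[:-1]  # drop the parity symbol


def code_rate(k, n):
    """r = k/n, as defined in the text."""
    return Fraction(k, n)


def reception_efficiency(k_prime, k):
    """k'/k, where k' symbols are required for recovery."""
    return Fraction(k_prime, k)


message = [0x12, 0x34, 0x56, 0x78]   # k = 4 symbols (illustrative values)
codeword = encode(message)           # n = 5 symbols
```

In this sketch, erasing any one of the five code word symbols still permits recovery of the four message symbols, so k′ equals k and the code rate is 4/5.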
Erasure codes are described in coding theory. Coding theory is the study of the properties of codes and their fitness for a certain purpose (e.g., backing up files). Codes may be used for applications including, for example, data compression, cryptography, error correction, and network coding. Coding theory involves data compression, which may also be referred to as source coding, and error correction, which may also be referred to as channel coding. Fountain codes are one type of erasure code.
Fountain codes have the property that a potentially limitless sequence of encoding symbols may be generated from a given set of source symbols in a manner that supports ideally recovering the original source symbols from any subset of the encoding symbols having a size equal to or larger than the number of source symbols. A fountain code may be optimal if the original k source symbols can be recovered from any k encoding symbols, k being an integer. Fountain codes may have efficient encoding and decoding algorithms that support recovering the original k source symbols from any k′ of the encoding symbols with high probability, where k′ is just slightly larger than k. Because the number of encoding symbols is not fixed in advance, a fountain code is a rateless erasure code. A rateless erasure code is distinguished from an erasure code that exhibits a fixed code rate.
Object based storage systems may employ rateless erasure code technology (e.g., fountain codes) to provide a flexible level of data redundancy. The appropriate or even optimal level of data redundancy produced using a rateless erasure code system may depend, for example, on the number and type of devices available to the object based storage system. The actual level of redundancy achieved using a rateless erasure code system may depend, for example, on the difference between the number of readable redundancy blocks (e.g., erasure codes) written by the system and the number of redundancy blocks needed to reconstruct the original data. For example, if twenty redundancy blocks are written and only eleven redundancy blocks are needed to reconstruct the original data that was protected by generating and writing the redundancy blocks, then the original data may be reconstructed even if nine of the redundancy blocks are damaged or otherwise unavailable.
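The arithmetic in the preceding example may be sketched as follows. The function names tolerable_losses and can_reconstruct are hypothetical and are used only to restate the twenty-written, eleven-needed example.

```python
def tolerable_losses(blocks_written, blocks_needed):
    """Number of redundancy blocks that may be lost without losing data."""
    if blocks_needed > blocks_written:
        raise ValueError("fewer blocks were written than are needed")
    return blocks_written - blocks_needed


def can_reconstruct(blocks_written, blocks_needed, blocks_lost):
    """True if enough readable blocks remain to rebuild the original data."""
    return blocks_written - blocks_lost >= blocks_needed


# Values taken from the example in the text: 20 written, 11 needed.
margin = tolerable_losses(20, 11)
```

With twenty blocks written and eleven needed, the margin is nine blocks; losing a tenth block would leave only ten readable blocks, which is fewer than the eleven required.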
Object based storage systems using rateless erasure code technology may facilitate storing erasure codes generated according to different redundancy policies (e.g., 7/3, 20/9, 20/2). A redundancy policy may be referred to as an N/M redundancy policy where N total erasure codes are generated and the message can be regenerated using any N-M of the N total erasure codes, M and N being integers, M<N.
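One possible representation of an N/M redundancy policy is sketched below. The class name RedundancyPolicy, its field names, and the parse_policy helper are assumptions made for illustration and do not describe the interface of any particular object store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyPolicy:
    total: int      # N: total erasure codes generated
    tolerated: int  # M: erasure codes that may be lost

    def __post_init__(self):
        # The text requires integers with M < N.
        if not 0 <= self.tolerated < self.total:
            raise ValueError("policy requires 0 <= M < N")

    @property
    def needed(self):
        """Erasure codes required to regenerate the message (N - M)."""
        return self.total - self.tolerated


def parse_policy(text):
    """Parse a policy written as 'N/M', e.g. '20/9'."""
    n, m = text.split("/")
    return RedundancyPolicy(int(n), int(m))


policies = [parse_policy(p) for p in ("7/3", "20/9", "20/2")]
```

Under this sketch, the 7/3, 20/9, and 20/2 policies require four, eleven, and eighteen erasure codes, respectively, to regenerate the message.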
When an object store is used in a tiered archive system, the overall data redundancy achieved by the tiered archive system depends on the distribution of data between devices and the redundancy policies associated with the object store. The distribution of data refers to the number and types of devices participating in storing data. The redundancy policies may include an N/M policy. For example, if the tiered archive system includes a RAID-6 tier at one site, and if data is present in the RAID-6 tier, then an optimal overall data redundancy might be provided by implementing a 20/2 erasure code policy at an object store at a single other site. If the RAID-6 tier is released or becomes otherwise unavailable, then the optimal overall data redundancy may be provided by an 18/8 erasure code policy spread across three sites. The “optimal” overall data redundancy may be defined by, for example, a system administrator or data retention expert.
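The dependence of the policy on the state of the archive may be sketched as a simple lookup. The function name select_policy and the dictionary keys are hypothetical; the policy values and site counts are those of the example above, and an actual definition of "optimal" would be supplied by, for example, an administrator.

```python
def select_policy(raid6_tier_holds_data):
    """Return the example policy for the current state of the RAID-6 tier.

    A sketch of the single example in the text, not a general rule.
    """
    if raid6_tier_holds_data:
        # Data is also protected by RAID-6, so a light policy at one other site.
        return {"policy": "20/2", "sites": 1}
    # RAID-6 tier released or unavailable: stronger policy across three sites.
    return {"policy": "18/8", "sites": 3}


with_raid6 = select_policy(True)
without_raid6 = select_policy(False)
```

Calling such a function again whenever the state of the RAID-6 tier changes would yield the policy that the example associates with the new state.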
Since conditions in a tiered archive system can vary dynamically as, for example, devices become available or unavailable, the optimal overall data redundancy may also vary dynamically. However, conventional tiered archive systems may use pre-defined configuration settings to statically determine erasure code policies.