The value of data may vary over time. However, conventional object storage systems may treat data the same throughout its lifetime, regardless of the freshness of the data. For example, conventional systems may employ a static or pre-defined disk safety factor that controls the redundancy of data by controlling how many erasure codes are stored for an item. In these conventional systems, the safety factor may be established once when the data is stored.
The value of some data may increase over time. For example, in an archive scenario, the original source content may be the most important to protect, while recent changes to the original source content may be less valuable. The value of other data may decrease over time. For example, in a backup scenario, the newest data or the most recently saved data may be important right after it is saved, but may be less important as time moves on. Over time, other backup copies may be made, which may reduce the value of an earlier backup. Consider analytic data like a weblog. Data in a weblog may have greater value when new but then may have less and less value over time as, for example, more recent data is saved to the weblog.
Conventional systems may be unable to dynamically adjust the disk safety factor for an item and thus may be unable to account for the changing value of data over time. Being unable to account for the changing value of data may require choosing between wasting storage space for excessive redundancy at some point in the data life cycle or saving storage but being exposed to an undesirable risk of data loss at some point in the data life cycle.
Different approaches may be used to protect files, information about files, or other electronic data. For example, an object store may interact with an archive system to store a file, to store information about a file, or to store other electronic data. To ensure data protection, different approaches for storing redundant copies of items have been employed. Erasure codes are one such approach. An erasure code is a forward error correction (FEC) code for the binary erasure channel. The FEC code facilitates transforming a message of k symbols into a longer message with n symbols such that the original message can be recovered from a subset of the n symbols, k and n being integers, n>k. The original message may be, for example, a file. The fraction r=k/n is called the code rate, and the fraction k′/k, where k′ denotes the number of symbols required for recovery, is called the reception efficiency. Optimal erasure codes have the property that any k of the n code word symbols suffice to recover the original message. Optimal codes may, however, require extensive memory usage, CPU time, or other resources when n is large.
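The properties above can be illustrated with a minimal sketch of an optimal erasure code, assuming the simplest useful parameters of k=2 and n=3 (code rate r=2/3): two source symbols plus their XOR parity, from which any k=2 of the n=3 code word symbols suffice to recover the original message. The function names are hypothetical and chosen only for illustration; practical systems use larger codes such as Reed-Solomon.

```python
def xor_encode(a, b):
    """Transform a message of k=2 source symbols into n=3 code
    word symbols: the two originals plus an XOR parity symbol."""
    return [a, b, a ^ b]

def xor_decode(symbols):
    """Recover the original (a, b) from any k=2 of the n=3
    symbols. `symbols` maps code word position (0, 1, or 2)
    to the symbol value that survived."""
    if 0 in symbols and 1 in symbols:
        return symbols[0], symbols[1]
    if 0 in symbols and 2 in symbols:
        # b = a XOR (a XOR b)
        return symbols[0], symbols[0] ^ symbols[2]
    if 1 in symbols and 2 in symbols:
        # a = b XOR (a XOR b)
        return symbols[1] ^ symbols[2], symbols[1]
    raise ValueError("need at least k=2 of the n=3 symbols")
```

For example, encoding (5, 9) yields [5, 9, 12], and losing any single symbol still permits recovery, which is exactly the "any k of n" property of an optimal erasure code.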
Erasure codes are described in coding theory. Coding theory is the study of the properties of codes and their fitness for a certain purpose (e.g., backing up files). Codes may be used for applications including, for example, data compression, cryptography, error-correction, and network coding. Coding theory involves data compression, which may also be referred to as source coding, and error correction, which may also be referred to as channel coding. Fountain codes are one type of erasure code.
Fountain codes have the property that a potentially limitless sequence of encoding symbols may be generated from a given set of source symbols in a manner that supports ideally recovering the original source symbols from any subset of the encoding symbols having a size equal to or larger than the number of source symbols. A fountain code is optimal if the original k source symbols can be recovered from any k encoding symbols, k being an integer. Fountain codes may have efficient encoding and decoding algorithms that support recovering the original k source symbols from any k′ of the encoding symbols with high probability, where k′ is just slightly larger than k. Because the number of encoding symbols is not fixed in advance, fountain codes are rateless: a rateless erasure code is distinguished from an erasure code that exhibits a fixed code rate.
Object based storage systems may employ rateless erasure code technology (e.g., fountain codes) to provide a flexible level of data redundancy. The appropriate or even optimal level of data redundancy produced using a rateless erasure code system may depend, for example, on the value of the data. The actual level of redundancy achieved using a rateless erasure code system may depend, for example, on the difference between the number of readable redundancy blocks (e.g., erasure codes) written by the system and the number of redundancy blocks needed to reconstruct the original data. For example, if twenty redundancy blocks are written and only eleven redundancy blocks are needed to reconstruct the original data that was protected by generating and writing the redundancy blocks, then the original data may be reconstructed even if nine of the redundancy blocks are damaged or otherwise unavailable.
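The redundancy arithmetic in the example above reduces to a simple difference, sketched below with a hypothetical helper name chosen for illustration:

```python
def loss_tolerance(blocks_written, blocks_needed):
    """Number of redundancy blocks that may be damaged or otherwise
    unavailable while the original data remains reconstructible."""
    return blocks_written - blocks_needed

# The scenario from the text: twenty redundancy blocks written,
# eleven needed for reconstruction.
print(loss_tolerance(20, 11))  # -> 9
```

Raising or lowering the number of redundancy blocks written therefore directly raises or lowers the number of losses the stored item can survive.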