In disk-based storage systems, there is usually a clear separation between the primary storage function, which provides rapid and efficient access to active data, and secondary storage mechanisms, which handle less active data, long-term data protection, and the maintenance of archives of historical storage contents.
These secondary functions have, for the most part, traditionally been handled using magnetic tape storage. One reason is that tape has been much cheaper than disk storage (and other alternatives); another is that tape cartridges are easily transported to provide offsite copies of data, protecting against loss due to localized disasters.
For a number of years, the cost per byte of disk hardware has been dropping at a much faster rate than that of tape hardware, making disk increasingly attractive as an alternative to tape as a medium for secondary storage. Some of the properties of disk, such as low-latency random access, clearly make it superior to tape as a secondary storage medium. If, however, the superior properties of disk are exploited in a secondary storage system, then new challenges arise which did not previously exist with tape.
For example, since every hard disk drive includes the mechanism for reading and writing the media that it contains, in a disk-based secondary storage system it becomes attractive to keep all data online at all times. This means that traditional mechanisms for protecting archival data, based on physically isolating and protecting the storage media, become inapplicable. One could simply turn the disks into write-once media by disallowing deletions in hardware, but then deletion of old data that is no longer needed would also be prohibited.
Moreover, for low-cost, safe disk storage it may be attractive to use an object storage scheme, such as is described in Margolus et al., “A Data Repository and Method for Promoting Network Storage of Data,” US 2002/0038296 A1 (Mar. 28, 2002). An object storage system is like a file system without a built-in mechanism for organizing the files (“objects”) into a hierarchy. The clients of the object storage system must define and implement any such mechanism, for example by storing directory information in objects. This lack of built-in hierarchy separates out a complicated issue from the implementation of the storage system itself.
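The division of labor described above can be illustrated with a minimal sketch. It is not the scheme of the cited reference: the class `ObjectStore`, the helpers `put_directory` and `resolve`, the JSON directory encoding, and the choice to name objects by a SHA-256 hash of their content are all assumptions made purely for illustration.

```python
import hashlib
import json


class ObjectStore:
    """Hypothetical flat store: opaque name -> bytes, with no directories."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption of this sketch: an object's name is the hash of its content.
        name = hashlib.sha256(data).hexdigest()
        self._objects[name] = data
        return name

    def get(self, name: str) -> bytes:
        return self._objects[name]


def put_directory(store: ObjectStore, entries: dict) -> str:
    """Client-side convention: a directory is an ordinary object whose
    content lists child names; the store itself does not interpret it."""
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return store.put(payload)


def resolve(store: ObjectStore, root: str, path: str) -> bytes:
    """Walk a slash-separated path by reading directory objects."""
    name = root
    for part in path.strip("/").split("/"):
        listing = json.loads(store.get(name).decode("utf-8"))
        name = listing[part]
    return store.get(name)


store = ObjectStore()
file_name = store.put(b"hello")
docs = put_directory(store, {"a.txt": file_name})
root = put_directory(store, {"docs": docs})
content = resolve(store, root, "docs/a.txt")
```

In this sketch the store sees only opaque byte strings; the hierarchy exists solely in how the client chooses to encode and interpret directory objects.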
In the example of Margolus et al. US 2002/0038296, security and privacy considerations are addressed by assuming that the storage system has little or no access to information about the structure or nature of the data that it stores. This constraint adds an extra dimension to the problem of safely allowing deletion of unnecessary data, while protecting necessary data from malicious or accidental deletion.
If deletion of unnecessary data is to be allowed, mechanisms are of course required for determining which data has become unnecessary. Traditional backup schemes maintain “snapshots” of storage system contents at predefined moments, discarding some snapshots as unnecessary after some period of time. File servers often use an on-disk snapshotting mechanism for short-term protection of files from data corruption or accidental deletion. Commonly, this is implemented by simply avoiding overwriting data that is needed for some existing snapshot, and instead writing the new data to a new location (and maintaining appropriate indexing information for finding the different versions of files). A snapshot is created by declaring at some point in time that no data that exists at that point will be overwritten. A snapshot is discarded by freeing storage resources that are not needed by any other snapshot, and are not currently in use.
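The no-overwrite snapshotting approach just described might be sketched as follows. This is a simplified, in-memory illustration rather than any particular file server's implementation; the class `SnapshotStore` and all of its method names are hypothetical.

```python
class SnapshotStore:
    """Hypothetical store that never overwrites a location in place."""

    def __init__(self):
        self._blocks = {}      # location -> data
        self._next_loc = 0     # next unused location
        self._live = {}        # file name -> location of current version
        self._snapshots = {}   # snapshot id -> frozen copy of the index

    def write(self, name, data):
        # New data always goes to a new location; the old location is
        # freed only if no snapshot still refers to it.
        loc = self._next_loc
        self._next_loc += 1
        self._blocks[loc] = data
        old = self._live.get(name)
        self._live[name] = loc
        self._free_if_unreferenced(old)

    def read(self, name, snapshot=None):
        index = self._live if snapshot is None else self._snapshots[snapshot]
        return self._blocks[index[name]]

    def create_snapshot(self, snap_id):
        # Declaring a snapshot amounts to remembering the current index.
        self._snapshots[snap_id] = dict(self._live)

    def discard_snapshot(self, snap_id):
        index = self._snapshots.pop(snap_id)
        for loc in set(index.values()):
            self._free_if_unreferenced(loc)

    def _free_if_unreferenced(self, loc):
        if loc is None:
            return
        if loc in self._live.values():
            return  # still in current use
        if any(loc in idx.values() for idx in self._snapshots.values()):
            return  # still needed by some other snapshot
        self._blocks.pop(loc, None)

    def block_count(self):
        return len(self._blocks)


store = SnapshotStore()
store.write("report", b"v1")
store.create_snapshot("monday")
store.write("report", b"v2")       # b"v1" is kept for the snapshot
old_version = store.read("report", snapshot="monday")
store.discard_snapshot("monday")   # b"v1" is now freed
```

Scanning every index on each free is chosen here for clarity only; a practical system would track references more efficiently.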
Thus one definition of unnecessary data is data that is only needed by discarded historical snapshots. The challenge of deleting only unnecessary data then requires reconciling this definition with the constraints and structure of a distributed, private and secure storage system. For example, it may not be possible, in general, for a storage server to determine which stored data is part of a given historical version, or even which historical versions exist. This problem is compounded if some pieces of data are shared: different historical versions of the same object, or even different objects, may all share common pieces of data, for storage efficiency. These pieces may only be deleted when they are no longer needed by any version of any object. Finally, there may be more sophisticated needs for the protection of historical information than are provided by simple snapshotting.
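The sharing constraint above, that a piece of data may be deleted only when no version of any object still needs it, can be illustrated with reference counting. The sketch below deliberately sets aside the privacy constraint: it assumes a component that is allowed to know which pieces belong to which version. The class `ChunkStore`, its methods, and the use of SHA-256 to identify pieces are hypothetical.

```python
import hashlib
from collections import Counter


class ChunkStore:
    """Hypothetical store in which versions share identical pieces of data."""

    def __init__(self):
        self._chunks = {}       # chunk id -> bytes
        self._refs = Counter()  # chunk id -> number of referencing versions' uses
        self._versions = {}     # (object name, version) -> list of chunk ids

    def put_version(self, obj, version, chunks):
        key = (obj, version)
        if key in self._versions:
            raise ValueError("version already exists")
        ids = []
        for chunk in chunks:
            cid = hashlib.sha256(chunk).hexdigest()
            self._chunks.setdefault(cid, chunk)  # identical pieces stored once
            self._refs[cid] += 1
            ids.append(cid)
        self._versions[key] = ids

    def read_version(self, obj, version):
        return b"".join(self._chunks[cid] for cid in self._versions[(obj, version)])

    def discard_version(self, obj, version):
        for cid in self._versions.pop((obj, version)):
            self._refs[cid] -= 1
            if self._refs[cid] == 0:
                # No version of any object needs this piece any longer.
                del self._refs[cid]
                del self._chunks[cid]

    def chunk_count(self):
        return len(self._chunks)


store = ChunkStore()
store.put_version("a", 1, [b"x", b"y"])
store.put_version("a", 2, [b"x", b"z"])  # shares b"x" with version 1
store.put_version("b", 1, [b"y"])        # a different object sharing b"y"
store.discard_version("a", 1)            # nothing is freed: both pieces are shared
```

After the final call, both pieces of the discarded version remain stored, because one is still needed by a later version of the same object and the other by a different object.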