1. Technical Field
The present invention generally relates to managing the storage of data on a storage medium, and more particularly, to managing the storage of data content using searchable data blocks on a secondary storage system.
2. Description of the Related Art
A common mechanism for storing information is a Content Addressable Storage (CAS) system, which bases the address of a block of data on its content rather than on a pre-determined storage location. Typically, CAS systems are employed for fast storage and retrieval of relatively fixed content in secondary or “permanent” storage, and they provide access to stored data through the use of content addresses. A content address (CA) is generally formed by combining several pieces of information, at least one of which depends on the content of the object stored. In general, at least a part of a content address is derived by applying a strong hash function, such as SHA-1, to the contents of an associated data block of an object.
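By way of illustration only, the formation of a content address may be sketched as follows. The function name content_address and the choice of the block length as the additional piece of information are assumptions made for this sketch; they are not taken from any particular CAS product.

```python
import hashlib


def content_address(block: bytes) -> str:
    """Form a hypothetical content address for a data block.

    The address combines two pieces of information:
      * a SHA-1 digest, which depends on the block's content, and
      * the block length, chosen here only as an example of an extra field.
    """
    digest = hashlib.sha1(block).hexdigest()  # content-dependent part
    return f"{digest}:{len(block)}"           # combined address (illustrative format)


# Identical content always yields the identical address.
addr_1 = content_address(b"abc")
addr_2 = content_address(b"abc")
# Any change to the content yields a different address.
addr_3 = content_address(b"abd")
```

In this sketch, addr_1 and addr_2 are equal, while addr_3 differs from both, because the digest portion changes whenever the content changes.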
In contrast to conventional storage systems, a storage system based on content addresses is immutable in the sense that once a data block is written, it cannot be changed, as changing the data content of a block will also change its address. This not only gives users some guarantee that the data retrieved is exactly the same as the data stored, but it also permits the system to avoid storing duplicated blocks. For example, if the user performs multiple write operations for the same data, the system will store only one copy of the data and return the same content address for each of the write operations. This is possible because the address of a block of data is determined by the system. It should be noted, however, that although CAS systems are described herein as being immutable, “immutable” should not be construed to mean that data blocks cannot be deleted. Rather, an “immutable” system should be construed to mean that the system prevents data content from being referenceable with a content address already used for different data content.
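The duplicate-elimination behavior described above may be illustrated with a minimal in-memory sketch. The class ToyCAS and its methods are hypothetical names introduced for this example, and the sketch assumes that the content address consists solely of a SHA-1 digest.

```python
import hashlib


class ToyCAS:
    """A toy content-addressed store kept in memory (illustrative only)."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def write(self, data: bytes) -> str:
        # The system, not the user, determines the address from the content.
        address = hashlib.sha1(data).hexdigest()
        # A block already stored under this address is not stored again.
        self._blocks.setdefault(address, data)
        return address

    def read(self, address: str) -> bytes:
        return self._blocks[address]

    def block_count(self) -> int:
        return len(self._blocks)


store = ToyCAS()
first = store.write(b"quarterly report")
second = store.write(b"quarterly report")      # duplicate write
changed = store.write(b"quarterly report v2")  # different content
```

Here the two writes of the same data return the same address and occupy a single stored block, while the modified data is stored as a separate block under a different address.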
Unfortunately, when employing a CAS system, a user must store the CA after writing an object in order to retain the capability of retrieving or reading the object at a later time. Because the CA is derived by hashing the content, it cannot be computed without the original content, so there is no way for a user to retrieve a block without having stored its content address. In addition, even with advanced systems, such as EMC's C-clip, in which CAs are embedded in stored objects to permit the creation of directed acyclic graphs (DAGs), the root of the DAG is a CA that includes address bits which are not derivable without the content. Upon writing an object, the C-clip's content address is returned to the application, which must store it in a different location.
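The dependence on externally stored root addresses may be illustrated with the following sketch. It does not reproduce the C-clip format; the block layout (a JSON body listing child addresses and a payload), the helper names write_block and read_tree, and the external_catalog dictionary are all assumptions introduced for this example.

```python
import hashlib
import json

blocks: dict[str, bytes] = {}  # toy content-addressed block store


def write_block(payload: bytes, children: tuple[str, ...] = ()) -> str:
    """Store a block that embeds the content addresses of its children."""
    body = json.dumps(
        {"children": list(children), "payload": payload.hex()},
        sort_keys=True,
    ).encode()
    address = hashlib.sha1(body).hexdigest()
    blocks[address] = body
    return address


def read_tree(address: str) -> list[bytes]:
    """Return the payloads reachable from the given address, parent first."""
    node = json.loads(blocks[address])
    payloads = [bytes.fromhex(node["payload"])]
    for child in node["children"]:
        payloads.extend(read_tree(child))
    return payloads


leaf_a = write_block(b"part A")
leaf_b = write_block(b"part B")
root = write_block(b"index", children=(leaf_a, leaf_b))

# The leaves are reachable from the root, but the root address itself
# must be kept outside the block store, here in a separate catalog.
external_catalog = {"my-object": root}
recovered = read_tree(external_catalog["my-object"])
```

In this sketch, every block other than the root can be located by following embedded addresses, whereas the root address is recoverable only from the separate catalog.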
Thus, current storage systems employing CAS are not self-contained, as they need separate storage that retains the CAs of root blocks and, in many systems, other blocks as well.