1. Field
Embodiments of the invention relate to space recovery for storage management coupled with a deduplicating storage system.
2. Description of the Related Art
A storage-management server provides a repository for computer information that is backed up, archived, or migrated from client nodes in a computer network. A storage-management server stores data objects in one or more storage pools in a repository and uses a database for tracking metadata about the stored data objects. Stored data objects may be deleted from the storage-management server based on retention rules or by manual administrative action. When the storage-management server deletes a data object from the repository, metadata pertaining to that data object is deleted from the database. This constitutes logical deletion of the data object because the data is not readily accessible without the corresponding metadata.
After data objects have been logically deleted, the storage-management server may perform a reclamation operation to recover space from aggregates of data objects or from sequential-access volumes on which the data objects are stored. This reclamation operation is typically done by copying remaining data objects from one storage location to another, thereby consolidating the data.
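The distinction between logical deletion and reclamation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from any particular storage-management product: the `Volume` class, the dictionary standing in for the database, and the function names `ingest`, `logically_delete`, and `reclaim` are all hypothetical.

```python
class Volume:
    """Hypothetical sequential-access volume: data is only ever appended."""

    def __init__(self):
        self.data = bytearray()

    def append(self, payload: bytes) -> int:
        """Append payload and return the offset at which it was written."""
        offset = len(self.data)
        self.data.extend(payload)
        return offset


def ingest(volume, database, name, payload):
    """Store a data object and record its location metadata."""
    offset = volume.append(payload)
    database[name] = (offset, len(payload))


def logically_delete(database, name):
    """Remove only the metadata; the bytes remain on the volume."""
    del database[name]


def reclaim(old_volume, database):
    """Copy the remaining objects to a new volume, consolidating the data."""
    new_volume = Volume()
    # Walk the surviving objects in their original on-volume order.
    for name, (offset, length) in sorted(database.items(), key=lambda kv: kv[1][0]):
        payload = bytes(old_volume.data[offset:offset + length])
        database[name] = (new_volume.append(payload), length)
    return new_volume


volume, database = Volume(), {}
ingest(volume, database, "obj1", b"AAAA")
ingest(volume, database, "obj2", b"BBBBBB")
ingest(volume, database, "obj3", b"CC")

logically_delete(database, "obj2")
space_before = len(volume.data)   # deleted bytes still occupy space

volume = reclaim(volume, database)
space_after = len(volume.data)    # space recovered only after copying
```

In this sketch the space held by `obj2` is released only when `reclaim` physically copies the surviving objects, and every surviving object receives a new recorded location as a side effect.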
Deduplication describes a scenario in which common data is reduced to a single copy and redundant copies are replaced with references (e.g., pointers) to the original copy. In a typical configuration, a disk-based deduplicating storage system, such as a disk array or a Virtual Tape Library (VTL), has the capability to detect redundant data extents and reduce duplication by avoiding the redundant storage of such extents.
For example, the deduplicating storage system may divide file A into extents a-h, detect that extents b and e are redundant, and store the redundant extents only once. The redundancy could occur within file A or with other files stored in the deduplicating storage system. As another example, the deduplicating storage system may store a first file with extents (also known as chunks) x-z. The deduplicating storage system may then divide a second file into extents a-h and determine that extents b and e are the same as extents y and z in the first file (i.e., extents b and e are redundant). Then, the deduplicating storage system does not store extents b and e again. Instead, the deduplicating storage system stores the second file with a list of extents comprising the file, including references for extents b and e to corresponding extents y and z. Thus, with deduplication, redundant extents are stored once.
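The second example above can be sketched as follows. This is an illustrative sketch only: the description does not state how redundant extents are detected, so the use of SHA-256 content hashes is an assumption, as are the `DedupStore` class, its method names, and the extent contents.

```python
import hashlib


class DedupStore:
    """Hypothetical deduplicating store that keeps each distinct extent once."""

    def __init__(self):
        self.extents = {}  # digest -> extent bytes (stored once)
        self.files = {}    # file name -> ordered list of extent digests

    def store(self, name, extents):
        """Store a file given as a list of extents, reusing known extents."""
        refs = []
        for extent in extents:
            # Assumption: redundancy is detected by comparing content hashes.
            digest = hashlib.sha256(extent).hexdigest()
            self.extents.setdefault(digest, extent)  # keep the first copy only
            refs.append(digest)                      # file holds a reference
        self.files[name] = refs

    def read(self, name):
        """Reassemble a file from its list of extent references."""
        return b"".join(self.extents[d] for d in self.files[name])


store = DedupStore()

# First file with extents x, y, z.
store.store("first", [b"xxxx", b"yyyy", b"zzzz"])

# Second file with extents a-h, where b matches y and e matches z.
second = [b"aaaa", b"yyyy", b"cccc", b"dddd", b"zzzz", b"ffff", b"gggg", b"hhhh"]
store.store("second", second)

unique_extents = len(store.extents)  # 3 from the first file + 6 new ones
```

The second file is recorded as a list of eight references, two of which point at extents already stored for the first file, so only six new extents are written.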
Various technologies have been adopted for deduplicating data objects. Deduplication may be performed as data objects are ingested by the storage-management server or after ingestion. Ingestion may be described as occurring when the storage-management server receives data objects from a client, stores those data objects in its repository, and inserts metadata about the data objects into the database.
Some systems combine a storage-management server with a deduplicating storage system. Typically, the storage-management functions are decoupled from physical data storage and deduplication. This introduces the need for two levels of space recovery: 1) logical space recovery and 2) physical space recovery.
1. Logical space recovery may be required after data objects are deleted by the storage-management server, especially if the data objects are stored sequentially within aggregates or sequential-access volumes. An aggregate may be described as a collection of two or more data objects stored sequentially and treated as a single entity for efficiency. For example, it is typically more efficient to move an entire aggregate as a unit rather than individually moving each data object in the aggregate.
2. Physical space recovery may be required as the deduplicating storage system detects duplicate extents and attempts to free the space occupied by those extents.
The two levels of space recovery may interact, causing the storage-management server and deduplicating storage system to work against each other.
1. Physical space recovery by the deduplicating storage system can invalidate references to data object storage locations as tracked by the storage-management server. This can be especially problematic if deduplication is performed after ingestion, because the invalidated references force massive updates to the storage-management server database.
2. Reclamation by the storage-management server to recover space occupied by deleted data objects within aggregates or sequential-access volumes can force the deduplicating storage system to redrive deduplication operations (i.e., perform the deduplication operations again), which can be very costly in terms of computing resources. This can occur because movement of data by the reclamation operation on the storage-management server invalidates the extent information maintained by the deduplicating storage system and forces that system to repeat redundancy checking of the data at the new storage location.
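The second interaction can be illustrated with a toy model. It rests on assumptions not stated in the description: that the deduplicating storage system keeps its extent information keyed by storage location, that extents have a fixed size, and that redundancy checking means hashing each extent. The `DedupIndex` class, `EXTENT_SIZE`, and the volume names are hypothetical.

```python
import hashlib

EXTENT_SIZE = 4  # hypothetical fixed extent size, in bytes


class DedupIndex:
    """Toy extent index keyed by storage location (an assumed design)."""

    def __init__(self):
        self.by_location = {}     # (volume_id, offset) -> extent digest
        self.hash_operations = 0  # counts redundancy-checking work

    def scan(self, volume_id, data):
        """Hash every extent whose location is not yet in the index."""
        for offset in range(0, len(data), EXTENT_SIZE):
            key = (volume_id, offset)
            if key not in self.by_location:
                extent = data[offset:offset + EXTENT_SIZE]
                self.by_location[key] = hashlib.sha256(extent).hexdigest()
                self.hash_operations += 1

    def invalidate(self, volume_id):
        """Drop extent information for a volume whose data has moved."""
        self.by_location = {
            key: digest
            for key, digest in self.by_location.items()
            if key[0] != volume_id
        }


index = DedupIndex()

old_data = b"AAAABBBBCCCC"
index.scan("vol1", old_data)   # three extents hashed
index.scan("vol1", old_data)   # already indexed, so no additional work
hashes_before_reclamation = index.hash_operations

# Reclamation copies the surviving data (AAAA and CCCC) to a new volume.
new_data = b"AAAACCCC"
index.invalidate("vol1")
index.scan("vol2", new_data)   # unchanged content must be hashed again
hashes_after_reclamation = index.hash_operations
```

Although the content of the surviving extents is unchanged, their new locations are unknown to the index, so the redundancy check is repeated for every moved extent.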
The challenge is to manage storage efficiently to recover space from deleted extents whether those extents are deleted via deduplication or as a result of logical deletion of data objects.
Existing solutions have one or more of the following disadvantages:
1. Logical reclamation by the storage-management server requires physical data movement.
2. Logical reclamation by the storage-management server not only consumes computing resources for that operation, but can also cause the deduplicating storage system to redrive deduplication, which consumes additional resources.
3. Physical recovery of space occupied by duplicate extents in the deduplicating storage system can invalidate storage location references in the storage-management server, forcing updates to those references.
Thus, there is a need for improved space recovery for storage management coupled with a deduplicating storage system.