There is a continually growing need for efficient and reliable storage of data. Storage servers are often used to manage data storage activities. A storage server is a specialized computer that provides one or more clients with storage services related to the organization and storage of data. A storage server is configured to operate according to a client/server model of information delivery and may enable many clients or applications to access the data served by the system. A storage server can employ a storage architecture that serves the data with both random and streaming access patterns at either a file level, as in network attached storage (NAS) environments, or at the block level, as in a storage area network (SAN). Data is typically stored on writable persistent storage media, such as non-volatile memories and disks.
In order to more efficiently use the available data storage space in a storage system and/or minimize the costs associated with storing data, various techniques are used to compress data or otherwise reduce the amount of physical storage space required by the system to store the data. In some cases, a storage system may contain many duplicate instances of the same data. For example, a large presentation file containing multimedia content may be distributed to many employees of a company. Even though each instance of this file is the same, many copies of it may end up being stored in the company's data storage system.
Data deduplication is one method of removing duplicate instances of data from a storage system. Data deduplication is a technique for eliminating coarse-grained redundant data. In a deduplication process, blocks of data are compared to other blocks of data stored in the system. When two or more identical blocks of data are identified, the redundant block(s) are deleted or otherwise released from the system. As used herein, a “block” of data is a contiguous set of data of a known length starting at a particular address value. In certain embodiments, each block is 4 kBytes in length. However, in a given implementation blocks could be another size or sizes.
When a data block is deduplicated, the block is effectively removed from the data storage system because a duplicate instance of the data block already exists in the data storage system. Metadata that was previously associated with the deleted block(s) is modified to point to the one instance of the data block that was not deleted. In this way, multiple files or applications can share the same block of data. The deduplication process saves storage space and resources by coalescing the duplicate data blocks and coordinating the sharing of single instances of common data blocks.
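The block sharing described above can be illustrated with a minimal sketch. It is not taken from any particular storage system: the `BlockStore` class, its method names, and the use of SHA-256 fingerprints to find candidate duplicates are all assumptions made for illustration. The text only says that blocks are compared; here a fingerprint lookup narrows the search and a byte-for-byte comparison confirms the match.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 kBytes, the block length used in certain embodiments


class BlockStore:
    """Hypothetical in-memory store keeping one instance of each unique block."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block bytes (the single retained instance)
        self.refcount = {}  # fingerprint -> number of metadata pointers to that block
        self.files = {}     # file name -> ordered list of fingerprints (the metadata)

    def write_file(self, name, data):
        # For brevity, this sketch assumes each file name is written only once.
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            existing = self.blocks.get(fp)
            if existing is None:
                # First instance of this content: keep it.
                self.blocks[fp] = block
                self.refcount[fp] = 0
            elif existing != block:
                # Same fingerprint, different bytes: not handled in this sketch.
                raise RuntimeError("fingerprint collision")
            # Duplicate or not, the file's metadata points at the shared instance.
            self.refcount[fp] += 1
            pointers.append(fp)
        self.files[name] = pointers

    def read_file(self, name):
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def physical_bytes(self):
        return sum(len(block) for block in self.blocks.values())
```

If the same two-block, 8192-byte file is written under two names, `physical_bytes()` reports 8192 rather than 16384, and each block's reference count is 2, because both files' metadata point at the same retained blocks.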
Deduplication is often performed on data in a storage system after the data has already been stored in a persistent storage device or facility in the data storage system. This approach is sometimes referred to as post-processing deduplication. Post-processing deduplication is beneficial in that it can be performed during periods of time in which the demands placed on the data storage system are lower, thereby reducing any effect on the read/write performance of the data storage system. In some cases, post-processing deduplication may be performed when the data storage system is otherwise offline (i.e., not available to its clients). The drawback to this approach is that the storage space savings benefit associated with the deduplication is not realized until some point in time after the data has been initially stored. Consequently, the system still needs sufficient available physical storage capacity to store the non-deduplicated data prior to the deduplication.
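A post-processing pass of the kind described above might look roughly like the following sketch. The function name `post_process_dedup`, the dictionary-based model of physical storage, and the SHA-256 fingerprinting are illustrative assumptions rather than details from the text. The point it shows is that every block is already occupying space before the pass runs, and space is released only afterwards.

```python
import hashlib


def post_process_dedup(physical, file_maps):
    """Deduplicate blocks that are already stored.

    physical:  dict mapping block address -> block bytes (hypothetical model)
    file_maps: dict mapping file name -> ordered list of block addresses
    Both are modified in place. Returns the number of blocks released.
    """
    canonical = {}  # fingerprint -> address of the retained instance
    remap = {}      # address of a redundant block -> address of retained instance

    for addr in sorted(physical):
        fp = hashlib.sha256(physical[addr]).digest()
        keep = canonical.get(fp)
        if keep is not None and physical[keep] == physical[addr]:
            remap[addr] = keep  # identical block already retained elsewhere
        else:
            canonical.setdefault(fp, addr)

    # Repoint metadata from the redundant blocks to the retained instances.
    for name, addrs in file_maps.items():
        file_maps[name] = [remap.get(a, a) for a in addrs]

    # Only now is the storage space actually released.
    for addr in remap:
        del physical[addr]
    return len(remap)
```

For example, with three stored blocks where the blocks at addresses 0 and 2 are identical, the pass releases one block and rewrites any pointer to address 2 so that it refers to address 0.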
In other cases, deduplication is performed as the data is received by the data storage system and prior to it being stored in persistent storage of the data storage system. This approach is sometimes referred to as inline deduplication. While inline deduplication eliminates the additional physical data storage space that post-processing deduplication requires, it can place significant demands on the data storage system because the complex computational processes involved in deduplication are performed while the data storage system is receiving read and write requests from clients. The computational demands of deduplicating every data object before it is stored in persistent storage can affect the read/write performance of the data storage system because the deduplication operations are performed in real time, or near real time. This can be particularly challenging during the busiest times of operation. When the data storage system is processing an increased number of write requests, fewer computational resources are available to perform the correspondingly increased number of deduplication operations. As a result, a data storage system performing inline deduplication may have difficulty satisfying performance metrics.
Inline deduplication does, however, typically require less storage input/output (i.e., disk I/O) than post-processing deduplication. In post-processing deduplication, data objects are first written to persistent storage in non-deduplicated form. Consequently, post-processing deduplication involves the additional I/O operations associated with reading the data objects back, and then re-writing the resulting data objects after deduplication has been performed.
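The I/O difference between the two approaches can be made concrete with a toy model. Everything in it is an assumption for illustration: the `CountingDisk` class merely counts block reads and writes, fingerprint equality (SHA-256) is treated as block identity for brevity, and metadata re-writes are not counted, so the figures understate the post-processing cost described above.

```python
import hashlib


class CountingDisk:
    """Hypothetical device that stores blocks and counts reads and writes."""

    def __init__(self):
        self.blocks = {}
        self.reads = 0
        self.writes = 0
        self._next_addr = 0

    def write(self, data):
        addr = self._next_addr
        self._next_addr += 1
        self.blocks[addr] = data
        self.writes += 1
        return addr

    def read(self, addr):
        self.reads += 1
        return self.blocks[addr]

    def free(self, addr):
        del self.blocks[addr]


def inline_path(disk, incoming):
    """Deduplicate before storing: only unique blocks reach the disk."""
    seen = {}  # fingerprint -> address of the stored instance
    pointers = []
    for block in incoming:
        fp = hashlib.sha256(block).digest()
        if fp not in seen:
            seen[fp] = disk.write(block)
        pointers.append(seen[fp])
    return pointers


def post_process_path(disk, incoming):
    """Store everything first, then read it back and release duplicates."""
    pointers = [disk.write(block) for block in incoming]
    seen = {}
    for i, addr in enumerate(pointers):
        fp = hashlib.sha256(disk.read(addr)).digest()
        if fp in seen:
            disk.free(addr)
            pointers[i] = seen[fp]
        else:
            seen[fp] = addr
    return pointers
```

For four incoming blocks of which two are unique, the inline path in this model performs 2 writes and no reads, while the post-processing path performs 4 writes and 4 reads; both end with 2 blocks stored.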