Computer systems may include different resources used by one or more host processors. Resources and host processors in a computer system may be interconnected by one or more communication connections. These resources may include, for example, data storage devices such as those included in the data storage systems manufactured by EMC Corporation. These data storage systems may be coupled to one or more host processors and provide storage services to each host processor. Multiple data storage systems from one or more different vendors may be connected and may provide common data storage for one or more host processors in a computer system.
A host may perform a variety of data processing tasks and operations using the data storage system. For example, a host may perform basic system I/O operations in connection with data requests, such as data read and write operations.
Host systems may store and retrieve data using a data storage system containing a plurality of host interface units, disk drives, and disk interface units. Such data storage systems are provided, for example, by EMC Corporation of Hopkinton, Mass. The host systems access the storage device through a plurality of channels provided therewith. Host systems provide data and access control information through the channels to the storage device, and the storage device provides data to the host systems also through the channels. The host systems do not address the disk drives of the storage device directly, but rather access what appears to the host systems as a plurality of logical units, logical devices, or logical volumes. The logical units may or may not correspond to the actual physical disk drives. Allowing multiple host systems to access a single storage device allows the host systems to share data stored therein.
Data deduplication is a specialized data compression technique used to improve storage utilization by eliminating duplicate copies of data. In a deduplication process, a data storage system identifies a received chunk of data, such as by executing a hash function on the chunk to generate a hash value that identifies the chunk, stores the chunk, and creates a reference counter for the chunk. When the data storage system receives a subsequent chunk to be stored, the data storage system identifies the subsequent chunk, such as by generating a hash value for the subsequent chunk, and compares the identifier for the subsequent chunk to the identifiers for the previously stored chunks. Whenever a match of identifiers occurs, the data storage system stores the new yet redundant chunk as a small reference that points to the previously stored chunk that has the matching identifier, and increments the reference counter for this previously stored chunk. If the data storage system receives a request to delete a stored chunk, the data storage system decrements the reference counter for that chunk. If the reference counter for a chunk is decremented to zero, the data storage system deletes the chunk and the reference counter for the chunk because the chunk no longer needs to be stored.
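The deduplication process described above may be sketched in Python as follows. This is only an illustrative in-memory model, not the implementation of any particular data storage system: the class name DedupStore, its method names, and the use of SHA-256 as the hash function are assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory chunk store with hash-based deduplication (hypothetical)."""

    def __init__(self):
        self.chunks = {}     # hash value -> chunk bytes (each chunk stored once)
        self.refcounts = {}  # hash value -> reference counter

    def write(self, chunk: bytes) -> str:
        """Store a chunk and return its identifier, which serves as the reference."""
        key = hashlib.sha256(chunk).hexdigest()  # SHA-256 is an assumed choice
        if key in self.chunks:
            # Matching identifier: keep only a reference and increment the counter.
            self.refcounts[key] += 1
        else:
            # First occurrence: store the chunk and create its counter.
            self.chunks[key] = chunk
            self.refcounts[key] = 1
        return key

    def delete(self, key: str) -> None:
        """Drop one reference; delete the chunk when the counter reaches zero."""
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:
            del self.chunks[key]
            del self.refcounts[key]


store = DedupStore()
ref_a = store.write(b"same payload")
ref_b = store.write(b"same payload")    # duplicate, stored as a reference
stored_after_writes = len(store.chunks)
count_after_writes = store.refcounts[ref_a]
store.delete(ref_a)
store.delete(ref_b)                     # last reference, so the chunk is deleted
stored_after_deletes = len(store.chunks)
```

In this sketch the returned hash value plays the role of the small reference; a production system would typically also guard against hash collisions and persist both tables.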
For example, a host executes a write command to store data XXX to disk 1 of a data storage device, and the data storage device generates the hash value of H1 for the data XXX, determines that the hash value H1 is not already stored in its hash table, adds the hash value H1 to its hash table, creates a reference counter for the hash value H1, sets the reference counter for the hash value H1 to 1, and stores the data XXX to disk 1. Then the host executes a write command to store data YYY to disk 1 of the data storage device, and the data storage device generates the hash value of H2 for the data YYY, determines that the hash value H2 is not already stored in its hash table, adds the hash value H2 to its hash table, creates a reference counter for the hash value H2, sets the reference counter for the hash value H2 to 1, and stores the data YYY to disk 1. Next, the host executes a write command to store data XXX to disk 4 of the data storage device, and the data storage device generates the hash value of H1 for the data XXX, determines that the hash value H1 is already stored in its hash table, increments the reference counter for the hash value H1 to 2, and stores the data XXX to disk 4 as a small reference to the previously stored data XXX. Subsequently, the host executes a write command to store data ZZZ to disk 4 of the data storage device, and the data storage device generates the hash value of H3 for the data ZZZ, determines that the hash value H3 is not already stored in its hash table, adds the hash value H3 to its hash table, creates a reference counter for the hash value H3, sets the reference counter for the hash value H3 to 1, and stores the data ZZZ to disk 4. 
Lastly, the data storage system receives a request to delete disk 4, determines that disk 4 stores data that corresponds to the hash values H1 and H3, decrements the reference counter for the hash value H1 from 2 to 1, decrements the reference counter for the hash value H3 from 1 to 0, deletes the reference counter for the hash value H3, deletes the hash value H3 from its hash table, and deletes disk 4.
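The sequence of writes to disk 1 and disk 4, followed by the deletion of disk 4, may be traced with a short Python sketch. The fixed mapping of XXX, YYY, and ZZZ to H1, H2, and H3 stands in for a real hash function, and the function and variable names are hypothetical, chosen only for this illustration.

```python
# Hypothetical stand-ins for the hash values in the example; a real system
# would compute them with a hash function.
HASH_OF = {"XXX": "H1", "YYY": "H2", "ZZZ": "H3"}

hash_table = {}  # hash value -> reference counter
disks = {}       # disk number -> hash values of the data written to that disk


def write(disk: int, data: str) -> None:
    h = HASH_OF[data]
    if h in hash_table:
        hash_table[h] += 1  # already stored: keep only a reference
    else:
        hash_table[h] = 1   # new data: add the hash, set its counter to 1
    disks.setdefault(disk, []).append(h)


def delete_disk(disk: int) -> None:
    for h in disks.pop(disk):
        hash_table[h] -= 1
        if hash_table[h] == 0:
            del hash_table[h]  # no references remain


write(1, "XXX")
write(1, "YYY")
write(4, "XXX")
write(4, "ZZZ")
before = dict(hash_table)
delete_disk(4)
after = dict(hash_table)
print(before)  # {'H1': 2, 'H2': 1, 'H3': 1}
print(after)   # {'H1': 1, 'H2': 1}
```

After disk 4 is deleted, H1 survives with a counter of 1 because disk 1 still references the data XXX, while H3 is removed entirely.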
Storage-based data deduplication processes large volumes of data and identifies relatively large chunks of data that are identical, such as an entire file or a large section of a file, in order to store only one copy of each such chunk. For example, a typical email system might contain 100 instances of the same 1-megabyte file attachment. Without data deduplication, all 100 instances of this attachment are stored when the email system is backed up, thereby requiring 100 megabytes of storage space for this attachment. With data deduplication, only the first instance of this attachment is actually stored, and the subsequent instances are referenced back to the previously stored instance, resulting in a deduplication ratio of roughly 100 to 1 for this attachment.
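The arithmetic behind the attachment example may be checked with a small Python sketch. The function name and the per-reference size of 0.0001 megabyte are assumptions introduced only to show why the ratio is roughly, rather than exactly, 100 to 1.

```python
def dedup_ratio(instances: int, size_mb: float, reference_mb: float) -> float:
    """Ratio of storage needed without deduplication to storage needed with it."""
    without_dedup = instances * size_mb
    # One full copy is stored; every other instance is only a reference.
    with_dedup = size_mb + (instances - 1) * reference_mb
    return without_dedup / with_dedup


# Idealized case: references are treated as taking no space.
ideal_ratio = dedup_ratio(100, 1.0, 0.0)

# Assumed reference size of 0.0001 MB, purely for illustration.
ratio_with_references = dedup_ratio(100, 1.0, 0.0001)

print(ideal_ratio)
print(round(ratio_with_references, 2))
```

With zero-size references the ratio is exactly 100 to 1; once each reference occupies some space, the ratio falls slightly below that figure.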