The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for enabling efficient inline data de-duplication on a storage system in an enterprise storage cloud.
Enterprise storage systems that perform inline data de-duplication perform three major tasks in the input/output (I/O) path, each of which can impact the performance of the storage system. First, the storage system calculates hash keys for the data chunks of a given I/O. Each hash key serves as a unique identity for a chunk of data. Usually, the storage system calculates the hash key with a standard method, e.g., a message digest algorithm (MD5) checksum or a secure hash algorithm (SHA-1/SHA-2) key.
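The hash key calculation can be illustrated with a minimal sketch. It assumes fixed-size chunking with a 4096-byte chunk size and SHA-256 (a member of the SHA-2 family) as the hash function; the name `chunk_hash_keys` and the chunk size are hypothetical choices made for illustration and are not taken from any particular storage system.

```python
import hashlib
from typing import List

# Assumed chunk size for illustration; real systems may use other sizes
# or variable-length chunking.
CHUNK_SIZE = 4096


def chunk_hash_keys(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split the payload of one I/O into fixed-size chunks and return
    one hex-encoded SHA-256 hash key per chunk, in order."""
    keys = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        keys.append(hashlib.sha256(chunk).hexdigest())
    return keys
```

In this sketch, two chunks with identical contents always produce the same hash key, which is the property the de-duplication logic relies on.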
Second, the storage system looks up the hash key in the hash key index. The lookup enables the data de-duplication system to determine whether the chunk of data that has just arrived matches an existing chunk of data or is a new chunk of data that must be written to storage, with its hash key inserted into the hash key index. If the chunk matches an existing chunk of data, the storage system stores only a pointer to the existing data chunk.
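The lookup-and-insert decision can be sketched as follows. This is an illustrative in-memory model only: the class `DedupStore`, its attributes, and the use of a Python dictionary as the hash key index are assumptions made for the example, and SHA-256 stands in for whichever hash function the storage system actually uses.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory model of a hash key index with pointer storage."""

    def __init__(self):
        self.hash_index = {}  # hash key -> location of the stored chunk
        self.chunks = []      # physical storage: one entry per unique chunk
        self.pointers = []    # logical view: one location per chunk written

    def write_chunk(self, chunk: bytes) -> bool:
        """Write one chunk; return True if it was new, False if de-duplicated."""
        key = hashlib.sha256(chunk).hexdigest()
        location = self.hash_index.get(key)
        is_new = location is None
        if is_new:
            # New chunk: write the data and insert its key into the index.
            location = len(self.chunks)
            self.chunks.append(chunk)
            self.hash_index[key] = location
        # In both cases, record a pointer to where the data lives.
        self.pointers.append(location)
        return is_new
```

Writing the same chunk twice through `write_chunk` stores the data once and records two pointers to the same location.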
Third, the storage system receives write data from the application server over the wire, through small computer system interface (SCSI) or Fibre Channel, even when identical data is already stored in the storage system.