1. Field of the Invention
This invention relates to computer systems and, more particularly, to efficiently reducing a number of duplicate blocks of data stored on a server.
2. Description of the Related Art
Computer systems frequently include data storage subsystems for storing data. In particular, computer systems that include multiple clients interconnected by a network increasingly share one or more data storage subsystems via a network. The shared storage may include or be further coupled to storage consisting of one or more disk storage devices, tape drives, or other storage media. Shared storage of modern computer systems typically holds a large amount of data. Efficient storage of this large amount of data may be desired in order for a modern business to effectively execute various processes.
One method of efficiently storing data includes data deduplication, which attempts to reduce the storage of redundant data. A deduplication software application may both remove duplicate data already stored in shared storage and disallow duplicate data from being stored in shared storage. Only one copy of each unique piece of data may then be stored, which reduces the required shared storage capacity.
Indexing of all data in the computing system may be retained should the redundant data ever be required. For example, data may be partitioned and a hash computation may be performed for each partition using any of several known hashing techniques. A corresponding hash value, or fingerprint, of data associated with a write request to the shared storage may be compared to fingerprints of data already stored in the shared storage. A match may invoke the deduplication application to discard the data of the write request, locate the already stored copy in shared storage, and create a reference or pointer to the existing stored data.
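The fingerprint comparison described above can be illustrated with a minimal sketch. It assumes fixed-size partitions, SHA-256 as the hashing technique, and an in-memory dictionary as the fingerprint store; the names `DedupStore`, `BLOCK_SIZE`, `write`, and `read` are hypothetical and chosen only for illustration, not taken from any particular deduplication application.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical partition size in bytes (assumption)


class DedupStore:
    """Toy model of shared storage with fingerprint-based deduplication."""

    def __init__(self):
        self.index = {}   # fingerprint -> position of the stored partition
        self.blocks = []  # unique data partitions actually kept in storage

    def write(self, data: bytes) -> list:
        """Store data and return one reference per partition."""
        refs = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).digest()
            pos = self.index.get(fingerprint)
            if pos is None:
                # No match: keep the new partition and record its fingerprint.
                pos = len(self.blocks)
                self.blocks.append(block)
                self.index[fingerprint] = pos
            # On a match the incoming partition is discarded and only a
            # reference to the already stored copy is kept.
            refs.append(pos)
        return refs

    def read(self, refs: list) -> bytes:
        """Rebuild the original data from its partition references."""
        return b"".join(self.blocks[pos] for pos in refs)


store = DedupStore()
payload = b"a" * BLOCK_SIZE + b"b" * BLOCK_SIZE + b"a" * BLOCK_SIZE
refs = store.write(payload)
```

In this sketch the third partition duplicates the first, so only two partitions are kept while three references are returned, and `store.read(refs)` reproduces the original payload.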
The comparisons of fingerprints may utilize a storage of fingerprints, such as in a random access memory (RAM) or otherwise. Upon arrival of data partitions of a write request to the shared storage, a fingerprint is calculated for each data partition and compared to the fingerprints stored in the RAM or other storage. This RAM or other storage may be referred to as a fingerprint index, or index. One design issue is maintaining an index capable of storing fingerprints of all data partitions known to be stored in the shared storage. Since a computer system's storage capacity is typically very large, the index may also need to be very large and may not fit in the supplied RAM or other storage. If a portion of the associated fingerprints is stored in the shared storage itself, such as a disk storage, performance may suffer. The disk access speeds may not be fast enough to keep up with the rate of index requests.
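The scale of the index sizing issue can be illustrated with a rough worst-case estimate. The function name `index_size_bytes` and the figures used (100 TiB of shared storage, 4 KiB partitions, 32 bytes per index entry) are hypothetical assumptions picked only to make the arithmetic concrete, and the estimate assumes every partition is unique.

```python
def index_size_bytes(storage_bytes: int, partition_bytes: int, entry_bytes: int) -> int:
    """Worst-case index size: one entry per partition, all partitions unique."""
    partitions = -(-storage_bytes // partition_bytes)  # ceiling division
    return partitions * entry_bytes


# Hypothetical figures (assumptions, not taken from any particular system).
TIB = 2 ** 40
GIB = 2 ** 30
size = index_size_bytes(100 * TIB, 4096, 32)
print(size / GIB)  # prints 800.0
```

Under these assumed figures the index alone would occupy about 800 GiB, which suggests why such an index may not fit in the supplied RAM.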
In view of the above, systems and methods for efficiently reducing a number of duplicate blocks of data stored on a server are desired.