Scaling data storage systems to hold more information is important as users' storage requirements continue to grow. One approach is to aggregate multiple individual storage systems into a cluster that combines the capacity and performance of the individual nodes into one system. However, in order to scale smoothly, each node of the cluster must be used appropriately in terms of the data stored on the node and the processing and throughput required of the node.
Some storage systems optimize storage capacity by eliminating identical copies of stored data. In some cases, stored data is divided into segments. A new segment to be stored is first compared against the segments already stored. If an identical segment is already stored on the system, a reference to that segment is stored instead of the new segment. This is referred to as identity compression.
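The identity compression described above can be illustrated with a minimal Python sketch. The class name IdentityStore, its methods, and the use of SHA-256 as the means of comparing segments are assumptions made for illustration only and are not taken from any particular system.

```python
import hashlib


class IdentityStore:
    """Toy segment store that keeps a single copy of each distinct segment."""

    def __init__(self):
        # Maps fingerprint -> segment bytes. Hypothetical in-memory layout.
        self.segments = {}

    def put(self, segment: bytes) -> str:
        # Assumption: a SHA-256 digest is used to detect an identical segment.
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in self.segments:
            # No identical segment is stored yet, so store the segment itself.
            self.segments[fp] = segment
        # In either case the caller keeps only the reference.
        return fp

    def get(self, ref: str) -> bytes:
        # Follow a reference back to the stored segment.
        return self.segments[ref]


store = IdentityStore()
refs = [store.put(s) for s in (b"alpha", b"beta", b"alpha")]
print(len(refs), "segments written,", len(store.segments), "segments stored")
```

In this sketch the third write returns the same reference as the first, and only two segments occupy storage.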
Other storage systems optimize storage capacity by comparing a new segment to be stored against the segments already stored and looking for a similar, though not necessarily identical, segment. If a similar segment is already stored on the system, a delta between the old and new segments is computed, and a reference to the old segment, together with the delta, is stored in place of the entire new segment. This is referred to as delta compression.
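A minimal Python sketch of delta compression follows. It assumes, purely for illustration, that similarity is measured with the standard library's difflib.SequenceMatcher and that a delta is a list of copy and literal-data operations; the function names, the similarity threshold, and the sample segments are hypothetical. The linear scan in find_similar is the kind of exhaustive search that becomes costly as the number of stored segments grows.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Encode `new` as copies from `old` plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse bytes old[i1:i2]
        elif j2 > j1:
            ops.append(("data", new[j1:j2]))   # bytes present only in new
        # A pure "delete" contributes nothing to the new segment.
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new segment from the old segment and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def find_similar(stored: dict, new: bytes, threshold: float = 0.5):
    """Return the reference of the most similar stored segment, or None."""
    best_ref, best_ratio = None, threshold
    for ref, old in stored.items():
        ratio = SequenceMatcher(None, old, new, autojunk=False).ratio()
        if ratio >= best_ratio:
            best_ref, best_ratio = ref, ratio
    return best_ref


stored = {
    "seg-0": b"the quick brown fox jumps over the lazy dog",
    "seg-1": b"completely unrelated payload 0123456789",
}
new_segment = b"the quick brown fox leaps over the lazy dog"

ref = find_similar(stored, new_segment)
delta = make_delta(stored[ref], new_segment)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(ref, delta, literal_bytes)
```

What is stored for the new segment is the pair (ref, delta); apply_delta(stored[ref], delta) recovers the original bytes.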
In using either identity or delta compression with a cluster, in order to achieve high efficiency and find an identical or similar segment, each node ideally searches all stored segments on all nodes and, in addition, can reference, using an index, all segments on all nodes. The ability to search and index all segments on all nodes poses problems in scaling to a large number of nodes, because each node must maintain its ability to search and index over all segments stored on every node of the cluster, and the resulting inter-node communication can reduce performance.
It is useful, in scaling storage using a cluster approach, to divide the segments among the nodes so that no single node needs to maintain the ability to search and index over all segments stored on every node of the cluster, and so that communication among the nodes is reduced. In identity compression, such division is sometimes based on a fingerprint computed over the segment. The fingerprints are divided into groups, or buckets, and the buckets are assigned to nodes. Given a segment, a fingerprint can then be computed and the node that would hold the segment, if it already exists in the cluster, can be identified. Because each node stores only some of the segments, each node needs to index only some of the segments.
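The fingerprint-based division described above can be sketched in Python as follows. The number of buckets, the node names, the round-robin assignment of buckets to nodes, and the use of SHA-1 as the fingerprint are all assumptions chosen for illustration.

```python
import hashlib

NUM_BUCKETS = 1024                                # hypothetical bucket count
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names

# Assumption: buckets are assigned to nodes round-robin. Any fixed
# bucket-to-node table would serve the same purpose.
BUCKET_TO_NODE = {b: NODES[b % len(NODES)] for b in range(NUM_BUCKETS)}


def fingerprint(segment: bytes) -> bytes:
    # Assumption: SHA-1 stands in for the fingerprint function.
    return hashlib.sha1(segment).digest()


def bucket_of(fp: bytes) -> int:
    # Use the leading four bytes of the fingerprint to choose a bucket.
    return int.from_bytes(fp[:4], "big") % NUM_BUCKETS


def node_for(segment: bytes) -> str:
    """Identify the only node that could already hold this exact segment."""
    return BUCKET_TO_NODE[bucket_of(fingerprint(segment))]


segment = b"example segment contents"
print(node_for(segment))
```

Because the mapping is deterministic, a later copy of the same segment is routed to the same node, so only that node's index needs to be consulted.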
Unfortunately, in delta compression, fingerprints of segments are not useful for assigning segments to nodes because similar segments do not, in general, have similar fingerprints. Thus, assignment based on fingerprints would usually result in similar segments being stored on different nodes. A cluster that limited its search for a similar segment to a single node would therefore often fail to find a similar segment even when one exists in the cluster.
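The point that similar segments do not have similar fingerprints can be observed with a short Python sketch. It assumes the fingerprint is a cryptographic hash (SHA-256 here, chosen only for illustration), for which changing a single input byte is expected to change roughly half of the output bits. The segment contents and function names are hypothetical.

```python
import hashlib


def fingerprint(segment: bytes) -> bytes:
    # Assumption: SHA-256 stands in for the fingerprint function.
    return hashlib.sha256(segment).digest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(x).count("1")


base = b"x" * 4096          # a 4 KiB segment
near = base[:-1] + b"y"     # the same segment with its last byte changed

fp_base = fingerprint(base)
fp_near = fingerprint(near)
changed = differing_bits(fp_base, fp_near)

print("bytes that differ between segments: 1 of", len(base))
print("bits that differ between fingerprints:", changed, "of", len(fp_base) * 8)
```

Any bucket chosen from such fingerprints is therefore effectively unrelated for the two segments, even though they differ in only one byte.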