Scaling data storage systems to hold more information is important as users' storage requirements continue to grow. One approach is to aggregate multiple individual storage systems into a cluster that combines the capacity and performance of the individual nodes into one system. However, in order to scale smoothly, each node of the cluster must be used appropriately in terms of the data stored on the node and the processing and throughput required of the node.
Some storage systems optimize storage by eliminating redundant copies of stored data, a technique known as deduplication. In some cases, stored data is divided into segments. A new segment that is to be stored is first compared against the segments already stored. If the segment is already stored on the system, a reference to the stored segment is stored instead of the entire new segment. This form of deduplication is referred to as identity compression.
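The store-a-reference behavior described above can be illustrated with a minimal sketch. It is not a definitive implementation: the fixed-size segmentation, the use of a SHA-256 digest as the means of comparing segments, and the names `DedupStore`, `split_fixed`, `write_stream`, and `read_stream` are all assumptions made for illustration and are not taken from the text.

```python
import hashlib


class DedupStore:
    """Toy in-memory segment store illustrating identity compression.

    Assumption: segments are compared by SHA-256 fingerprint rather than
    byte-by-byte; the text does not say how the comparison is performed.
    """

    def __init__(self):
        # fingerprint -> segment bytes; each distinct segment is kept once
        self.segments = {}

    def put(self, segment: bytes) -> str:
        """Store the segment if it is new; always return a reference to it."""
        fingerprint = hashlib.sha256(segment).hexdigest()
        if fingerprint not in self.segments:
            self.segments[fingerprint] = segment
        return fingerprint

    def get(self, reference: str) -> bytes:
        """Resolve a reference back to the stored segment."""
        return self.segments[reference]


def split_fixed(data: bytes, size: int) -> list:
    """Divide data into fixed-size segments (hypothetical segmentation rule)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_stream(store: DedupStore, data: bytes, size: int = 4) -> list:
    """Write data as a list of references, one per segment."""
    return [store.put(segment) for segment in split_fixed(data, size)]


def read_stream(store: DedupStore, references: list) -> bytes:
    """Reassemble the original data from its list of references."""
    return b"".join(store.get(reference) for reference in references)
```

In this sketch, writing a stream whose first two segments are identical produces two equal references and only one stored copy of that segment, while `read_stream` still reconstructs the original bytes.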
When identity compression is used with a cluster, achieving high efficiency in finding a matching segment ideally requires that each node be able to search all stored segments on all nodes and, in addition, to reference, using an index, all segments on all nodes. This requirement poses problems in scaling to a large number of nodes because each node must maintain its ability to search and index over all segments stored on every node of the cluster.
In scaling storage using a cluster approach, it is useful to divide the segments among the nodes so that no single node needs to maintain the ability to search and index over all segments stored on every node of the cluster. Such division is sometimes based on a hash of the segment so that, given a segment, a hash can be computed and the node that would hold the segment, if it already exists in the cluster, can be identified. Because each node stores only some of the segments, each node needs to index only some of the segments. Unfortunately, when segments are assigned to different nodes based on a hash of the segment, locality, and therefore performance, can be lost. Ideally, segments accessed sequentially should be stored sequentially on a disk so that a single disk access can retrieve many related segments at once. Assigning sequential segments to different nodes makes it difficult, in general, to store the segments sequentially.
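The hash-based division described above, and the way it ignores a segment's position in its stream, can be sketched as follows. This is an illustration under stated assumptions only: the choice of SHA-256, the modulo mapping from hash to node number, and the names `node_for_segment` and `place_stream` are hypothetical and do not come from the text.

```python
import hashlib


def node_for_segment(segment: bytes, num_nodes: int) -> int:
    """Pick the node responsible for a segment from a hash of its content.

    Assumption: the first 8 bytes of a SHA-256 digest, taken modulo the
    node count, stand in for whatever hash-to-node mapping a real cluster
    would use.
    """
    digest = hashlib.sha256(segment).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def place_stream(segments: list, num_nodes: int) -> dict:
    """Record which stream positions land on which node.

    The position of a segment in the stream plays no part in the choice
    of node, so neighboring positions are free to land on different nodes.
    """
    placement = {node: [] for node in range(num_nodes)}
    for position, segment in enumerate(segments):
        placement[node_for_segment(segment, num_nodes)].append(position)
    return placement
```

Because the node is a function of the segment's content alone, the same segment always maps to the same node, which is what allows a duplicate to be found by consulting one node. The same property means the mapping carries no information about which segments were adjacent in the original stream, which is the loss of locality noted above.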