An efficient way to build a data storage system is to store only unique data segments. This method can reduce the required storage space substantially, especially for data that has a high degree of redundancy. To improve the write throughput of such a storage system, the challenge is to determine quickly whether a data segment is redundant and to write the non-redundant segments quickly.
FIG. 1 is a block diagram illustrating a typical storage system. Data are generated from a variety of sources, for instance data sources 100, 102 and 104. The data sources stream their data contents to storage system 106. The storage system receives the data streams, optionally processes the data streams, and stores the data to storage devices such as hard drives. The storage system can consist of a single unit that includes processors and storage devices, or of multiple units in which processors and storage devices are connected via a network.
When data such as backup data are moved from the data sources to the data storage, there is commonly a substantial amount of data from each of the data sources that remains the same between two consecutive backups, and sometimes there are several copies of the same data. To improve efficiency, storage systems check whether portions of the data, or segments, have been previously stored.
To check whether the segments have been stored previously, storage systems produce segment IDs for the segments. Checks are performed on the data segments or segment IDs to determine whether the same segments have previously been stored to a segment database of the system. Preliminary checking techniques are used to lower the latency associated with the checking and to increase search efficiency. For example, information about segments or segment IDs that are likely to be encountered soon is stored in a cache and can be used in a preliminary check. Also, a data-derived summary can be used in the preliminary check. If the low-latency checks are inconclusive, a high-latency check is performed by searching all the previously stored segments or segment IDs for a match.
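The tiered checking described above can be illustrated with a minimal Python sketch. It is not taken from any particular storage system. All names (`BloomSummary`, `SegmentStore`, `segment_id`, `slow_lookups`) are hypothetical, a SHA-1 digest is assumed as the segment ID, a small Bloom filter is assumed as one possible form of the data-derived summary, and an in-memory dictionary stands in for the high-latency segment database.

```python
import hashlib


def segment_id(data: bytes) -> bytes:
    """Assumed segment ID: the SHA-1 digest of the segment contents."""
    return hashlib.sha1(data).digest()


class BloomSummary:
    """Small Bloom filter, assumed here as the data-derived summary."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, sid: bytes):
        # Derive num_hashes bit positions by salting the ID with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + sid).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, sid: bytes) -> None:
        for pos in self._positions(sid):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, sid: bytes) -> bool:
        # False means "definitely never added"; True is only "possibly added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(sid))


class SegmentStore:
    def __init__(self):
        self.database = {}      # stand-in for the high-latency segment database
        self.cache = set()      # IDs known to be stored and likely to recur
        self.summary = BloomSummary()
        self.slow_lookups = 0   # counts high-latency checks, for illustration

    def _store(self, sid: bytes, data: bytes) -> None:
        self.database[sid] = data
        self.summary.add(sid)

    def write(self, data: bytes):
        """Return (segment ID, True if the segment was newly stored)."""
        sid = segment_id(data)
        # Preliminary check 1: a cache hit is conclusive (already stored).
        if sid in self.cache:
            return sid, False
        # Preliminary check 2: a summary miss is conclusive (never stored).
        if not self.summary.might_contain(sid):
            self._store(sid, data)
            return sid, True
        # Both preliminary checks were inconclusive: high-latency check.
        self.slow_lookups += 1
        if sid in self.database:
            self.cache.add(sid)
            return sid, False
        self._store(sid, data)
        return sid, True
```

In this sketch, the first write of a segment is resolved by the summary alone, a repeated write falls through to the database search once, and later repeats are resolved by the cache. The cache population policy shown (adding an ID only after a database hit) is a simplification chosen for illustration.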
While this approach achieves some efficiency gains by not copying the same data twice, it still incurs significant latency when the preliminary checks are inconclusive and a high-latency check is employed to guarantee that the data have not been previously stored. It would be desirable to have a storage system that could still further reduce latency.