The present disclosure relates generally to deduplication storage systems, and more particularly to methods and systems to improve rehydration performance in deduplication storage systems.
Storage systems may store large amounts of data originating from many different computers. For example, a backup storage system may regularly receive data from many different end-point computer systems (e.g., desktops in an organization), and may store backup copies of that data in a data store. Each source computer may request insertion of files (or file segments) into the storage system, and later request removal of any of the files or segments.
One method of efficiently storing data is data deduplication, which attempts to reduce the storage of redundant data. Deduplication mechanisms may both remove duplicate data already stored in shared storage and prevent duplicate data from being stored in shared storage. Accordingly, only one copy of each unique piece of data may be stored, which reduces the needed storage capacity.
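By way of illustration only, the following minimal Python sketch shows one way a store might keep a single copy of each unique segment. The class name `DedupStore`, the use of SHA-256 fingerprints, and the reference counting used to support later removal are assumptions made for this example and are not drawn from any particular deduplication system.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one copy of each unique segment.

    Hypothetical example: fingerprinting with SHA-256 and reference
    counting are illustrative assumptions, not a prescribed design.
    """

    def __init__(self):
        self.segments = {}   # fingerprint -> segment bytes (stored once)
        self.refcounts = {}  # fingerprint -> number of outstanding references

    def insert(self, segment: bytes) -> str:
        """Store the segment only if its fingerprint is new; return the fingerprint."""
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in self.segments:
            self.segments[fp] = segment
        self.refcounts[fp] = self.refcounts.get(fp, 0) + 1
        return fp

    def remove(self, fp: str) -> None:
        """Drop one reference; delete the segment when no references remain."""
        self.refcounts[fp] -= 1
        if self.refcounts[fp] == 0:
            del self.refcounts[fp]
            del self.segments[fp]


store = DedupStore()
fp_a = store.insert(b"quarterly report")  # first source computer
fp_b = store.insert(b"quarterly report")  # second source computer, same content
```

In this sketch both insertions return the same fingerprint and only one copy of the bytes is held; the copy is discarded only after both references have been removed.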
However, unless techniques are employed to maintain spatial locality of backup data, deduplication may cause the data segments of a particular backup data set to become scattered over time. Such scattering may cause an unnecessarily large number of storage containers to be required to store the deduplicated data of that backup data set. Accordingly, performance may suffer during rehydration (a process for reconstructing backed-up original files from segments stored in a deduplication system), because more containers must be accessed to reconstruct the same data.
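The effect of scattering on rehydration can be sketched as follows. This is a simplified, hypothetical model: the function `rehydrate`, the notion of a per-file "recipe" (an ordered list of segment fingerprints), and the dictionary-based containers are assumptions introduced for illustration. The sketch counts only how many distinct containers must be accessed and does not model actual I/O cost.

```python
def rehydrate(recipe, location, containers):
    """Rebuild a file from its recipe.

    recipe     -- ordered list of segment fingerprints (hypothetical format)
    location   -- maps fingerprint -> id of the container holding it
    containers -- maps container id -> {fingerprint: segment bytes}
    Returns (reconstructed bytes, number of distinct containers accessed).
    """
    touched = set()
    parts = []
    for fp in recipe:
        cid = location[fp]
        touched.add(cid)
        parts.append(containers[cid][fp])
    return b"".join(parts), len(touched)


# Placeholder fingerprints and payloads for a four-segment file.
recipe = ["f1", "f2", "f3", "f4"]
payload = {"f1": b"AA", "f2": b"BB", "f3": b"CC", "f4": b"DD"}

# Layout with good spatial locality: all four segments share container 0.
local_containers = {0: dict(payload)}
local_location = {fp: 0 for fp in recipe}

# Scattered layout: each segment resides in a different container.
scattered_containers = {i: {fp: payload[fp]} for i, fp in enumerate(recipe)}
scattered_location = {fp: i for i, fp in enumerate(recipe)}

local_data, local_reads = rehydrate(recipe, local_location, local_containers)
scattered_data, scattered_reads = rehydrate(
    recipe, scattered_location, scattered_containers
)
```

Both layouts reconstruct identical data, but in this model the scattered layout accesses four containers where the local layout accesses one.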
Furthermore, the scattering of data segments from a particular backup data set may result in non-sequential storage of backup data. Thus, conventional methods that are well-suited to retrieval of sequentially stored data may be of limited effectiveness when applied to backup data sets that include scattered data. Accordingly, rehydration performance of conventional techniques may suffer.