Many companies and individuals with large amounts of stored data employ a backup data storage system. These backup data storage systems can be located at the same site as the data to be backed up or at a remote site. The backup data storage systems can be managed by the entity controlling the primary data storage devices or by a backup data storage service company. Data can be backed up at any frequency, and any amount of data can be backed up. In the case of a failure of a primary storage device, the backed up data can be retrieved from the backup data storage system. Where the backup is managed by a service company, the data of many separate clients of the backup service can be backed up into the backup data storage system.
Deduplicating data before storage is widespread within the backup storage service market and is of growing interest in other data storage markets as well. The basic idea is to divide incoming data into smaller units called data chunks, generate a secure hash, such as a Secure Hash Algorithm 1 (SHA-1) hash, over each data chunk (this hash result is referred to herein as a “fingerprint”), and check the fingerprint against an index of previously stored data chunks. A data chunk whose fingerprint is already in the index is considered a duplicate and is not stored again, while a fingerprint that is not in the index causes the corresponding data chunk to be stored and the fingerprint to be added to the index. In this way only unique data chunks need to be stored. Each file has a recipe for reconstruction, which consists of a list of fingerprints and related information corresponding to the unique data chunks stored in the backup data storage system. For backup data storage systems, the typical backup cycle consists of daily or weekly full backups of a primary data storage system, so most of the data stored in the backup data storage system is repeated, or a ‘duplicate.’ This typically leads to high deduplication rates of 10 times (i.e., 10×) or more.
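The deduplication flow described above can be sketched as follows. This is a minimal illustration only, not the method of any particular backup product: the fixed-size chunking, the in-memory dictionary used as the fingerprint index, and the names DedupStore, write, read, and dedup_ratio are all assumptions made for the example. Production systems often use variable-size, content-defined chunking and a persistent index.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store (hypothetical; for illustration only)."""

    def __init__(self, chunk_size=4096):
        # Assumption: fixed-size chunks. The text does not specify a chunking scheme.
        self.chunk_size = chunk_size
        self.index = {}          # fingerprint -> stored data chunk
        self.bytes_received = 0  # logical bytes written by clients
        self.bytes_stored = 0    # physical bytes kept after deduplication

    def write(self, data):
        """Store data and return its recipe (ordered list of fingerprints)."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            # The SHA-1 digest of the chunk serves as its fingerprint.
            fingerprint = hashlib.sha1(chunk).hexdigest()
            if fingerprint not in self.index:
                # Unseen fingerprint: store the chunk and index it.
                self.index[fingerprint] = chunk
                self.bytes_stored += len(chunk)
            # Duplicates are only referenced from the recipe, never stored again.
            self.bytes_received += len(chunk)
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe):
        """Reconstruct the original data from a recipe."""
        return b"".join(self.index[fp] for fp in recipe)

    def dedup_ratio(self):
        """Logical bytes divided by physical bytes (0.0 if nothing is stored)."""
        if self.bytes_stored == 0:
            return 0.0
        return self.bytes_received / self.bytes_stored
```

In this sketch, writing the same buffer a second time, as a repeated full backup would, adds entries to a new recipe but stores no new chunks, so the ratio returned by dedup_ratio grows with each repeated backup.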
It has been common to collect traces that consist of disk block IDs or other input/output (I/O) level access patterns for data storage pattern analysis and similar purposes. Collecting I/O trace data is of growing importance, especially for deduplicated storage systems. However, I/O level traces provide limited insight into the actual data content in the backup data storage systems. An alternative has been to fabricate data content for data storage analysis. However, by its nature the fabricated content is not sufficiently representative of actual data content in the backup data storage systems. The limitations of these two kinds of data sets affect the quality of research into improving data storage and backup data storage systems.