Entities often generate and use data that is important in some way to their operations. This data can include, for example, business data, financial data, and personnel data. If this data were lost or compromised, the entity may realize significant adverse financial and other consequences. Accordingly, many entities have chosen to back up some or all of their data so that in the event of a natural disaster, unauthorized access, or other events, the entity can recover any data that was compromised or lost, and then restore that data to one or more locations, machines, and/or environments.
While data backup is a valuable and important function, the ever-increasing volume of data that is generated presents significant problems. In particular, many companies today find their backup and recovery processes strained as data growth in enterprise IT environments continues to accelerate at exponential rates, while data-protection solutions have struggled to keep pace. Backup performance is further strained by the demands of business applications that must remain online and up to date.
In challenging environments such as these, resort has been made to techniques such as data compression in order to reduce the amount of storage space consumed by backup data. In connection with data compression processes, it is often useful to be able to determine a data compression ratio for a group of data sets, such as files, for example. However, as discussed in more detail below, due to complexities in the environment in which the compression is performed, problems can arise when attempts are made to determine data compression ratios.
In general, data compression refers to any process that encodes information in a smaller number of bits than the original representation, in effect reducing the amount of space a file uses on persistent storage. For example, the Lempel-Ziv (LZ) family of data compression methods may be used for lossless data compression in file systems. In the context of a file system with data deduplication and stream segmentation, each file will be split into a potentially very large number K of small segments of average size S. For example, a file F can be split into the following sequence of segments F=(s1, s2, s3, . . . , sK), and each segment may be compressed before it is stored. In this case, each file will have an average compression ratio R. For example, a compression ratio R=0.5 means that the segments will on average be reduced in size by 50% before they are stored. In this case, the compressed average segment size C can be calculated as C=S×R. Thus, if the average compression ratio R=0.5 and the average segment size is 8 KB, the compressed average segment size will be 4 KB.
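The arithmetic above can be sketched as a short example, using the illustrative values from the text (an average segment size of 8 KB and an average compression ratio of 0.5):

```python
# Illustrative calculation of the compressed average segment size C = S x R,
# using the example values from the text.
S = 8 * 1024   # average segment size in bytes (8 KB)
R = 0.5        # average compression ratio
C = S * R      # compressed average segment size
print(C)       # 4096.0, i.e. 4 KB
```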
In a file system with stream segmentation and data deduplication, where a write process is being performed, any segment of any file Fi∈{F1, F2, . . . , FN} can potentially deduplicate against any other segment inside the file system. In practice, this means that a segment may not be rewritten if it was already written to the system in the context of a different file. To obtain better data compression ratios, segments may not be compressed individually. Instead, a segment may be bundled inside a sequence of segments that are compressed together as a data block. For example, in a Data Domain file system, segments are stored inside variable-sized compression blocks that can contain from one segment to a few hundred segments. The segments inside these blocks are compressed together in order to achieve better data compression ratios.
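The write path described above can be sketched as follows. This is a minimal illustration, not the Data Domain implementation: it uses fixed-size segments rather than stream segmentation, a SHA-256 fingerprint set for deduplication, zlib for block compression, and an assumed illustrative block size of four segments.

```python
# Minimal sketch of deduplicated writes with block-level compression.
# Fixed-size segmentation, the 4-segment block limit, and zlib are
# simplifying assumptions for illustration only.
import hashlib
import zlib

seen = set()        # fingerprints of segments already stored (dedup index)
block = []          # segments buffered for the next compression block
stored_blocks = []  # compressed blocks "written" to storage
BLOCK_SEGMENTS = 4  # illustrative limit; real compression blocks vary in size

def flush_block():
    """Compress all buffered segments together as one compression block."""
    if block:
        stored_blocks.append(zlib.compress(b"".join(block)))
        block.clear()

def write_file(data, segment_size=8):
    """Split data into segments and store only segments not seen before."""
    for i in range(0, len(data), segment_size):
        seg = data[i:i + segment_size]
        fp = hashlib.sha256(seg).digest()
        if fp in seen:   # duplicate segment: not rewritten
            continue
        seen.add(fp)
        block.append(seg)
        if len(block) >= BLOCK_SEGMENTS:
            flush_block()

# Eight segments, but only two distinct ones; six deduplicate away.
write_file(b"abcdefgh" * 4 + b"ABCDEFGH" * 4)
flush_block()
print(len(seen), len(stored_blocks))  # 2 1
```

Note that, as in the text, the unit of compression is the block of bundled segments, not the individual segment.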
However, because of data deduplication, the segments of file F may be mixed with the segments of any other file(s) in the system. The average data compression ratio will therefore depend not only on the sequence of segments (s1, s2, s3, . . . , sK) of F, but also on the specific way the segments of F are mixed with segments of other files in the system. Further, this mixture may change over time as segments are re-written by processes such as garbage collection, data defragmentation, or any other process that may move data in the underlying file system layers. Furthermore, a user may decide to use a different data compression method for a set of files, which will cause the data for those files to be re-compressed.
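The dependence on segment mixture can be illustrated with a small sketch: the same segment of a file F compresses differently depending on which other segments share its compression block. The segment contents below are invented for illustration.

```python
# Illustrative sketch: a segment's effective compression depends on the
# other segments bundled into the same compression block.
import zlib

seg_f = b"AAAAAAAA"  # a highly redundant segment of file F (assumed data)
seg_g = b"xq9z!k2m"  # a dissimilar segment of some other file G (assumed data)

# Block containing two of F's own segments vs. a block mixing F's segment
# with G's segment. The mixed block compresses less effectively.
alone = len(zlib.compress(seg_f + seg_f))
mixed = len(zlib.compress(seg_f + seg_g))
print(alone < mixed)  # True
```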
In practice, this means that in a file system with data deduplication, the data compression ratio for a file can change over time depending on a variety of considerations. Such considerations may include the specific mixture of segments into compression blocks, the specific processes that are run in the background of the file system, and the specific order in which the files in the file system are written.
In light of problems and shortcomings such as those noted above, it would be useful to be able to efficiently estimate an average data compression ratio R for an ad-hoc set of N files {F1, F2, . . . , FN} at a specific point in time T. As well, it would be useful to estimate an average compression ratio R for a set of N files in an environment where data stream segmentation and/or data deduplication are employed.
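The quantity to be estimated can be stated concretely. As an assumption for illustration, the average compression ratio R for a set of files is defined here as the total compressed size divided by the total original size at the time of measurement; the difficulty described above lies in attributing compressed sizes to files when segments are shared and blocked together, which this sketch does not address.

```python
# Hedged sketch of the target quantity: the average compression ratio R
# for an ad-hoc set of N files, given per-file (original, compressed) sizes.
# Obtaining accurate per-file compressed sizes is the hard part in a
# deduplicating, block-compressing system; this sketch assumes they are known.
def average_compression_ratio(files):
    """files: iterable of (original_size, compressed_size) pairs."""
    total_orig = sum(o for o, _ in files)
    total_comp = sum(c for _, c in files)
    return total_comp / total_orig

# Example: 100 bytes -> 40 bytes and 300 bytes -> 160 bytes.
print(average_compression_ratio([(100, 40), (300, 160)]))  # 0.5
```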