Entities often generate and use data that is important in some way to their operations. This data can include, for example, business data, financial data, and personnel data. If this data were lost or compromised, the entity could suffer significant financial and other adverse consequences. Accordingly, many entities have chosen to back up some or all of their data so that in the event of a natural disaster, unauthorized access, or other event, the entity can recover any data that was compromised or lost, and then restore that data to one or more locations, machines, and/or environments.
While data backup is a valuable and important function, the ever-increasing volume of data that is generated presents significant problems. In particular, many companies today find their backup and recovery processes strained as data growth in enterprise IT environments continues to accelerate at exponential rates, while data-protection solutions have struggled to keep pace.
At least some of the problems encountered in data backup systems and methods concern the amount of physical storage space occupied by data that has been, or will be, stored in the backup system. In particular, there may be a need to be able to determine the amount of physical storage space occupied by the stored data. However, it has proven difficult to make such determinations in some environments, particularly those environments where data stream segmentation and data deduplication are performed.
For example, one specific problem is the inability to measure physical storage space consumed by any ad-hoc user-specified subset of the files in a file system with data deduplication. In such systems, one file may be split into hundreds of millions of segments during the write process. Moreover, these segments may be shared across the newly written file and any other file(s) of the system. The following example helps to illustrate some of the problems encountered in this area.
The physical space of a file F can be denoted as physical_space(F). In such systems, the following will always be true for a file F1 and a file F2:
physical_space(F1) + physical_space(F2) ≥ physical_space({F1, F2})
That is, the physical space of F1 and F2 measured together may be smaller than the physical space of F1 measured in isolation plus the physical space of F2 measured in isolation, and is never larger. This is true because F1 and F2 may share segments, which may be deduplicated during the write process; the two quantities are equal only when the files share no segments. The physical space for the shared segments should be counted only once during the physical space measurement process. If the physical space for the shared segments is counted more than once, the physical space measurement for F1 and F2 will be inaccurate, that is, too high.
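The relationship between separate and combined measurements can be illustrated with a minimal sketch. It is not drawn from any particular deduplication system: the fixed 4-byte segment size, the SHA-256 fingerprints, and the helper names `segment`, `unique_segments`, and `physical_space` are all assumptions made for illustration, and compression is ignored.

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical fixed segment size in bytes, kept tiny for illustration


def segment(data: bytes) -> list[bytes]:
    """Split data into fixed-size segments (real systems often use variable-size segments)."""
    return [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]


def unique_segments(data: bytes) -> dict[str, int]:
    """Map each distinct segment fingerprint in data to that segment's size in bytes."""
    return {hashlib.sha256(s).hexdigest(): len(s) for s in segment(data)}


def physical_space(files: list[bytes]) -> int:
    """Total bytes occupied by the files when each distinct segment is stored only once."""
    stored: dict[str, int] = {}
    for f in files:
        stored.update(unique_segments(f))  # a shared fingerprint is recorded a single time
    return sum(stored.values())


f1 = b"AAAABBBBCCCC"  # segments: AAAA, BBBB, CCCC
f2 = b"BBBBCCCCDDDD"  # segments: BBBB, CCCC, DDDD

measured_separately = physical_space([f1]) + physical_space([f2])  # 12 + 12 = 24
measured_together = physical_space([f1, f2])                       # 4 distinct segments = 16
```

In this sketch the two files share the segments BBBB and CCCC, so summing the per-file measurements counts 8 bytes twice, while measuring the pair as a set counts each distinct segment once.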
In light of problems and shortcomings such as those noted above, it would be useful to be able to efficiently measure the physical storage space consumed by an ad-hoc subset of files in a data protection system. As well, it would be useful to be able to measure physical storage space consumed by an ad-hoc subset of files that have been segmented and deduplicated. Finally, it would be useful to be able to determine, with respect to an ad-hoc subset of files, the set of unique segments shared across the files in that subset.