In data storage systems, space is allocated for storing a primary set of user data. Additional storage may be allocated to provide data protection for the primary set of data. Data protection preserves a copy of the data at one or more points in time. For example, data protection can include snapshot and/or data replication facilities that generate a backup copy of the primary data. The copy protects against data loss in the event of primary data failure.
In a protection storage file system, such as the data protection file systems of EMC Corporation, a file may be protected with several snapshots taken at regular intervals over a given retention period. For example, a given file may be protected for 30 days for backup purposes. If the file has a backup policy of one snapshot a day, then 30 snapshots of the file will be taken and stored in the file system.
The amount of primary storage in a file system may vary over time. The amount of additional storage needed for data protection also varies over time. Allocating too little protection storage risks data loss, while allocating too much results in inefficient storage utilization and/or an increase in the cost of storage.
One technique for determining data protection storage requirements is to estimate the rate of change in the amount of storage used for protecting primary data during a desired retention period, i.e., the amount of time that the copy or copies of the data providing the protection are retained. In a simple protection system where the data is copied in full, determining the amount of storage needed for protection is fairly straightforward: multiply the amount of data being copied by the number of copies being retained. Most modern data protection storage systems, however, conserve storage by tracking changes to the data rather than copying it in full, such as by capturing a delta or generating a change log. Thus, in modern data protection systems, the storage requirements are typically determined based on the rate at which the data changes over the course of the retention period.
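The difference between the two sizing approaches can be illustrated with a minimal sketch. The function names and the constant per-day change figure are assumptions made for illustration only. The change-tracking model also assumes that each retained copy consumes only the bytes that changed since the previous copy.

```python
def full_copy_storage(primary_bytes: int, copies_retained: int) -> int:
    """Simple protection: every retained copy is a full copy of the data."""
    return primary_bytes * copies_retained


def change_tracking_storage(changed_bytes_per_day: int, retention_days: int) -> int:
    """Change tracking: each daily copy stores only the bytes changed that day.

    Assumes (for illustration) a constant daily change and one copy per day.
    """
    return changed_bytes_per_day * retention_days


primary = 1_000_000_000_000        # 1 TB of primary data (hypothetical)
daily_change = 20_000_000_000      # 20 GB changed per day (hypothetical)

print(full_copy_storage(primary, 30))             # 30 full copies
print(change_tracking_storage(daily_change, 30))  # 30 daily deltas
```

Under these assumed figures, the full-copy approach needs 30 TB of protection storage, whereas the change-tracking approach needs 600 GB, which is why the change rate, rather than the data size alone, drives sizing.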
In the context of a file system with data deduplication, however, the task of determining the rate at which the data changes presents particular challenges. In deduplication file systems, a file may be split into hundreds of millions of segments during the write process. Any segment shared between the file being written and any other file is not rewritten; instead, a reference to the existing segment is recorded at the corresponding offset of the file in order to optimize capacity utilization. This makes it difficult to determine how much data was changed without reconstituting the stored file.
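The write path described above can be sketched as follows. This is an illustrative toy, not the implementation of any particular file system: the `DedupStore` class, the fixed 4-byte segment size, and the use of SHA-256 fingerprints are all assumptions chosen to keep the example small.

```python
import hashlib

SEGMENT_SIZE = 4  # tiny fixed size for illustration only (an assumption)


class DedupStore:
    """Toy deduplicating store: each unique segment is kept once."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes
        self.recipes = {}   # file name -> list of (offset, fingerprint)

    def write(self, name: str, data: bytes) -> int:
        """Store a file and return the number of bytes newly written."""
        new_bytes = 0
        recipe = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.segments:
                # Segment not seen before: store it.
                self.segments[fingerprint] = segment
                new_bytes += len(segment)
            # Shared or not, the file records a reference at this offset.
            recipe.append((offset, fingerprint))
        self.recipes[name] = recipe
        return new_bytes


store = DedupStore()
first = store.write("a.txt", b"AAAABBBBCCCC")   # all three segments are new
second = store.write("b.txt", b"AAAAXXXXCCCC")  # only the middle segment is new
print(first, second)
```

In this sketch the second write adds only 4 bytes to the store even though the file is 12 bytes long, which shows why the size of a written file says little about how much data actually changed.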
Even though deduplication file systems maintain an index that maps any offset of any file to a segment, they typically do not maintain the opposite mapping, i.e., a map from a segment to the files that reference it. This is because all of the files in the system can potentially share a segment. Keeping an index data structure per segment would require a prohibitively large amount of storage, given that a file system may contain hundreds of billions of segments or more.
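The asymmetry between the two mappings can be illustrated with a small sketch. The `forward_index` layout, the fingerprint labels, and the function names are hypothetical. The point is only that a forward lookup is a direct access, while answering "which files use this segment?" without a reverse index requires scanning every entry.

```python
# Forward index (assumed layout): (file name, offset) -> segment fingerprint
forward_index = {
    ("a.txt", 0): "fp1",
    ("a.txt", 4): "fp2",
    ("b.txt", 0): "fp1",  # shared with a.txt
    ("b.txt", 4): "fp3",
}


def segment_at(file_name: str, offset: int) -> str:
    """Forward lookup: a single direct access."""
    return forward_index[(file_name, offset)]


def files_referencing(fingerprint: str) -> list:
    """Reverse lookup without a reverse index: scan the whole forward index."""
    return sorted({name for (name, _), fp in forward_index.items() if fp == fingerprint})


print(segment_at("b.txt", 4))
print(files_referencing("fp1"))
```

With four entries the scan is trivial, but its cost grows with the total number of file offsets in the system, and materializing the reverse map instead would add a per-segment structure for every segment stored.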
The task of determining the rate at which the data changes is even more complex when the change rate being determined is for a subset of files, such as files belonging to a particular client of a shared storage system. For example, different clients may issue different sequences of operations on different files, resulting in different amounts of overall changed bytes over the course of a given time period in a shared storage system. In a file system with data deduplication and data protection, this translates into amounts of used capacity over the course of a data protection retention period that differ from one client to the next. Estimating the amount of changed bytes in files for a specific client is essential to correctly sizing the capacity of the file system for each client.
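To make the per-client variation concrete, the following sketch aggregates changed bytes by client. It assumes, purely for illustration, that a log of `(client, file, changed_bytes)` records is already available; as discussed above, obtaining such per-operation figures is precisely what is difficult in a deduplication file system. The client names, file paths, and byte counts are hypothetical.

```python
from collections import defaultdict


def changed_bytes_per_client(operations):
    """Sum changed bytes per client from (client, file, changed_bytes) records."""
    totals = defaultdict(int)
    for client, _file_name, changed_bytes in operations:
        totals[client] += changed_bytes
    return dict(totals)


# Hypothetical operation log for one time period in a shared storage system.
operations = [
    ("client_a", "/a/db.dat", 500),
    ("client_b", "/b/logs.txt", 20),
    ("client_a", "/a/index.dat", 300),
    ("client_b", "/b/logs.txt", 30),
]

totals = changed_bytes_per_client(operations)
print(totals)
```

Here the two clients share the same file system yet change very different amounts of data in the same period, so sizing protection capacity from a system-wide average would overprovision one client and underprovision the other.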