Data storage is a critical component of computing. A computing device includes a storage area in which data is stored for access by the operating system and applications. In a distributed environment, additional data storage may reside on a separate device that the computing device accesses during regular operations. This kind of data storage is generally referred to as primary storage, in contrast with secondary storage, which computing devices can also access but which is generally used for backups. For data protection purposes, it is important to regularly copy data from primary storage to secondary storage. While early backup strategies periodically created complete (full) backups, an alternative technique is to transfer only the data modified since the previous backup (an incremental backup). By stitching the newly modified data together with a previous complete copy on the secondary storage, a new full backup can be reconstructed.
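The incremental technique described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that storage is modeled as a mapping from block numbers to block contents; the names `incremental_changes` and `stitch` are hypothetical, and deleted blocks are not handled.

```python
from typing import Dict

# Hypothetical model: a storage image is a mapping of block number -> contents.
Blocks = Dict[int, bytes]


def incremental_changes(previous: Blocks, current: Blocks) -> Blocks:
    """Return only the blocks that are new or modified relative to `previous`."""
    return {
        block_id: data
        for block_id, data in current.items()
        if previous.get(block_id) != data
    }


def stitch(full_backup: Blocks, changes: Blocks) -> Blocks:
    """Overlay the changed blocks on a prior full backup to form a new full backup."""
    merged = dict(full_backup)  # copy so the prior full backup is left intact
    merged.update(changes)
    return merged


# State of the primary storage when the last full backup was taken.
primary_before = {0: b"boot", 1: b"app-v1", 2: b"logs"}
full_backup = dict(primary_before)

# The primary storage changes: block 1 is modified and block 3 is added.
primary_now = {0: b"boot", 1: b"app-v2", 2: b"logs", 3: b"new-file"}

# Only the two changed blocks need to be transferred to the secondary storage.
changes = incremental_changes(full_backup, primary_now)
new_full_backup = stitch(full_backup, changes)
```

In this sketch only two of the four blocks are transferred, yet the stitched result matches the current contents of the primary storage.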
Data protection is generally performed through data protection scheduling, in which regular copies are made from the primary storage to the secondary storage. Traditionally, data protection scheduling is based on fixed time intervals. However, without knowledge of the status of the primary storage, a backup may not occur at the best time. For example, one may schedule an hourly backup from a primary storage to a backup storage. The hourly interval may not be frequent enough when there are substantial data changes in the primary storage (e.g., when the primary storage concurrently runs multiple applications during the peak hours of a workday). In contrast, the hourly interval may be more frequent than necessary when there are few changes in the primary storage (e.g., when the primary storage is under maintenance during a weekend). Performing data protection efficiently is therefore a challenge.
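The mismatch between a fixed interval and the actual change rate can be made concrete with a small sketch. The hourly change volumes below are invented for illustration only, and `data_per_backup` is a hypothetical helper, not part of any real scheduler.

```python
from typing import List


def data_per_backup(changed_mb_per_hour: List[int], interval_hours: int) -> List[int]:
    """Total changed data (MB) accumulated before each fixed-interval backup runs."""
    return [
        sum(changed_mb_per_hour[start:start + interval_hours])
        for start in range(0, len(changed_mb_per_hour), interval_hours)
    ]


# Hypothetical change volumes, in MB per hour (assumed values, not measurements).
busy_workday_hours = [900, 1200, 1500, 1100]
quiet_weekend_hours = [2, 0, 1, 0]

# With an hourly schedule, each busy-hour backup leaves a large amount of data
# unprotected until it runs, while each quiet-hour backup copies almost nothing.
busy_hourly = data_per_backup(busy_workday_hours, interval_hours=1)
quiet_hourly = data_per_backup(quiet_weekend_hours, interval_hours=1)

# A single backup over the whole quiet period would capture the same changes.
quiet_four_hourly = data_per_backup(quiet_weekend_hours, interval_hours=4)
```

Under these assumed numbers, the same hourly schedule exposes more than a gigabyte of unprotected changes during some busy hours and runs several near-empty backups during the quiet period.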