A business's ability to readily store, process, and transmit data is a facet of operations that the business relies upon to conduct its day-to-day activities. For businesses that increasingly depend upon data for their operations, an inability to store, process, or transmit data can hurt the business's reputation and bottom line. Businesses are therefore taking measures to improve their ability to store, process, transmit, and restore data, and to more efficiently share the resources that enable these operations.
The ever-increasing reliance on data and the computing systems that produce, process, distribute, and maintain data in its myriad forms continues to put great demands on techniques for data protection. Simple systems providing periodic backups of data have given way to more complex and sophisticated data protection schemes that take into consideration a variety of factors, including a wide variety of computing devices and platforms, numerous different types of data that must be protected, speed with which data protection operations must be executed, and flexibility demanded by today's users.
In many cases, disaster recovery involves restoring data to a point in time when the desired data was in a known and valid state. Backup schemes to ensure recoverability of data at times in the past are varied. Such schemes have traditionally included periodic full backups followed by a series of differential backups performed at intervals between the full backups. In such a manner, a data set can be restored at least to a point in time of a differential backup. Such an approach can be resource intensive, as permanent records of the full and differential backups must be kept in order to ensure that one can restore a data set to a state at a particular point in time, especially to a point in the distant past. Further, the process of restoring a data volume from a full backup and a series of differential backups can be time consuming and resource intensive, leading to delays in making the data available to the users.
One approach to providing a less resource-intensive capacity to restore a data set to a particular prior point in time is temporal storage, also known as time-indexed storage and time-addressable storage. Temporal storage can be implemented by associating a temporal volume with a particular data set. A temporal volume maintains non-present data in addition to the data in its present state. A temporal volume maintains the history of data stored on it, thus providing a way for an application to retrieve a copy of the data at any time in the past.
Temporal volumes provide an infrastructure for maintaining and accessing temporal data. Temporal volumes can be used by applications at all levels, including file systems and database management systems. In addition, temporal volumes can also be used as building blocks for data archival, versioning, replication, and backup through integration with file system and backup products. Temporal volumes preserve temporal content so that the content can be used at a later point in time for snapshots, incremental backups, replication, restoring corrupted volumes or deleted files, etc.
In a normal storage volume, when data changes, a data block is changed in situ. In a temporal volume, when a block of data is changed, the existing block can be preserved, and a new data block can be written to a separate location and associated with a time stamp; metadata in the temporal volume is also manipulated to provide a link to the new data block. Old versions of a data block are maintained even when the data block is deleted. This achieves the effect of maintaining copies of one or more states of the data in the past. This process can also be thought of as continuous versioning of the data on the disk volume, and retaining snapshots of the volume whenever it changes. Another temporal storage implementation provides the same effect of maintaining data at points in time by preserving an existing block along with some record of the time of change, and then writing the new data block to the device.
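The first approach described above, in which an existing block is preserved and each new version is recorded with a time stamp, can be illustrated with a minimal sketch. The class and method names (TemporalVolume, write, delete, read) are hypothetical and chosen only for illustration. The sketch assumes that writes arrive in non-decreasing time-stamp order and models a deletion as a version whose data is None.

```python
from bisect import bisect_right


class TemporalVolume:
    """Toy temporal volume: every block keeps a time-ordered list of versions."""

    def __init__(self):
        # block id -> list of (time_stamp, data), in write order
        self._versions = {}

    def write(self, block, data, time_stamp):
        # The existing version is preserved; the new one is appended,
        # never written in place.
        self._versions.setdefault(block, []).append((time_stamp, data))

    def delete(self, block, time_stamp):
        # Deletion is recorded as a new version; older versions remain readable.
        self.write(block, None, time_stamp)

    def read(self, block, as_of=None):
        """Return the block's data as of a time stamp (latest if as_of is None)."""
        versions = self._versions.get(block, [])
        if not versions:
            return None
        if as_of is None:
            return versions[-1][1]
        stamps = [ts for ts, _ in versions]
        i = bisect_right(stamps, as_of)  # number of versions with ts <= as_of
        return versions[i - 1][1] if i else None
```

Reading with an as_of value earlier than the first recorded version returns None, which reflects that the block did not yet exist at that time in this simplified model.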
There are many possible embodiments for temporal volumes. In one embodiment, the contents of a temporal volume can be preserved using an indexing system or structure. An indexing structure can be formed using a space-optimized persistent store by allocating the storage over a cache object. A cache object is a logical storage object that gives an illusion of infinite space, while using only limited actual storage space. The cache object accomplishes this by provisioning storage on an as-needed basis.
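The as-needed provisioning behavior of a cache object can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any particular product: the CacheObject name, its fields, and the choice to raise an error when backing storage runs out are all hypothetical.

```python
class CacheObject:
    """Toy cache object: a large logical address space backed by storage
    that is provisioned only when a logical block is first written."""

    def __init__(self, logical_blocks, physical_blocks):
        self.logical_blocks = logical_blocks    # advertised (logical) size
        self.physical_blocks = physical_blocks  # actual backing capacity
        self._map = {}    # logical block number -> physical slot index
        self._store = []  # physical slots provisioned so far

    def write(self, logical, data):
        if not 0 <= logical < self.logical_blocks:
            raise IndexError("logical block out of range")
        slot = self._map.get(logical)
        if slot is None:
            # First write to this logical block: provision a slot now.
            if len(self._store) >= self.physical_blocks:
                raise MemoryError("backing storage exhausted")
            self._map[logical] = len(self._store)
            self._store.append(data)
        else:
            # Block already provisioned: reuse its slot.
            self._store[slot] = data

    def read(self, logical):
        slot = self._map.get(logical)
        return None if slot is None else self._store[slot]

    def provisioned(self):
        """Number of physical slots actually in use."""
        return len(self._store)
```

In this sketch a cache object advertising a trillion logical blocks consumes physical slots only for the blocks that have actually been written.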
In another embodiment, the temporal volume can be divided into one or more regions. A region may be anywhere from one physical block of the disk to regions of kilobytes, megabytes, gigabytes, etc. Each region can have a time stamp associated with it. Applications accessing the temporal volume can specify the time stamps associated with the regions. Alternatively, a time stamp may be specified by an application or the temporal volume manager when data is written to the temporal volume.
Ideally, a temporal volume stores every change that happens to every block of data. But practically, users may be interested in storing only certain changes or images of the volume at only certain points in time or after a defined event. These points at which data is stored on a temporal volume are “checkpoints” of the data. As discussed below, checkpoints can be linked, for example, to the passage of time, the number of changes to associated data, or to a quantity of changes in a section of a volume. Defining the regularity and system of checkpointing can be done by setting a temporal granularity attribute, which is a policy describing when the changes to data on a temporal volume should be stored. The policy will define when a new checkpoint or image of the data on the volume is created internally. Temporal granularity of data can be supplied and maintained in a temporal volume in several ways, including, but not limited to: zero granularity (also known as continuous checkpointing), periodic granularity (also known as regular checkpointing), fixed change granularity, N-change granularity, and application controlled checkpointing.
Zero granularity, or continuous checkpointing, is the ideal case mentioned above. A temporal volume configured with zero granularity maintains every change to the data. That is, whenever a data block is modified, a checkpoint reflecting the modification to the data block is recorded and associated with a time stamp reflecting the time of change. In general, the time stamp is distinct from the concept of a checkpoint. A checkpoint can be thought of as an index point at which modified data is recorded, while a time stamp reflects the time of the data recordation. When a data block is recorded at a checkpoint, the previous version of the data block is also maintained.
Periodic granularity, or regular checkpointing, represents a scenario in which changes to data are stored only at periodic intervals in time. For example, if the granularity is set to two minutes, then an image of modified data will be retained at a checkpoint only every two minutes.
FIG. 1 illustrates an implementation of temporal data storage using periodic granularity. At time t0, a set of data blocks A-G is recorded (110). This can be considered an initial checkpoint for the data, CP0. Each block of data (115) is associated with a time stamp t0 (120). Subsequent to time t0, data in blocks B and E is modified with data B′ (131) and E′ (133), respectively. The new versions of the data blocks are recorded, but with an empty time stamp (135, 137). The chosen periodicity for maintaining data in this scenario is p; thus, the checkpoint at time t1=t0+p is illustrated at 140 and can be identified as CP1. At CP1, time stamp t1 is associated with all data blocks that have been modified since the previous checkpoint CP0. The illustrated example shows that B″ (141) is associated with time stamp t1 (145) and E′ (143) is also associated with time stamp t1 (147). It should be noted that the further modification B″ (141) has replaced B′ (131). The temporal volume is configured such that for every change in data between CP0 and CP1, any block having an empty time stamp, such as 131, will be overwritten. A further example of periodic granularity checkpointing is shown at 150, reflecting the state of data at a checkpoint CP2 associated with time stamp t2. No changes occurred to B″ between CP1 and CP2 and therefore it is still associated with time stamp t1. Block D was modified between CP1 and CP2 to D′ (183), and the modification is associated with time stamp t2 (185). Block E has undergone a further change to E″ (186). The modified block is associated with time stamp t2 (188). The condition of block E (E′) at CP1 continues to be retained (133).
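The behavior illustrated in FIG. 1 can be approximated by a short sketch. The PeriodicVolume class and its members are hypothetical names. For simplicity, the sketch assumes the checkpoint is triggered by an explicit call rather than by a timer, and that each call advances the checkpoint time by exactly one period p.

```python
class PeriodicVolume:
    """Toy model of periodic granularity: writes between checkpoints carry an
    empty time stamp and may be overwritten; a checkpoint stamps whatever is
    pending at that moment."""

    def __init__(self, initial, t0, period):
        self.period = period
        self.last_checkpoint = t0
        # Initial checkpoint CP0: every block is stamped with t0.
        self.history = {block: [(t0, data)] for block, data in initial.items()}
        # Blocks modified since the last checkpoint (empty time stamp).
        self.pending = {}

    def write(self, block, data):
        # A later write to the same block replaces the pending version.
        self.pending[block] = data

    def checkpoint(self):
        # Stamp all blocks modified since the previous checkpoint.
        self.last_checkpoint += self.period
        for block, data in self.pending.items():
            self.history.setdefault(block, []).append((self.last_checkpoint, data))
        self.pending.clear()
        return self.last_checkpoint
```

Replaying the sequence of FIG. 1 against this sketch (B′ then B″ and E′ before CP1, then D′ and E″ before CP2) retains B″ and E′ at t1 and D′ and E″ at t2, while B′ is lost because it was overwritten before the checkpoint.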
In a temporal volume with an N-change temporal granularity policy, changes to a block of data or a set of data will be retained at a checkpoint only when a set number of modifications to the data have been made.
FIG. 2 illustrates an example implementation of an N-change granularity policy 200. At time t0 (210), data blocks A-G are recorded and associated with time stamp t0. This initial state of the data can be considered as checkpoint CP0. Moving down the table 200 reflects the passage of real time. As time passes, data blocks in the temporal volume can be modified by users or applications. For example, data block B is changed to B′, subsequently to B″, and then to B′″. At this point, data block B has undergone three modifications, with each modification overwriting the previous data in the block. If the N-change granularity policy for the volume is to retain modifications to a block after three changes, then a checkpoint CP1 (220) is recorded for B′″, and B′″ is retained along with a time stamp tB1 reflecting the time at which B′″ was recorded. A subsequent change, B^IV, can be recorded but will be overwritten by future writes until the requisite number of changes to block B occurs. Another example is illustrated with respect to block E, wherein E′″ is recorded at checkpoint CP1 (230) and associated with time stamp tE1, and E^VI is recorded at checkpoint CP2 (240) and associated with time stamp tE2. Similarly, a series of changes to block G is illustrated, wherein G′″ is retained at checkpoint CP1 (250) and associated with time stamp tG1. In N-change granularity, the decision to record a checkpoint is not tied to an actual time of change as in periodic granularity, but rather is linked to the number of changes to a data block over time. Thus, a checkpoint CP1 can occur at differing real times for each block of data, as illustrated (e.g., tB1, tE1, and tG1).
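The per-block counting described for FIG. 2 can be sketched as follows. The NChangeVolume class and its fields are hypothetical names, and the initial checkpoint CP0 is omitted to keep the example short.

```python
class NChangeVolume:
    """Toy model of N-change granularity: a block's data is retained at a
    checkpoint only on every n-th modification of that block."""

    def __init__(self, n):
        self.n = n
        self.counts = {}   # block -> changes since that block's last checkpoint
        self.current = {}  # block -> latest data (freely overwritten)
        self.history = {}  # block -> list of (time_stamp, data) retained

    def write(self, block, data, time_stamp):
        """Apply a write; return True if it was retained at a checkpoint."""
        self.current[block] = data
        self.counts[block] = self.counts.get(block, 0) + 1
        if self.counts[block] == self.n:
            # The n-th change since the last checkpoint is retained.
            self.history.setdefault(block, []).append((time_stamp, data))
            self.counts[block] = 0
            return True
        return False
```

Because each block has its own counter, the checkpoints of different blocks fall at different real times, which corresponds to the distinct time stamps tB1, tE1, and tG1 in the figure.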
A similar granularity policy is fixed-change granularity, where changes to a volume are checkpointed and retained when a set amount of data has changed on the volume. For example, if a granularity attribute is set to ten megabytes, then when ten megabytes of data change on the volume, all modified blocks since the previous checkpoint are associated with a new checkpoint and are retained with an associated time stamp. Unlike with an N-change granularity, the checkpoint associated with each block occurs at the same real time (even though the criterion for checkpointing data is divorced from real time), but the number of changes associated with each individual block of data can differ from block to block within a checkpoint and from checkpoint to checkpoint.
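A fixed-change policy can be sketched in the same style. The FixedChangeVolume name is hypothetical. The sketch also makes an assumption that the text above does not settle: the amount of changed data is measured as the total number of bytes written since the previous checkpoint, so repeated writes to the same block are counted each time.

```python
class FixedChangeVolume:
    """Toy model of fixed-change granularity: a volume-wide checkpoint is
    taken once a set amount of data has been written."""

    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.changed_bytes = 0  # bytes written since the last checkpoint
        self.pending = {}       # block -> latest data since the last checkpoint
        self.history = {}       # block -> list of (time_stamp, data) retained

    def write(self, block, data, time_stamp):
        """Apply a write; return True if it triggered a checkpoint."""
        self.pending[block] = data
        self.changed_bytes += len(data)
        if self.changed_bytes >= self.threshold:
            # All blocks modified since the last checkpoint share one time stamp.
            for b, d in self.pending.items():
                self.history.setdefault(b, []).append((time_stamp, d))
            self.pending.clear()
            self.changed_bytes = 0
            return True
        return False
```

In contrast to the N-change case, every block retained at a given checkpoint carries the same time stamp, while the number of writes each block received before that checkpoint can differ.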
In an application-controlled checkpointing policy, changed data is checkpointed only when an application asks the temporal volume to checkpoint a block of data, a file, a region of data, or the entire volume of data. In application-controlled checkpointing, an application issues an I/O request that specifies a new checkpoint should be created and provides a time stamp to be associated with that checkpoint.
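Application-controlled checkpointing can be sketched as an explicit request carrying an application-supplied time stamp. The AppControlledVolume class and the optional blocks argument are hypothetical; the argument stands in for the ability to checkpoint a block, a region, or the entire volume.

```python
class AppControlledVolume:
    """Toy model of application-controlled checkpointing: nothing is retained
    until the application explicitly requests a checkpoint."""

    def __init__(self):
        self.pending = {}  # block -> latest data not yet checkpointed
        self.history = {}  # block -> list of (time_stamp, data) retained

    def write(self, block, data):
        self.pending[block] = data

    def checkpoint(self, time_stamp, blocks=None):
        """Retain pending data under the supplied time stamp.

        blocks=None checkpoints every modified block (the whole volume);
        otherwise only the named blocks that have pending changes.
        Returns the list of blocks that were checkpointed.
        """
        if blocks is None:
            targets = list(self.pending)
        else:
            targets = [b for b in blocks if b in self.pending]
        for b in targets:
            self.history.setdefault(b, []).append((time_stamp, self.pending.pop(b)))
        return targets
```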
One drawback related to temporal granularity policies is that as time progresses, and therefore the number of stored checkpoints increases, more and more data history accumulates on the volume. For example, if a temporal volume is configured with a periodic granularity policy to checkpoint every second, then volume history accumulates on a per second basis. Maintaining this ever-increasing quantity of data can be costly in terms of resources as additional storage space may need to be dedicated to the temporal volume. Further, older data on a temporal volume can become of less importance as time passes. What is therefore desired is a mechanism to set and enforce a policy decreasing the number of checkpoints for older data, thereby effectively compressing the history, or time axis, of the data on a temporal volume. Such history compression decreases the storage resource requirements for a temporal volume operating over a prolonged period of time.
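To make the notion of compressing the time axis concrete, the following sketch thins older checkpoints while leaving recent ones untouched. It is one hypothetical policy chosen purely for illustration and is not presented as the mechanism referred to above. The function name, its parameters, the use of integer time stamps, and the rule of keeping the last checkpoint in each coarse interval are all assumptions.

```python
def compress_history(versions, now, recent_window, coarse_interval):
    """Thin a block's checkpoint history.

    versions: list of (time_stamp, data) sorted by integer time_stamp.
    Checkpoints no older than recent_window are all kept. Older checkpoints
    are grouped into buckets of coarse_interval, and only the last checkpoint
    in each bucket is kept.
    """
    kept_old = {}  # bucket index -> last (time_stamp, data) seen in that bucket
    recent = []
    for ts, data in versions:
        if now - ts <= recent_window:
            recent.append((ts, data))
        else:
            # A later checkpoint in the same bucket replaces an earlier one.
            kept_old[ts // coarse_interval] = (ts, data)
    return [kept_old[k] for k in sorted(kept_old)] + recent
```

Applied to a history checkpointed every second, a policy of this kind would leave the most recent checkpoints at full resolution while reducing older history to one checkpoint per coarse interval.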