The ability of a business to readily store, process, and transmit data is a facet of operations that the business relies upon to conduct its day-to-day activities. For businesses that increasingly depend upon data for their operations, an inability to store, process, or transmit data can hurt the business's reputation and bottom line. Businesses are therefore taking measures to improve their ability to store, process, transmit, and restore data, and to more efficiently share the resources that enable these operations.
The ever-increasing reliance on data and the computing systems that produce, process, distribute, and maintain data in its myriad forms continues to put great demands on techniques for data protection. Simple systems providing periodic backups of data have given way to more complex and sophisticated data protection schemes that take into consideration a variety of factors, including a wide variety of computing devices and platforms, numerous different types of data that must be protected, speed with which data protection operations must be executed, and flexibility demanded by today's users.
In many cases, disaster recovery involves restoring data to a point in time when the desired data was in a known and valid state. Backup schemes to ensure recoverability of data at times in the past are varied. Such schemes have traditionally included periodic full backups followed by a series of differential backups performed at intervals between the full backups. In such a manner, a data set can be restored at least to the point in time of a differential backup. Such an approach can be resource intensive, as permanent records of the full and differential backups must be kept in order to ensure that one can restore a data set to a state at a particular point in time, especially to a point in the distant past. Further, the process of restoring a data volume from a full backup and a series of differential backups can be time-consuming and resource-intensive, leading to delays in making the data available to the users.
One approach to providing a less resource-intensive capacity to restore a data set to a particular prior point in time is temporal storage, also known as time-indexed storage and time-addressable storage. Temporal storage can be implemented by associating a temporal volume with a particular data set. A temporal volume maintains non-present data in addition to the data in its present state. A temporal volume maintains the history of data stored on the temporal volume, thus providing a way for an application to retrieve a copy of the data at any time in the past. A temporal volume can be a host-based implementation or implemented through an appliance that exports the temporal volume.
Temporal volumes provide an infrastructure for maintaining and accessing temporal data. Temporal volumes can be used by applications at all levels, including file systems and database management systems. In addition, temporal volumes can also be used as building blocks for data archiving, versioning, replication, and backup through integration with file system and backup products. Temporal volumes preserve temporal content so that the content can be used at a later point in time for snapshots, incremental backups, replication, restoring corrupted volumes or deleted files, etc.
In a normal storage volume, when data changes, a data block is changed in situ. In a temporal volume, when a block of data is changed, the existing block can be preserved, and a new data block can be written to a separate location and associated with a time stamp; metadata in the temporal volume is also manipulated to provide a link to the new data block. Old versions of a data block are maintained even when the data block is deleted. This achieves the effect of maintaining copies of one or more states of the data in the past. This process can also be thought of as continuous versioning of the data on the disk volume, and retaining snapshots of the volume whenever the data changes. Another temporal storage implementation provides the same effect of maintaining data at points in time by preserving an existing block along with some record of the time of change, and then writing the new data block to the device.
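The first scheme described above (preserve the existing block, write the new block elsewhere, and link it through time-stamped metadata) can be sketched as follows. This is a minimal in-memory illustration, not an actual temporal volume implementation; the class name `TemporalVolume` and its methods are hypothetical, and time stamps are assumed to be plain numbers that never decrease for a given block.

```python
import bisect


class TemporalVolume:
    """Toy in-memory temporal volume (hypothetical): writes never overwrite."""

    def __init__(self):
        # Per-block metadata: parallel lists of time stamps and block versions.
        self._times = {}
        self._versions = {}

    def write(self, block, timestamp, data):
        """Store a new version of `block`; older versions are preserved."""
        times = self._times.setdefault(block, [])
        versions = self._versions.setdefault(block, [])
        if times and timestamp < times[-1]:
            raise ValueError("time stamps must not decrease for a block")
        times.append(timestamp)
        versions.append(data)

    def read(self, block, timestamp=None):
        """Return the block as of `timestamp` (or its present state)."""
        times = self._times.get(block, [])
        if not times:
            return None
        if timestamp is None:
            return self._versions[block][-1]
        # Find the newest version written at or before `timestamp`.
        i = bisect.bisect_right(times, timestamp)
        return self._versions[block][i - 1] if i else None


vol = TemporalVolume()
vol.write(block=6, timestamp=0, data="c -> inode 3")
vol.write(block=6, timestamp=2, data="(empty)")
print(vol.read(6, timestamp=1))  # c -> inode 3
print(vol.read(6))               # (empty)
```

Keeping the time stamps sorted per block lets a read at an arbitrary past time use a binary search rather than a scan of every version.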
There are many possible embodiments for temporal volumes. In one embodiment, the contents of a temporal volume can be preserved using an indexing system or structure. An indexing structure can be formed using a space-optimized persistent store by allocating the storage over a cache object. A cache object is a logical storage object that gives an illusion of infinite space, while using only limited actual storage space. The cache object accomplishes this by provisioning storage on an as-needed basis.
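The as-needed provisioning behind a cache object can be illustrated with a sparse block store. This is a sketch under assumptions: the name `CacheObject`, the fixed block size, and the zero-fill behavior for unwritten blocks are illustrative choices, not details taken from the description above.

```python
class CacheObject:
    """Hypothetical sparse store: any block number is addressable,
    but backing space is allocated only when a block is first written."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self._blocks = {}  # logical block number -> stored bytes

    def write(self, block_no, data):
        if len(data) > self.block_size:
            raise ValueError("data larger than one block")
        # Allocation happens here, on first write to this block number.
        self._blocks[block_no] = data.ljust(self.block_size, b"\0")

    def read(self, block_no):
        # Blocks that were never written read back as zeros (an assumption).
        return self._blocks.get(block_no, b"\0" * self.block_size)

    def allocated_bytes(self):
        return len(self._blocks) * self.block_size
```

A caller can write to an arbitrarily large block number while `allocated_bytes()` reflects only the blocks actually written.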
In another embodiment, the temporal volume can be divided into one or more regions. A region may range in size from a single physical block of the disk to kilobytes, megabytes, gigabytes, etc. Each region can have a time stamp associated with the region. Applications accessing the temporal volume can specify the time stamps associated with the regions. Alternatively, a time stamp may be specified by an application or the temporal volume manager when data is written to the temporal volume.
Ideally, a temporal volume stores every change that happens to every block of data. But practically, users may be interested in storing only certain changes or images of the volume at only certain points in time or after a defined event. These points at which data is stored on a temporal volume are “checkpoints” of the data. As discussed below, checkpoints can be linked, for example, to the passage of time, the number of changes to associated data, or to a quantity of changes in a section of a volume. Defining the regularity and system of checkpointing can be done by setting a temporal granularity attribute, which is a policy describing when the changes to data on a temporal volume should be stored. The policy will define when a new checkpoint or image of the data on the volume is created internally. Temporal granularity of data can be supplied and maintained in a temporal volume in several ways, including, but not limited to: zero granularity (also known as continuous checkpointing), periodic granularity (also known as regular checkpointing), fixed change granularity, N-change granularity, and application controlled checkpointing.
Zero granularity, or continuous checkpointing, is the ideal case mentioned above. A temporal volume configured with zero granularity maintains every change to the data. That is, whenever a data block is modified, the modification to the data block is recorded and associated with a time stamp reflecting the time of change. In general, the time stamp is distinct from the concept of a checkpoint. A checkpoint can be thought of as an index point at which modified data is recorded, while a time stamp reflects the time of the data recordation. When a data block is recorded at a checkpoint, the previous version of the data block is also maintained.
Periodic granularity, or regular checkpointing, represents a scenario in which changes to data are stored only at periodic intervals in time. For example, if the granularity is set to two minutes, then an image of modified data will be retained only every two minutes.
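Periodic granularity can be sketched as below. This is an illustration only: the class `PeriodicCheckpointer`, the use of seconds as the time unit, and the choice to skip a checkpoint when nothing changed during an interval are all assumptions.

```python
class PeriodicCheckpointer:
    """Hypothetical periodic-granularity policy: only the state of modified
    blocks at each interval boundary is retained."""

    def __init__(self, interval):
        self.interval = interval
        self.next_due = interval
        self.pending = {}      # block -> latest data since the last checkpoint
        self.checkpoints = []  # list of (checkpoint time, {block: data})

    def advance(self, now):
        """Emit any checkpoints that have come due by time `now`."""
        while now >= self.next_due:
            if self.pending:
                self.checkpoints.append((self.next_due, dict(self.pending)))
                self.pending.clear()
            self.next_due += self.interval

    def write(self, now, block, data):
        self.advance(now)
        # Intermediate versions within one interval overwrite each other.
        self.pending[block] = data


vol = PeriodicCheckpointer(interval=120)  # two minutes, in seconds
vol.write(10, 0, "v1")
vol.write(50, 0, "v2")
vol.advance(130)
print(vol.checkpoints)  # [(120, {0: 'v2'})]
```

The version "v1" is not retained, because it was overwritten before the two-minute boundary was reached.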
In a temporal volume with an N-change temporal granularity policy, changes to a block of data or a set of data will be retained with a time stamp only when a set number of modifications to the data have been made.
A similar granularity policy is fixed-change granularity, in which changes to a volume are checkpointed and retained when a set amount of data has changed on the volume. For example, if the granularity attribute is set to ten megabytes, then when ten megabytes of data change on the volume, all blocks modified since the previous time stamp are associated with a checkpoint and retained. Unlike with N-change granularity, the checkpoint associated with each block occurs at the same real time (even though the criterion for checkpointing data is divorced from real time), but the number of changes associated with each individual block of data can differ from block to block and from checkpoint to checkpoint.
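The contrast between N-change and fixed-change granularity can be sketched as follows. Both classes are hypothetical; counting changes per block for the N-change policy and measuring the changed amount in bytes for the fixed-change policy are assumptions made for illustration.

```python
class NChangeVolume:
    """Hypothetical N-change policy: a block is checkpointed after every
    n modifications of that block."""

    def __init__(self, n):
        self.n = n
        self.counts = {}       # block -> modifications since last checkpoint
        self.checkpoints = []  # list of (time, block, data)

    def write(self, now, block, data):
        self.counts[block] = self.counts.get(block, 0) + 1
        if self.counts[block] == self.n:
            self.checkpoints.append((now, block, data))
            self.counts[block] = 0


class FixedChangeVolume:
    """Hypothetical fixed-change policy: all modified blocks are checkpointed
    together once a set number of bytes has changed on the volume."""

    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.changed = 0       # bytes changed since the last checkpoint
        self.dirty = {}        # block -> latest data since the last checkpoint
        self.checkpoints = []  # list of (time, {block: data})

    def write(self, now, block, data):
        self.dirty[block] = data
        self.changed += len(data)
        if self.changed >= self.threshold:
            self.checkpoints.append((now, dict(self.dirty)))
            self.dirty.clear()
            self.changed = 0
```

With the same write sequence (block 0, block 1, block 0), an `NChangeVolume(2)` checkpoints only block 0, while a `FixedChangeVolume` whose threshold is reached on the third write checkpoints both blocks at the same time, even though block 0 changed twice and block 1 once.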
In an application-controlled checkpointing policy, changed data is checkpointed only when an application asks the temporal volume to checkpoint a block of data, a file, a region of data, or the entire volume of data. In application-controlled checkpointing, an application issues an I/O request that specifies a new checkpoint should be created within the temporal volume, rather than providing a time stamp with every write.
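Application-controlled checkpointing can be sketched as a volume with an explicit checkpoint request. The class `AppControlledVolume`, the `checkpoint` method, and the use of a caller-supplied label are hypothetical stand-ins for the I/O request described above.

```python
class AppControlledVolume:
    """Hypothetical volume on which nothing is retained until the
    application explicitly requests a checkpoint."""

    def __init__(self):
        self.current = {}      # block -> present data
        self.checkpoints = []  # list of (label, {block: data})

    def write(self, block, data):
        # Ordinary writes carry no time stamp and create no checkpoint.
        self.current[block] = data

    def checkpoint(self, label, blocks=None):
        """Checkpoint the listed blocks, or the entire volume if none given."""
        if blocks is None:
            selected = dict(self.current)
        else:
            selected = {b: self.current[b] for b in blocks if b in self.current}
        self.checkpoints.append((label, selected))
```

The optional `blocks` argument mirrors the choice between checkpointing a block, a region, or the entire volume.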
A file system can be stored on a temporal volume in much the same manner as a file system can be stored on a normal volume. A file system on a temporal volume will, by its nature, contain file system data at each checkpoint stored on the temporal volume in accord with the selected temporal granularity policy. Issues related to file system data recovery on a normal (non-temporal) volume can also be concerns at each checkpoint on a temporal volume.
In general, a file system is a data structure or a collection of files. In the Unix operating system, for example, “file system” can refer to two distinct things: a directory tree or the arrangement of files on disk partitions. The latter has a tangible physical location and can be thought of as a physical file system, while the former is a logical structure and can be thought of as a logical file system. A physical file system is mounted on a portion of a normal volume called a partition. Partition size determines the amount of volume memory space that the file system can use. Volume memory space is typically divided into a set of uniformly sized blocks that are allocated to store information in the file system. Typical file systems have a superblock, inodes and data blocks.
A superblock stores information about the file system. Such information can include size and status of the file system, a label (file system name and volume name), size of the file system logical block, date and time of the last update to the file system, summary data block, file system state, extent maps, directories, free inode maps, and a path name of a last mount point of the file system. A superblock can also include references to the location of additional file system structural files. A superblock contains critical data related to the file system without which the file system could not be accessed, and therefore often multiple, redundant superblocks are made when a file system is created. The summary data block within the superblock can record changes that take place as the file system is used and can include the number of inodes, directories, fragments, and storage blocks within the file system.
Information about each file in a file system can be kept in a structure called an inode. An inode contains pointers to disk blocks of one or more volumes containing data associated with a file, as well as other information such as the type of file, file permission bits, owner information, file size, file modification time, etc. This additional information is often referred to as metadata. Pointers in an inode point to data blocks or extents on the volume in file system memory space.
The rest of the space that is allocated to a file system contains data blocks or extents. The size of a data block is determined when a file system is created. For a regular file, data blocks contain the contents of the file. For a directory, the data blocks contain entries that give the inode number and file name of each file in the directory. Blocks that are not currently being used as inodes, indirect address blocks, or data blocks can be marked as free in the superblock. Further, a list of inodes in the file system is also maintained, either in the superblock or referenced by the superblock.
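The relationship among inodes, directory data blocks, and file data blocks can be illustrated with a small path lookup. This is a simplified sketch: the `Inode` fields, the `resolve` helper, and the block numbers are illustrative assumptions and do not correspond to any particular on-disk format.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    number: int
    file_type: str            # "file" or "dir"
    mode: int = 0o644         # permission bits
    owner: str = "root"
    size: int = 0
    mtime: float = 0.0        # modification time
    blocks: list = field(default_factory=list)  # pointers to data blocks


def resolve(path, root, inodes, data_blocks):
    """Walk `path` from the `root` inode, one directory entry at a time."""
    inode = root
    for name in path.split("/"):
        if inode.file_type != "dir":
            raise NotADirectoryError(name)
        # A directory's data blocks hold (file name -> inode number) entries.
        entries = {}
        for block_no in inode.blocks:
            entries.update(data_blocks[block_no])
        if name not in entries:
            raise FileNotFoundError(path)
        inode = inodes[entries[name]]
    return inode


data_blocks = {5: {"a": 2}, 6: {"c": 3}, 8: b"hello"}
inodes = {
    1: Inode(1, "dir", blocks=[5]),
    2: Inode(2, "dir", blocks=[6]),
    3: Inode(3, "file", size=5, blocks=[8]),
}
found = resolve("a/c", inodes[1], inodes, data_blocks)
print(found.number, data_blocks[found.blocks[0]])  # 3 b'hello'
```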
In a file system on a normal volume, whenever files are created, extended, truncated or deleted, the file system updates inodes and other metadata that make a file system disk image self-describing. Many file system operations involve multiple metadata changes. For example, when a file is extended, its inode must be updated to reflect the extension and the storage space into which the file is extended must be removed from the file system's free space pool. Most file systems cache metadata changes and write them lazily in order to improve I/O performance. Lazy writing of metadata changes creates a possibility that cached metadata updates may be lost in the event of a system crash, thereby making the file system metadata inconsistent with actual data.
One method of verifying and repairing file system integrity, including metadata inconsistency, is to run a program that validates file system metadata and repairs the metadata, if necessary, before the file system is mounted. Such file system validation programs (e.g., fsck (Unix) and CHKDSK (Microsoft Windows®)) can perform tasks such as verifying that disk blocks are not lost or multiply allocated. File system validation programs can also undo partially complete updates, causing recent actions to be removed, but ultimately leaving the file system structurally intact. Such repair programs can take a long time to run, and the file system cannot be mounted until the checking is complete.
An alternate recovery technique is used by journaling file systems, which log their intent to update metadata before actually updating the metadata. Each time metadata changes in a journaling file system (e.g., when a file or directory is created, extended, or deleted), the file system logs a description of the updates that constitute the change before performing them. When recovering from a system failure, a journaling file system reads its log and verifies that all metadata updates described in the log are reflected on the storage device. At any instant, the number of metadata updates described in an intent log is a small fraction of the total amount of metadata in a large file system. Therefore, log-based recovery enables file systems to recover from a system crash more quickly than a file system verification program. Similar log-based recovery is available with other types of journaling software, such as databases.
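The log-then-update discipline and log-based recovery can be sketched as follows. This is a minimal illustration, not the mechanism of any specific file system: the class `JournaledMetadata` is hypothetical, metadata is modeled as a flat key/value map, logged updates are assumed to be idempotent (they set absolute values), and the `limit` parameter exists only to simulate a crash partway through a transaction.

```python
class JournaledMetadata:
    """Hypothetical metadata store that logs intent before updating."""

    def __init__(self, on_disk=None):
        self.on_disk = dict(on_disk or {})  # metadata on the storage device
        self.log = []                       # intent log: (txn id, updates)

    def begin(self, txn_id, updates):
        """Log a description of the updates before performing any of them."""
        self.log.append((txn_id, list(updates)))

    def apply(self, txn_id, limit=None):
        """Perform a logged transaction; `limit` stops early to mimic a crash."""
        for logged_id, updates in self.log:
            if logged_id == txn_id:
                for key, value in updates[:limit]:
                    self.on_disk[key] = value

    def replay(self):
        """Recovery: make sure every logged update is reflected on disk."""
        for _, updates in self.log:
            for key, value in updates:
                self.on_disk[key] = value
        self.log.clear()


# Extending a file touches both its inode and the free space pool.
fs = JournaledMetadata({"inode7.size": 1, "free_blocks": 10})
fs.begin(1, [("inode7.size", 2), ("free_blocks", 9)])
fs.apply(1, limit=1)   # crash after the first update
print(fs.on_disk)      # {'inode7.size': 2, 'free_blocks': 10}
fs.replay()
print(fs.on_disk)      # {'inode7.size': 2, 'free_blocks': 9}
```

Recovery cost here depends on the length of the log rather than on the total amount of metadata, which is the reason log replay is faster than a full validation pass.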
FIG. 1A illustrates a series of changes in data blocks 0-8 in a journaling file system while performing two tasks. The nine blocks in the example file system represent the following types of data:
Block No.    Contents
0            Superblock
1, 2, 3      Intent log
4, 5         Inodes
6            Directory block for directory “a”
7            Directory block for directory “b”
8            Data for inode #3

The initial state of the illustrated file system contains a superblock S0 and no records in the log; block 4 contains inode 3 (“i3”), and directory block “a” contains an association of a name “c” with inode 3. Two transactions will be performed upon this data. First, a transaction “rename a/c b/d” will be performed, which requires: (i) writing a log record related to the transaction, (ii) removing the entry associating “c” with inode 3 from directory “a”, and (iii) adding an entry associating “d” with inode 3 in directory “b”. Concurrently, another transaction “create a/e” is conducted that requires: (i) writing a log record for the transaction, (ii) allocating inode i4, and (iii) entering i4 into directory “a” and associating the inode with name “e”. In FIG. 1A, these transactions are shown step-by-step taking place at discrete times t0-t6 in the table. Quotation marks in the table imply that data in the block is the same as that in the prior time step. For example, the steps involved in the rename operation are:
t1: a transaction log entry is made into block 1;
t2: the entry associating c with inode 3 is removed from the directory block for directory a; and
t4: an entry associating name d with inode 3 is entered into the directory block for directory b.
A similar set of entries is shown for the create operation.
At times t2 and t3, the file system metadata is inconsistent with file system data. Inode i3 has been orphaned, meaning the inode has no name space entry. Should the system crash at this point there would be an inconsistent disk image to recover. The transaction log entry in block 1 allows the system to replay the transaction and thereby create an image in which the metadata is consistent with the data. “Replaying the log” means carrying out all pending transactions listed in the intent log (e.g., blocks 1, 2, and 3). FIG. 1B illustrates the data at t2 before replaying the log image (150) (this is the same as the data shown in FIG. 1A at t2), and the data in the file system at t2 after replaying the log (160). The post-replay state is metadata consistent because a name space entry is now present for inode 3 (i.e., name d is associated with inode 3 in directory b) and the transaction is indicated as being completed in block 2.
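The state at t2 and the effect of replaying the log can be worked through in a few lines. This sketch models only the rename transaction, because the text lists exact time steps for that transaction alone; the record layout and the helper names `orphaned_inodes` and `replay` are hypothetical.

```python
def orphaned_inodes(allocated, directories):
    """Return allocated inodes that have no name space entry."""
    named = {ino for entries in directories.values() for ino in entries.values()}
    return sorted(allocated - named)


def replay(log, directories):
    """Carry out every pending transaction in the intent log.
    Each step is idempotent, so already-completed steps are harmless."""
    for record in log:
        if record["op"] == "rename":
            src_dir, src_name = record["src"]
            dst_dir, dst_name = record["dst"]
            directories[src_dir].pop(src_name, None)
            directories[dst_dir][dst_name] = record["inode"]


# State at t2: the log record exists and "c" has been removed from
# directory "a", but "d" has not yet been added to directory "b".
allocated = {3}
directories = {"a": {}, "b": {}}
log = [{"op": "rename", "src": ("a", "c"), "dst": ("b", "d"), "inode": 3}]

print(orphaned_inodes(allocated, directories))  # [3]
replay(log, directories)
print(directories)                              # {'a': {}, 'b': {'d': 3}}
print(orphaned_inodes(allocated, directories))  # []
```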
A temporal volume storing the file system in FIG. 1A has checkpoint images of the state of the file system at each instance in time t0-t6. Therefore, should a user or application attempt to access the file system at a time in the past, for example, t2, the user or application will find that the state of the file system can be metadata inconsistent. The state of the file system at each instance in time will be the same as the state of the file system had there been a system crash at that time. What is therefore needed is a mechanism for maintaining metadata consistent images of a file system, or other types of data journaling software, stored on a temporal volume at each checkpoint stored on the temporal volume.