When a computer system crashes while updating a data structure on a non-volatile storage device (e.g., a disk), the data structure may become corrupted. A data structure typically contains inter-related portions of data. When the data structure is only partially updated (e.g., because a crash prevented the completion of an update), the inter-relation among portions of the data may become invalid, leaving the data structure in an inconsistent state.
For example, a file system typically contains metadata, which organizes user data in a storage unit. A file system typically includes metadata to describe the location, the size, and other information about files in the storage unit. A file system may also maintain metadata to identify the free space on the storage unit which can be allocated for the storage of additional data. If the file system metadata is in an inconsistent state, the system may crash or corrupt user data during operation.
An operating system is typically programmed to update and access the file system metadata in a consistent fashion. The file system metadata may be cached in the volatile memory of the computer system for fast access. To shut down a file system cleanly, an operating system typically puts the metadata of the file system on a non-volatile storage device in a consistent state by completing any pending write operations and flushing the data from cache into the non-volatile storage device.
However, if a computer crashes, unexpectedly reboots or loses power, the file system metadata on the non-volatile disk storage may suffer corruption if the metadata is only partially updated. Thus, after an unclean shutdown, an operating system typically checks the file system metadata for consistency to validate the file system.
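The kind of consistency check described above can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from any particular file system: each file maps to a list of block numbers, and the free space is a set of block numbers. The function name `check_metadata` and its parameters are hypothetical.

```python
from typing import Dict, List, Set


def check_metadata(files: Dict[str, List[int]],
                   free_blocks: Set[int],
                   total_blocks: int) -> List[str]:
    """Return descriptions of inconsistencies; an empty list means consistent.

    Hypothetical simplified model: `files` maps a file name to the blocks it
    uses, and `free_blocks` is the free-space metadata.
    """
    problems: List[str] = []
    owner: Dict[int, str] = {}
    for name, blocks in files.items():
        for b in blocks:
            # A block must belong to at most one file.
            if b in owner:
                problems.append(
                    f"block {b} claimed by both {owner[b]} and {name}")
            else:
                owner[b] = name
            # A block in use must not also be listed as free.
            if b in free_blocks:
                problems.append(f"block {b} used by {name} but marked free")
    # Every block must be either in use or free.
    for b in range(total_blocks):
        if b not in owner and b not in free_blocks:
            problems.append(f"block {b} is neither used nor free")
    return problems


# A file was extended to block 2, but the crash happened before the
# free-space metadata was updated.
print(check_metadata({"a": [0, 1, 2]}, {2, 3}, 4))
```

Even in this toy model the check visits every block, which hints at why checking a large file system can be slow.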
On large file systems, checking the file system metadata for consistency can take a very long time. Further, a repair process that follows the check may not always be able to fix all possible types of corruption. After a crash, a recovery process may need user intervention to bring the file system metadata into a consistent state.
Data consistency is also of concern to database users. Traditionally, databases use transaction processing techniques to maintain database consistency in the presence of a system crash. One transaction processing technique is to group one or more write operations into a transaction so that the database is consistent before and after the transaction. The operations of a transaction are logged but not performed until a request to commit the transaction is received. A transaction commit operation updates the database according to the log. The log is typically in a form such that, after a partial commitment of the transaction, the database system can roll back to the state before the transaction or replay the log to reach the state after the transaction.
Before and after the execution of the transaction commit operation, the database is in a consistent state; during the execution of the transaction commit operation, the database is typically in an inconsistent state. If a crash happens during the execution of the transaction commit operation, the log can be used to roll back to the consistent before-transaction state or replayed to reach the consistent after-transaction state.
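As an illustration of the roll-back and replay behavior, the following sketch models a database as an in-memory dictionary and a log as a list of (key, old value, new value) entries. The names `make_log`, `commit`, `roll_back`, and `replay`, and the `crash_after` parameter used to simulate a crash, are assumptions made for this example; a real database would keep the log on stable storage.

```python
from typing import Dict, List, Optional, Tuple

# (key, value before the transaction or None if absent, value after)
LogEntry = Tuple[str, Optional[int], int]


def make_log(db: Dict[str, int], writes: Dict[str, int]) -> List[LogEntry]:
    """Log the operations of a transaction without performing them."""
    return [(key, db.get(key), new) for key, new in writes.items()]


def commit(db: Dict[str, int], log: List[LogEntry],
           crash_after: Optional[int] = None) -> bool:
    """Apply the log; optionally stop after `crash_after` writes (simulated crash)."""
    for i, (key, _old, new) in enumerate(log):
        if crash_after is not None and i >= crash_after:
            return False  # crashed in the middle of the commit
        db[key] = new
    return True


def roll_back(db: Dict[str, int], log: List[LogEntry]) -> None:
    """Restore the consistent before-transaction state."""
    for key, old, _new in reversed(log):
        if old is None:
            db.pop(key, None)
        else:
            db[key] = old


def replay(db: Dict[str, int], log: List[LogEntry]) -> None:
    """Reach the consistent after-transaction state."""
    for key, _old, new in log:
        db[key] = new


# Transfer 30 units from "a" to "b"; the total must stay at 100.
db = {"a": 100, "b": 0}
log = make_log(db, {"a": 70, "b": 30})
commit(db, log, crash_after=1)   # crash after the first write
print(db)                        # inconsistent: the total is no longer 100
replay(db, log)
print(db)                        # consistent after-transaction state
```

Both recovery functions are idempotent in this sketch, so running one of them again after a second crash during recovery is harmless.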
In a journaling file system, a complete set of modifications made to the on-disk structure of the file system is organized as a transaction. In a way similar to the database operations, a journaling file system maintains a log of the operations to perform one or more transactions. After a crash, uncompleted transactions can be replayed according to the log to bring the system to a consistent point.
Certain copy-on-write file systems maintain multiple versions of files. For example, a Write Anywhere File Layout (WAFL) file system has algorithms and data structures to implement snapshots, which are read-only clones of the active file system. WAFL stores metadata in files, including the inode file which contains the inodes for the file system, the block-map file (e.g., in the form of a bit map or an extent map) which identifies free blocks, and the inode-map file which identifies free inodes. An inode typically includes information about a file, such as user and group ownership, access mode (e.g., read, write, execute permissions) and type, locking information, the number of links to the file, the size of the file, access and modification times, the addresses of the blocks of the file, etc. WAFL keeps metadata in files so that metadata blocks can be written anywhere on disk.
A WAFL file system is in the form of a tree of blocks. At the root of the tree is the root inode that describes the inode file. The inode file contains the inodes that describe the rest of the files in the file system, including the block-map and inode-map files. The leaves of the tree are the data blocks of the files.
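The tree structure can be sketched as follows. This is a simplified illustration, not WAFL's actual on-disk layout: the `Inode` and `FileSystem` classes, the integer block addresses, and the `reachable_blocks` helper are hypothetical, and indirect blocks are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Inode:
    name: str
    blocks: List[int] = field(default_factory=list)  # block addresses


@dataclass
class FileSystem:
    # The root inode describes the inode file.
    root_inode: Inode
    # Contents of the inode file: inode-file block address -> inodes stored there.
    inode_file: Dict[int, List[Inode]]


def reachable_blocks(fs: FileSystem) -> Set[int]:
    """Walk the tree from the root inode down to the data blocks (the leaves)."""
    seen: Set[int] = set()
    for addr in fs.root_inode.blocks:
        seen.add(addr)                   # a block of the inode file
        for inode in fs.inode_file[addr]:
            seen.update(inode.blocks)    # blocks of the file this inode describes
    return seen


fs = FileSystem(
    root_inode=Inode("inode file", [1]),
    inode_file={1: [Inode("block-map", [2]),
                    Inode("inode-map", [3]),
                    Inode("user.txt", [4, 5])]},
)
print(sorted(reachable_blocks(fs)))
```

Because every block is reached from the root inode, a given root inode identifies one complete version of the file system in this model.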
WAFL creates a special snapshot periodically (e.g., every few seconds) to obtain a completely self-consistent image of the entire file system and mark a consistency point. Between consistency points, WAFL writes data only to blocks that are not in use, so the tree of blocks representing the most recent consistency point remains completely unchanged. WAFL uses non-volatile RAM (NVRAM) (e.g., special memory with batteries that allow it to store data even when system power is off) to keep a log of write requests processed since the last consistency point. After an unclean shutdown, WAFL replays any requests in the log to prevent data loss.
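The interplay of consistency points, writes to unused blocks, and the NVRAM log can be illustrated with a toy model. It is only a sketch under simplifying assumptions: the class `CowStore` and its methods are hypothetical, each file occupies a single block, and blocks are never reclaimed.

```python
from typing import Dict, List, Tuple


class CowStore:
    """Toy copy-on-write store with consistency points and a request log."""

    def __init__(self) -> None:
        self.blocks: Dict[int, str] = {}            # simulated disk
        self.next_free = 0                          # next unused block address
        self.consistent: Dict[str, int] = {}        # file -> block at last consistency point
        self.active: Dict[str, int] = {}            # volatile, in-memory view
        self.nvram_log: List[Tuple[str, str]] = []  # survives a crash

    def _apply(self, name: str, data: str) -> None:
        # Never overwrite a block in use: always write to a fresh block.
        addr = self.next_free
        self.next_free += 1
        self.blocks[addr] = data
        self.active[name] = addr

    def write(self, name: str, data: str) -> None:
        self.nvram_log.append((name, data))  # log the request first
        self._apply(name, data)

    def consistency_point(self) -> None:
        # The active view becomes the new consistent image; the log restarts.
        self.consistent = dict(self.active)
        self.nvram_log.clear()

    def crash_and_recover(self) -> None:
        # Volatile state is lost; restart from the last consistency point
        # and replay the logged requests.
        self.active = dict(self.consistent)
        for name, data in list(self.nvram_log):
            self._apply(name, data)

    def read(self, name: str) -> str:
        return self.blocks[self.active[name]]


store = CowStore()
store.write("f", "v1")
store.consistency_point()
store.write("f", "v2")        # goes to a new block; the "v1" block is untouched
store.crash_and_recover()
print(store.read("f"))        # the logged write was replayed
```

In this model the block holding "v1" is never modified after the consistency point, so the consistent image stays valid regardless of when the crash occurs.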