Write-ahead logging (WAL) in general refers to a family of techniques for providing atomicity and durability, e.g., in connection with database systems. In a system using WAL, all modifications are written to a log before they are applied, and “undo” and “redo” information typically is stored in the log. WAL can be useful in instances where an application program encounters a fault (e.g., its supporting hardware failing, losing power, etc.) while performing an operation. By using a write-ahead log, the application program may be able to check the log and compare what it was supposed to be doing when the fault occurred to what actually was done. In other words, the application may be able to consult the log to help determine whether the operation it was performing succeeded, partially succeeded, or failed, and then make a decision to undo what it had started, complete what it had started, keep things as they are, and/or take some other action.
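By way of illustration only, the log-before-apply discipline described above might be sketched as follows. The class name TinyWal, its fields, and the record layout are hypothetical and are not drawn from any particular system; the sketch also assumes that None is never stored as a value, so that None can mark a key that did not previously exist.

```python
class TinyWal:
    """Toy write-ahead log: every change is logged before it is applied."""

    def __init__(self):
        self.log = []    # stands in for durable, append-only storage
        self.data = {}   # stands in for the primary data structure

    def put(self, key, value):
        # Record both the old value (undo) and the new value (redo) first ...
        self.log.append({"key": key, "undo": self.data.get(key), "redo": value})
        # ... and only then modify the data itself.
        self.data[key] = value

    def undo_last(self):
        # Use the logged undo information to roll back the latest change.
        record = self.log.pop()
        if record["undo"] is None:
            self.data.pop(record["key"], None)
        else:
            self.data[record["key"]] = record["undo"]
```

After a fault, a program built along these lines could inspect the final log record to decide whether to redo the change, undo it, or leave the data as it is.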
Write-ahead logs are mainly used in three areas, namely, as a transaction log in a database system that provides guaranteed atomicity and durability (using the common ACID definitions), as a journal in a journaled file system implementation, and as the log in a log-structured file system. Each of these areas is discussed, in turn, below. In the meantime, it is noted that, as is known, ACID refers to Atomicity, Consistency, Isolation, and Durability, which are a set of properties that help guarantee that database transactions are processed reliably. “Atomicity” refers to each transaction being “all or nothing”; “consistency” helps ensure that any transaction will bring the database from one valid state to another; “isolation” helps ensure that concurrent execution of transactions results in a system state that would have resulted if the transactions were executed serially; and “durability” implies that once the transaction has been committed, it will remain so even in the event of power loss, crashes, errors, etc.
The Algorithms for Recovery and Isolation Exploiting Semantics (ARIES) log recovery protocol has become a standard technique for database transaction logs. ARIES in general involves maintaining dirty page tables (DPT) and transaction tables (TT) in a log. The DPT maintains a record of all of the pages that have been modified and not yet written back to disk, together with the log sequence number (LSN) of the first log entry that caused each page to become dirty. The TT, on the other hand, includes all of the transactions that are currently running, and the LSN of the last log entry each of them caused. A checkpoint is a known good point from which a database engine can start applying changes contained in the log during recovery, e.g., after an unexpected shutdown or crash. In the context of the ARIES protocol, a DPT and TT together form a checkpoint.
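For purposes of illustration, the bookkeeping behind the two tables might be sketched as follows. The tuple layout, the function name build_tables, and the explicit "flush" record are assumptions made for this example and are considerably simpler than a real ARIES implementation.

```python
def build_tables(log):
    """Derive a dirty page table and a transaction table from log records.

    Each record is a hypothetical tuple: (lsn, txn_id, kind, page_id).
    """
    dirty_pages = {}    # page_id -> LSN of the first record that dirtied it
    transactions = {}   # txn_id -> LSN of the last record the transaction wrote
    for lsn, txn_id, kind, page_id in log:
        if kind == "update":
            dirty_pages.setdefault(page_id, lsn)  # keep only the first LSN
            transactions[txn_id] = lsn
        elif kind == "commit":
            transactions.pop(txn_id, None)        # no longer running
        elif kind == "flush":
            dirty_pages.pop(page_id, None)        # page written back to disk
    return dirty_pages, transactions
```

Persisting the two returned tables together would correspond, in this simplified picture, to taking a checkpoint.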
Recovery from a checkpoint according to the ARIES protocol generally involves three phases, e.g., as shown in FIG. 1. First, an analysis phase involves scanning forward from the checkpoint, updating the dirty page and transaction tables along the way. Second, a redo phase scans forward through the log again, checking log sequence numbers (LSNs) of the found entries against the DPT, and then subsequently against the page itself, and redoing actions, if necessary. Third, an undo phase unwinds any torn transactions (e.g., transactions that are not completely written and then committed) at the time of the crash by traversing the log backwards, e.g., using backchaining pointers through each transaction. Some commercially available products have created variants of the ARIES protocol, e.g., focusing on ways of separating redo and undo logs into separate log structures, handling node failure and recovery in clustered databases, etc. A chaining pointer refers to the part of a data item in a chained list that gives the address of the next data item, and a backchaining pointer similarly refers to the part of a data item in a chained list that gives the address of the previous data item.
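The three phases described above might be sketched, in heavily simplified form, as follows. The record layout and the function name recover are hypothetical; the sketch starts from the beginning of the log rather than from a stored checkpoint, and it omits compensation log records and backchaining pointers, so it should be read as an outline of the idea rather than as the ARIES protocol itself.

```python
def recover(log, pages):
    """Toy three-phase recovery over `log`, mutating `pages` in place.

    log:   list of (lsn, txn, kind, page, old, new) tuples in LSN order
    pages: dict mapping page -> (page_lsn, value), as found on disk
    """
    # Phase 1, analysis: rebuild the dirty page table and the set of
    # transactions that never committed ("losers").
    dirty, losers = {}, set()
    for lsn, txn, kind, page, old, new in log:
        if kind == "update":
            dirty.setdefault(page, lsn)
            losers.add(txn)
        elif kind == "commit":
            losers.discard(txn)

    # Phase 2, redo: reapply updates that the on-disk page has not seen.
    for lsn, txn, kind, page, old, new in log:
        if kind != "update" or lsn < dirty.get(page, lsn + 1):
            continue
        page_lsn, _ = pages.get(page, (0, None))
        if page_lsn < lsn:
            pages[page] = (lsn, new)

    # Phase 3, undo: walk backwards, rolling back the losers' updates.
    for lsn, txn, kind, page, old, new in reversed(log):
        if kind == "update" and txn in losers:
            pages[page] = (lsn, old)
    return losers
```

Note that even this toy version reads the log three times, twice forwards and once backwards.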
A journaled file system refers to a file system that keeps track of the changes that will be made in a journal (usually using a circular log in a dedicated area of the file system) before committing them to the main file system. A Journaling Block Device (JBD) is the block device layer within the Linux or other kernel used by ext3, ext4, and OCFS2 (Oracle Cluster Filesystem), for example. Recovery under JBD2, the successor to JBD, involves a three-pass process, e.g., as shown in FIG. 2. The first pass involves scanning forward through the log for the last valid commit record. This establishes the end of the log. The second pass involves scanning forward through the log and assembling a list of revoked blocks, which are blocks that have been invalidated by subsequent log writes, where each transaction commit writes out the list of blocks it revokes to the log at commit time. The third pass involves scanning forward through the log and writing all of the non-revoked blocks.
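A minimal sketch of such a three-pass replay is given below. The record shapes, the function name replay_journal, and the rule that a revoke issued by one transaction suppresses copies of the block logged by that transaction or an earlier one are assumptions made for illustration; they do not reproduce the on-disk journal format.

```python
def replay_journal(journal, disk):
    """Toy three-pass replay; `journal` is a list of tuples in log order.

    Hypothetical record shapes:
      ("block", txn, block_no, data), ("revoke", txn, block_no), ("commit", txn)
    """
    # Pass 1: find the last commit record; anything after it is discarded.
    end = 0
    for i, rec in enumerate(journal):
        if rec[0] == "commit":
            end = i + 1
    valid = journal[:end]

    # Pass 2: collect revoked blocks and the newest transaction revoking each.
    revoked = {}
    for rec in valid:
        if rec[0] == "revoke":
            _, txn, block_no = rec
            revoked[block_no] = max(txn, revoked.get(block_no, txn))

    # Pass 3: write every logged block that was not revoked afterwards.
    for rec in valid:
        if rec[0] == "block":
            _, txn, block_no, data = rec
            if block_no in revoked and txn <= revoked[block_no]:
                continue
            disk[block_no] = data
    return disk
```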
There are several types of log-structured file systems. For instance, the Journaling Flash File System version 2 (JFFS2) performs a complete scan of the medium on mount and constructs an in-memory representation of the directory structure of the file system. Revoked log entries can be identified in this scan as each node in JFFS2 is versioned, with only the most recent version for each block being active. JFFS2, oftentimes used in flash memory systems and included in the Linux kernel, is the successor to JFFS, which also is a log-structured file system, e.g., for use on NOR flash memory devices on the Linux operating system.
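The version-based selection performed during such a scan might be sketched as follows. The tuple layout and the function name scan_medium are hypothetical and are not intended to reflect the actual JFFS2 node format.

```python
def scan_medium(nodes):
    """Return the live node for each identifier after a full forward scan.

    nodes: iterable of hypothetical (ident, version, payload) tuples in the
    order they appear on the medium.
    """
    live = {}
    for ident, version, payload in nodes:
        current = live.get(ident)
        # A node is superseded as soon as a higher version is seen.
        if current is None or version > current[0]:
            live[ident] = (version, payload)
    return live
```

In this sketch every node on the medium is still visited, even though only the highest version of each survives.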
The Unsorted Block Image File System (UBIFS) was originally known as JFFS3 and in essence is a hybrid of ReiserFS and JFFS2. It stores the file system as one large B-tree on the medium, and updates to the file system are written to the various journal blocks that are scattered through the file system. Mutations on the B-tree are recorded in a write-back journal cache (the “journal tree”). Mutative operations are operations on a system that trigger writes to the write-ahead log, and user mutations are mutations sourced from outside the system (e.g., where the nature of the mutations is sometimes outside the system's control). The journal tree is then periodically written down to the medium. Recovery at mount time involves identifying the journal blocks and then rescanning them to rebuild the journal tree. UBIFS also may use a wandering tree where the lowest node in the tree (i.e., the data) is written first and each node is written ascending the tree, until the root node is updated.
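The wandering tree update order described above (data first, then each ancestor, ending with the root) might be sketched as follows. The use of a Python list as an append-only medium, the dict-based node layout, and the function name wandering_update are assumptions made purely for illustration.

```python
def wandering_update(medium, root_addr, path, new_data):
    """Append a new leaf and fresh copies of its ancestors; return new root.

    medium:    list used as an append-only store; an index is an address
    root_addr: address of the current root node
    path:      child keys leading from the root to the leaf
    Interior nodes are dicts of child key -> address (a hypothetical layout).
    """
    # Read the existing interior nodes along the path, top-down.
    ancestors = []
    addr = root_addr
    for key in path:
        node = medium[addr]
        ancestors.append((node, key))
        addr = node[key]

    # Write the data first, then each parent, ascending to the root.
    medium.append(new_data)
    child_addr = len(medium) - 1
    for node, key in reversed(ancestors):
        medium.append({**node, key: child_addr})
        child_addr = len(medium) - 1
    return child_addr
```

Because the old root is never overwritten, the previous tree remains readable until the new root has been written, which is what makes the update appear atomic in this sketch.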
UBIFS may be used with raw flash memory media and is a competitor to Log File System (LogFS) as a file system. LogFS is a Linux log-structured and scalable flash file system, oftentimes used on devices with large flash memories. LogFS works in a manner similar to UBIFS, but without the journal tree, e.g., in that it uses a wandering tree to ensure file system updates are atomic.
Database and other information storage systems sometimes use a write-ahead log as a transaction or redo/undo log to support transactional, atomically durable write operations. In database systems and journaling file systems, the log generally acts as a supporting data structure to the primary persistent database storage. The bulk of the database is stored in the primary storage, typically with only the recent potentially uncommitted write traffic residing in the log. This means the transaction log is small in size and frequently kept within strict size bounds (e.g., by forcing flushing of data to the primary data structure when the log becomes too large). The database also typically supports a large set of different mutative operations, and multiple mutative operations may operate on complex overlapping regions of the database.
The complex nature in which mutative operations can interact within a database means that a chronologically forward replay oftentimes is the only simple strategy for log recovery in such a system. Because the log size is kept both bounded and small by continually flushing changes to the primary persistent data structure, the effect of taking multiple passes over the log, and replaying potentially redundant writes on the primary persistent data structure, oftentimes is minimal.
Unfortunately, however, when the size of the live log becomes quite large, the approaches used by database systems and journaling file systems do not work well. For example, a large live log (which could potentially reach multiple terabytes), coupled with the potential for a significant number of redundant log entries, implies that the overhead of not skipping redundant entries, and having to take a two-pass approach, could result in too much wasted effort.
Log-structured file systems are motivated to use a write-ahead log approach by the restrictions of the physical media on which they are usually used. For instance, write-once media (e.g., CD-R) cannot be written in place, and NAND/NOR-based flash media cannot be atomically written in place. For these log-structured file systems, there generally is no additional persistent data storage. Instead, the log is the system of record. The recovery approach here thus involves either rescanning a small portion of the log that represents the potentially uncommitted directory structure mutations, or rescanning the entire log in a forward direction from an arbitrary point in the log (e.g., the beginning of the medium) and building a transient index of the file system to enable efficient access. In the former case, the recovery process proceeds in a similar manner (and with similar requirements) to the approach used in database transaction logs. In the latter case, although a complete scan of the log is performed, the recovery does not read the entire dataset; only the metadata needed to rebuild the index is read.
Sometimes, however, recovery of the entire log is necessary and/or desirable. In such cases, requiring all of the data to be read into volatile memory may make it difficult or impossible to bear the overhead of reading the entire log (including any redundant records) in an effort to find only live data.
Thus, it will be appreciated that it would be desirable to improve upon existing write-ahead log techniques, e.g., for use in in-memory storage and large scale Big Data applications, where it may be necessary or desirable to use a log to persist data, with the only read traffic occurring during recovery while potentially providing restartability, keeping everything in memory, and/or minimizing persistence overhead. In other words, it would be desirable to improve upon current write-ahead log approaches used in relational databases and file systems, which are suboptimal when applied to in-memory store and Big Data scenarios.
As will be appreciated by those skilled in the art, most conventional write-ahead logs are used to support primary storage. Certain example embodiments involve a change to this paradigm, however, in the sense that the log may be the only persistent storage in the system and may be recovered to faster transient storage for runtime use.
In certain example embodiments, the use of a pure key-value schema for the stored data, and simplified set of mutative operations, leads to fewer restrictions on the potential set of recovery processes than conventional write-ahead log based systems. More particularly, using the key/value property allows the live set of data, once identified, to be applied in any arbitrary order, in certain example embodiments. Because recovery is targeted at locating the live set of data, optimizations for eliminating redundant reads from the log (and/or writes to the primary transient storage) advantageously may have a much greater effect on recovery time than they would in a more conventional write-ahead log scenario.
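The order independence noted above can be illustrated with a short sketch. It assumes that the live set has already been identified and contains at most one record per key; the function name apply_live_set and the seed parameter are hypothetical.

```python
import random


def apply_live_set(live_records, seed=None):
    """Load live (key, value) records into a dict in a shuffled order."""
    records = list(live_records)
    random.Random(seed).shuffle(records)  # deliberately scramble the order
    store = {}
    for key, value in records:
        store[key] = value
    return store
```

Because each key appears only once in the live set, every ordering yields the same final store, so a recovery process under these assumptions would be free to apply records in whatever order is cheapest to read.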
One aspect of certain example embodiments relates to a single pass, reverse chronological approach to write-ahead log recovery. This example approach may in certain instances allow for minimizing service downtime when availability is contingent on the completion of the recovery process.
Another aspect of certain example embodiments relates to recovering data from a transactional write-ahead log for use in in-memory storage and large scale Big Data applications.
Another aspect of certain example embodiments relates to approaches that enable the recovery of all stored data from large write-ahead logs to a transient storage medium, in a space- and time-efficient manner, e.g., as opposed to approaches that focus on either recovering a subset of the data or recovery from small data logs.
Still another aspect of certain example embodiments relates to building a system that deals with the recovery of live data from a very large write-ahead log in a simplified environment with a small closed set of mutative operations, which allows for the alternative approach of performing recovery backwards by scanning the log from the most recent written record backwards in time (and, in other words, finishing with the oldest record).
Yet another aspect of certain example embodiments relates to a reversal in the log scanning direction as compared to prior recovery approaches, which advantageously makes it possible to at least sometimes eliminate torn transactions, identify the most recent data (the live set), and/or avoid reading or replaying revoked and redundant data. FIG. 3 schematically demonstrates the single pass, reverse-chronological order approach of certain example embodiments.
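A minimal sketch of a single reverse pass is given below for illustration. The record shapes and the function name recover_backwards are hypothetical, and the sketch assumes that a transaction's commit record is always written after all of that transaction's actions; it is not intended to represent the implementation of any particular embodiment.

```python
def recover_backwards(log):
    """Single reverse pass over `log`; returns the recovered key/value map.

    Hypothetical record shapes, oldest first:
      ("put", txn, key, value), ("remove", txn, key), ("commit", txn)
    """
    committed = set()   # transactions whose commit record was already seen
    settled = set()     # keys whose newest committed state is already known
    store = {}
    for rec in reversed(log):
        kind, txn = rec[0], rec[1]
        if kind == "commit":
            committed.add(txn)
            continue
        if txn not in committed:
            continue            # torn transaction: its commit never landed
        key = rec[2]
        if key in settled:
            continue            # an older, superseded record: skip it
        settled.add(key)
        if kind == "put":
            store[key] = rec[3]
    return store
```

Scanning newest-first means the first committed record encountered for a key is, by construction, its live value, so older records for that key need not be replayed.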
In certain example embodiments, a recovery method for a computer system including a processor and a memory that has encountered a fault is provided. Actions taken by the computer system are loaded to the memory from a write-ahead log maintained on a non-transitory computer readable storage medium, the write-ahead log storing the actions in chronological order. The actions stored in the memory are run through a series of filters in order to identify irrelevant actions that do not need to be replayed in order to recover from the fault. Using the processor, the actions from the memory are replayed until the entire log is replayed in reverse-chronological order, notwithstanding the identified irrelevant actions that do not need to be replayed. The computer system is transitioned from a recovery state to a normal operation state, following the replaying.
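One possible way to arrange such a series of filters is sketched below, with each filter written as a generator that passes along only the actions still worth replaying. The filter names, the dict-based action layout, and the replay function are assumptions made for this example.

```python
def drop_uncommitted(actions):
    """Filter: hide actions whose transaction has no commit record."""
    committed = set()
    for action in actions:
        if action["op"] == "commit":
            committed.add(action["txn"])
        elif action["txn"] in committed:
            yield action


def drop_superseded(actions):
    """Filter: hide actions on keys already seen (i.e., newer state exists)."""
    seen = set()
    for action in actions:
        if action["key"] not in seen:
            seen.add(action["key"])
            yield action


def replay(log, filters=(drop_uncommitted, drop_superseded)):
    """Replay `log` newest-first through `filters` into a fresh dict."""
    stream = reversed(log)
    for f in filters:
        stream = f(stream)
    memory = {}
    for action in stream:
        if action["op"] == "put":
            memory[action["key"]] = action["value"]
    return memory
```

Chaining generators in this way keeps the recovery to a single pass over the log, since each action flows through every filter before the next action is read.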
In certain example embodiments, there is provided a non-transitory computer readable storage medium that tangibly stores instructions that are performable by a processor of a computer system that needs to be recovered as a result of a fault taking place. The instructions that are provided include instructions for loading actions taken by the computer system from a disk-backed log that stores the actions in chronological order to memory of the computer system, where the actions loaded from the log are mutative actions that occurred within a time period of interest defined as being between a predetermined time before the fault and the fault; running the actions stored in the memory through a series of filters in order to identify irrelevant actions that do not need to be replayed in order to recover from the fault; replaying, using the processor, the actions from the memory until the entire log for the time period of interest is replayed in reverse-chronological order, while ignoring the identified irrelevant actions that do not need to be replayed; and transitioning the computer system from a recovery state to a normal operation state, following the replay. There is no data dependency between actions recorded in the log and the log is maintained (and in some cases processed via skip chains) such that older actions cannot invalidate newer actions.
In certain example embodiments, a computer system operable in normal and recovery modes is provided. The computer system comprises a processor and a memory. A non-transitory computer readable storage medium tangibly stores a log that stores actions of preselected types taken by the computer system in chronological order. Recovery program logic is configured to operate in connection with the processor when the computer system is in recovery mode to load actions from the log into the memory and filter out irrelevant actions that do not need to be replayed. An object manager is configured to cooperate with the processor when the computer system is in recovery mode to restore objects in memory in reverse-chronological order by replaying the actions from the memory in reverse-chronological order. The processor is further configured to (a) place the computer system in recovery mode when a fault is detected and (b) transition the computer system from recovery mode to normal mode once the object manager has finished replaying all of the actions that occurred within a time period of interest leading up to the fault, except for the filtered out irrelevant actions.
These features, aspects, advantages, and example embodiments may be used separately and/or applied in various combinations and sub-combinations to achieve yet further embodiments of this invention.