In a non-journaling file system, blocks are transferred directly from computer memory to the file system volume. A system crash during the transfer leaves only a partial copy of the blocks on the volume, which often causes corruption. By using a journal, it is possible to ensure that the following hold true: 1) transfers from computer memory to the journal happen in an all-or-nothing fashion (using a sequentially-written journal makes this easy) and 2) transfers from the journal to the file system volume happen in an all-or-nothing fashion. If a transfer from the journal to the file system volume is interrupted by a system crash, the blocks remain in the journal and are guaranteed to eventually make it to the file system volume. Likewise, if a transaction's blocks are not completely written to the journal, they will never make it to the file system volume.
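The two guarantees can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the mechanism of any particular file system: the volume is modeled as a Python dictionary, the journal as an append-only list, and a trailing commit marker is assumed as the device that makes a journal write all-or-nothing. The names `Crash`, `write_blocks`, `journal_commit` and `replay` are hypothetical.

```python
class Crash(Exception):
    """Simulated system crash (hypothetical, for illustration only)."""


def write_blocks(target, blocks, crash_after=None):
    """Copy blocks to target one at a time; crash once `crash_after` are written."""
    for count, (block_no, data) in enumerate(blocks.items()):
        if count == crash_after:
            raise Crash()
        target[block_no] = data


def journal_commit(journal, blocks, crash_after=None):
    """Append blocks sequentially, then a commit marker that validates them."""
    for count, (block_no, data) in enumerate(blocks.items()):
        if count == crash_after:
            raise Crash()
        journal.append(("block", block_no, data))
    journal.append(("commit",))  # assumed marker: the transaction is now durable


def replay(journal, volume):
    """Copy every committed block to its home location; safe to repeat."""
    pending = []
    for entry in journal:
        if entry[0] == "block":
            pending.append(entry)
        else:  # commit marker: everything pending is valid
            for _, block_no, data in pending:
                volume[block_no] = data
            pending = []
    # Anything still in `pending` was never committed and is ignored.


old = {0: "A0", 1: "B0", 2: "C0"}
new = {0: "A1", 1: "B1", 2: "C1"}

# Case 1: direct transfer interrupted after one block (no journal).
volume = dict(old)
try:
    write_blocks(volume, new, crash_after=1)
except Crash:
    pass
direct_result = dict(volume)  # mixture of old and new blocks

# Case 2: crash while the journal itself is being written.
volume, journal = dict(old), []
try:
    journal_commit(journal, new, crash_after=2)
except Crash:
    pass
replay(journal, volume)
uncommitted_result = dict(volume)  # nothing reached the volume

# Case 3: committed to the journal, crash during transfer to the volume.
volume, journal = dict(old), []
journal_commit(journal, new)
try:
    write_blocks(volume, new, crash_after=1)
except Crash:
    pass
replay(journal, volume)
committed_result = dict(volume)  # replay completes the transfer
```

In this model the direct transfer leaves the volume with one new block and two old ones, while the journaled paths end with either all of the old blocks or all of the new blocks.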
The prior art provides a typical approach in which a new region of on-disk storage for a journal (or log) is reserved to track changes to file system metadata, generally by storing the new version of changed blocks that contain metadata information. Periodically, the changed data is copied from the reserved area back to its desired home location. The home location is on the device that holds the file system. The reserved area for tracking changes may be on this same device or may be on another device. For performance reasons, many journaling file systems still use a separate (i.e., different) device for the journal. The transfer of changed data occurs only under carefully orchestrated conditions. In particular, the transfer is coordinated so that any possible interruption (e.g., system crash, power loss) is guaranteed to leave the file system in either a fully consistent state or in a state that requires only a fast scan and replay of the journal to restore full consistency. From an implementation standpoint, the typical approach is to simply modify key elements of the file system software to understand and manage the new journaling/logging mechanism. However, this approach suffers from serious risks, since the file system code itself must be extremely robust and reliable, and the required changes may impact the core data management code paths of the file system. Furthermore, since the blocks containing changed metadata must be moved from one disk location to another, a fair portion of the I/O system's bandwidth may be consumed due to these transfers.
A typical implementation of a journaling file system is to maintain, at all times, a single transaction that accumulates information about changes to file system metadata. Any file system structural changes that occur while this transaction is active are logged to the journal, which is generally maintained in volatile host system memory (i.e., DRAM). Periodically, this transaction is committed by allowing all application-level file system activity to complete, and then forcing the journal entries from memory to the journal area of physical media. A new transaction is started immediately to track further changes. Eventually, when the journal area fills up, the oldest transaction in the journal will have its data blocks transferred to their home locations on the media, thus freeing up journal space for new transactions to use. In the event of a system crash or other interruption, the file system recovery code need only scan the journal to find the last committed transaction. Any changes that were pending in the transaction that was active at the time of the crash can be ignored, since they had not made any changes to the actual file system's metadata blocks. Changes that were committed to the journal prior to the transaction that was active at the time of the crash will still be in the journal afterwards; their changed blocks will be migrated to the associated home locations on the primary volume over time, just as they would have if no interruption occurred.
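The recovery scan described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical journal layout in which each transaction is a run of records delimited by "begin" and "commit" markers; real journal formats differ and typically include checksums and sequence numbers. The function names `scan_journal` and `recover` are illustrative only.

```python
def scan_journal(records):
    """Return committed transactions, oldest first, as (txid, {block_no: data})."""
    committed = []
    open_txid, open_blocks = None, {}
    for record in records:
        kind = record[0]
        if kind == "begin":
            open_txid, open_blocks = record[1], {}
        elif kind == "block" and record[1] == open_txid:
            open_blocks[record[2]] = record[3]
        elif kind == "commit" and record[1] == open_txid:
            committed.append((open_txid, open_blocks))
            open_txid, open_blocks = None, {}
    # A transaction with no commit record was active at the crash: dropped.
    return committed


def recover(records, volume):
    """Migrate committed blocks to their home locations; return last committed txid."""
    committed = scan_journal(records)
    for _, blocks in committed:
        volume.update(blocks)
    return committed[-1][0] if committed else None


# Journal contents found on media after a crash (hypothetical data).
records = [
    ("begin", 1), ("block", 1, 10, "inode v1"), ("commit", 1),
    ("begin", 2), ("block", 2, 11, "bitmap v2"), ("commit", 2),
    ("begin", 3), ("block", 3, 10, "inode v3"),  # crash before commit
]
volume = {10: "inode v0", 11: "bitmap v0"}
last_txid = recover(records, volume)
```

Here transactions 1 and 2 are applied and transaction 3 is ignored, so block 10 holds the version written by transaction 1 rather than the uncommitted one.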
FIG. 1 illustrates a logical block diagram of a known journaling file system. File system implementations that use journaling techniques reduce the amount of time needed to recover from system crashes and return file system metadata to a fully consistent state. Journaling file systems keep track of changes to a file, specifically, those changes that modify the file's inode. Journaling achieves fast file system recovery because, preferably, all data that is potentially inconsistent with the file system volume is recorded in the journal at all times. Thus, file system recovery may be achieved by scanning the journal and copying back all committed data into the main file system area. A central concept when considering a journaled file system is the transaction, corresponding to a batch of updates of the file system. This batch includes updates of both data and metadata blocks within the file system. A journal block contains the entire contents of a single block from the file system as updated by a transaction. This means that however small a change is made to a file system block, the entire block has to be written to the journal.
In FIG. 2, block changes to the file system accumulate in memory until a decision is made to commit them to stable media (i.e., disk) (step 1). The decision may be based on the elapsed time since the last commit or other criteria. The in-memory batch of blocks constitutes a transaction.
To commit, all that is needed is to write the accumulated blocks along with some “tracking information” to the journal volume (step 2). There may still be prior transactions' data blocks in the journal, so the new data is always appended to it.
When the journal fills up, it is emptied by copying each journaled block to its true location on the real file system volume (step 3). Once this is done, the journal is empty and can be refilled from the top.
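Steps 1 through 3 can be sketched as a small model. This is an illustration under stated assumptions rather than a description of any actual implementation: the journal capacity is counted in blocks, a single transaction is assumed to fit within the journal, and the "tracking information" is omitted. The class `JournaledVolume` and its method names are hypothetical.

```python
class JournaledVolume:
    """Toy model of the accumulate / commit / empty cycle (hypothetical)."""

    def __init__(self, journal_capacity):
        self.volume = {}            # home locations on the file system volume
        self.journal = []           # append-only list of (block_no, data)
        self.capacity = journal_capacity
        self.pending = {}           # in-memory transaction

    def write(self, block_no, data):
        """Step 1: accumulate a changed block in memory."""
        self.pending[block_no] = data

    def commit(self):
        """Step 2: append the accumulated blocks to the journal."""
        if len(self.journal) + len(self.pending) > self.capacity:
            self.empty_journal()    # no room left: run step 3 first
        self.journal.extend(self.pending.items())
        self.pending = {}

    def empty_journal(self):
        """Step 3: copy each journaled block home, then reuse the journal."""
        for block_no, data in self.journal:
            self.volume[block_no] = data   # later entries overwrite earlier ones
        self.journal.clear()


fs = JournaledVolume(journal_capacity=4)
fs.write(1, "a")
fs.write(2, "b")
fs.commit()                         # journal holds 2 blocks
fs.write(2, "b2")
fs.write(3, "c")
fs.commit()                         # journal holds 4 blocks and is now full
volume_before = dict(fs.volume)     # nothing has reached the volume yet
fs.write(4, "d")
fs.commit()                         # journal is emptied, then refilled
```

Because the journal is replayed in append order, the newer copy of block 2 is the one that ends up at its home location.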
The typical implementation of a journaling file system presents various problems.
One problem with the prior art system is that adding journaling or logging capabilities to an existing file system can be difficult and costly.
Another problem with the prior art system is that making changes to the file system code to implement a journaling system is risky, since flaws may manifest themselves as data integrity and availability problems. Furthermore, software-based implementations suffer from performance penalties associated with excessive data movement between the journal/log area of the disk and the final place of residence for the data.