In general, a computer disk holds blocks of data. The disk blocks are numbered, starting at 0 and sequentially proceeding up to the total number of blocks on the disk (less one). The disk blocks are addressed using Logical Block Numbers, or LBN's for short.
An application stores its data persistently on disk within named files. Each file consists of blocks of data, where each block is a sequence of consecutive bytes. Typically, the sequence of consecutive bytes is in the range of 512 bytes up to several kilobytes (fixed by the design of each disk). Blocks within a file are numbered from 1 up to the current maximum in the file. These blocks are addressed using Virtual Block Numbers, or so-called VBN's. Note that the term VBN only has meaning within a specified file. For example, two files can both contain a VBN3, but the content of VBN3 for the two files is totally unrelated. Moreover, these two VBN's are stored on the disk at completely different LBN's.
FIG. 1 is illustrative of the foregoing. Referring to FIG. 1, two small files labeled "spec.txt" and "myprog.exe" are shown. "spec.txt" contains VBN's 1 through 3, while "myprog.exe" contains VBN's 1 through 5. VBN1 of "spec.txt" is held in LBN8 of subject disk 100; VBN2 of "spec.txt" is held in LBN2, and so on. The layout on disk 100 of these two files is summarized in the table below:
spec.txt          myprog.exe
VBN   LBN         VBN   LBN
1     8           1     4
2     2           2     9
3     12          3     6
-     -           4     14
-     -           5     10
The component of software that remembers the mapping from file VBN's to the disk LBN is called the file system. The file system takes care of allocating, freeing and remembering which logical blocks are used by which virtual blocks within each file at any instant. The file system also remembers the names given by applications to each file and how these files are held together in a directory tree. The file system stores this metadata (e.g., the VBN to LBN mapping) on blocks of the disk that it keeps for its own use, but which applications do not normally access.
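By way of illustration only, the mapping that the file system remembers can be sketched as a small lookup table populated with the layout of FIG. 1. The names `vbn_to_lbn`, `resolve` and `used_lbns` are hypothetical and are not taken from any actual file system; a real file system holds this mapping in on-disk metadata blocks rather than in a dictionary.

```python
# Hypothetical in-memory view of the VBN -> LBN mapping for the two
# files of FIG. 1. All names are illustrative assumptions.
vbn_to_lbn = {
    "spec.txt":   {1: 8, 2: 2, 3: 12},
    "myprog.exe": {1: 4, 2: 9, 3: 6, 4: 14, 5: 10},
}


def resolve(filename, vbn):
    """Translate a (file, VBN) pair into the disk LBN that holds it."""
    try:
        return vbn_to_lbn[filename][vbn]
    except KeyError:
        raise LookupError(f"{filename} has no VBN {vbn}") from None


def used_lbns():
    """Return the set of LBNs currently allocated to file data."""
    return {lbn for blocks in vbn_to_lbn.values() for lbn in blocks.values()}
```

Both files contain a VBN3, yet `resolve("spec.txt", 3)` and `resolve("myprog.exe", 3)` name different LBN's, which is the point made above about a VBN having meaning only within a specified file.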
So an application conducts its operations in terms of reading and writing blocks of data to a file called by some convenient name, for example "[accounts.monthly]January.details". An application can choose to extend the file or to truncate the file; either action can change the number of virtual blocks in the file. The file system accepts those requests from the application, expressed as operations upon VBN's within named files, and ultimately turns them into reads and writes upon LBN's for data stored on the disk. The mechanical details of the reads and writes to LBN's on the disk are handled on behalf of the file system by another software component, specific to the disk, called the device driver.
The file system metadata blocks must be stored persistently in reserved memory, i.e., permanently across computer reboots within reserved areas of the disk.
One approach for maintaining persistent metadata is known as "A-B Logging". See "Transaction Processing: Concepts and Techniques" by Jim Gray and Andreas Reuter in Morgan Kaufmann Series in Data Management Systems. In A-B logging, the subject metadata is written to blocks in area A of the disk. A copy of those blocks is made in area B of the disk, where A and B are separate and distinct areas of the disk. The metadata stored in area B are used until a block failure or other failure occurs. In that event, area A serves as a backup copy of the metadata. This meets the requirements of 3. above. However, maintenance of such dual areas imposes a high overhead (against requirement 4. above).
Similarly, redo logging, undo logging or a mixture of both types of logging may be used to provide atomic updates to a number of disparate pieces of metadata. See C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh and P. Schwarz, "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging," ACM Transactions on Database Systems 17, No. 1 (March 1992). This solution involves writing the updates to a sequential log before updating the disk locations of the metadata. Once the operations have been written to the log, the operations are said to be committed, in other words, the metadata changes are stable and persistent. A background task is required to roll the changes written in the log to the disk location. While this can be done at leisure, additional I/O's are required, thus lowering the theoretical maximum throughput, particularly where most operations are a single block update.
In another approach called ABC Update, if metadata requires 100 blocks of space, then the disk management system allocates three times this amount at the start of the target disk. The three allocated areas are called the "A", "B" and "C" areas, and when the disk management system is transactionally consistent, the three areas contain identical copies of the 100 blocks of metadata. To provide persistence, the ABC update approach provides an AddRecord operation that write-locks some metadata block, F, which is known to hold some free slots, and modifies the in-memory copy of that block to reflect the fact that it selects one of these free slots for use. The ABC update approach then write-locks the metadata block, M, that will map the new user record, R (in effect, a mapping from the key field in record R to the slot where that record is actually stored on the disk). The in-memory copy of that block is then updated to hold this new mapping. The disk management system then writes out record R to the selected slot. Once that is complete, the disk management system writes out metadata blocks M and F to their locations in the "A" area. Once complete, the disk management system does the same to the "B" area. In turn, the disk management system does the same to the "C" area.
Note that each transaction always writes to A, B and C areas in that order. In total, using the ABC update approach, the disk management system performs one user write, followed by six metadata writes; however, those metadata writes are done as three pairs, where each pair is issued to the underlying device driver in parallel.
After all these writes have completed, the transaction releases its exclusive write lock on metadata blocks M and F and acknowledges successful addition of the user record back to the caller. Each block in each metadata area includes a headlet that describes a transaction of which it forms a part. The headlet is formed of a "TxNum" part and a "TxSize" part. The "TxNum" part is a 64-bit integer which uniquely identifies the transaction. And the "TxSize" part holds a value which indicates the total number of metadata blocks involved in the transaction. Note that writes to any one metadata area are issued in parallel--the disk is allowed to reorder parallel I/O's, and therefore metadata blocks M and F may be updated in either order.
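The write ordering and headlet described above can be sketched as follows. This is a minimal sketch under stated assumptions: the names `Headlet` and `add_record_writes` are hypothetical, disk I/O is not modeled, and each inner list simply stands for a batch of writes that may be issued in parallel, with the next batch issued only after the previous one completes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Headlet:
    tx_num: int   # the "TxNum" part (a 64-bit integer in the text)
    tx_size: int  # the "TxSize" part: metadata blocks in the transaction


def add_record_writes(tx_num, block_m, block_f):
    """Return the ordered write batches for one AddRecord transaction.

    Hypothetical model: each write is a (target, block, headlet) tuple.
    """
    headlet = Headlet(tx_num, tx_size=2)   # M and F: two metadata blocks
    batches = [[("user", "R", None)]]      # the user record is written first
    for area in ("A", "B", "C"):           # always A, then B, then C
        batches.append([(area, block_m, headlet), (area, block_f, headlet)])
    return batches


batches = add_record_writes(tx_num=42, block_m=4, block_f=8)
```

The result holds one user write followed by three pairs of metadata writes, six metadata writes in total, matching the count given above.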
Consider first the case where there are no disk errors--all blocks can be read by recovery. In this simplified case, all one needs is an "A" area and a "B" area; the "C" area is not required. Suppose the system just has ten blocks of metadata. Before a subject transaction, the on-disk state is, happily, transactionally consistent and shown below.
Block   1     2     3     4     5     6     7     8     9     10
A       27.2  39.1  37.2  27.2  35.1  37.2  41.3  41.3  38.1  41.3
B       27.2  39.1  37.2  27.2  35.1  37.2  41.3  41.3  38.1  41.3
C       27.2  39.1  37.2  27.2  35.1  37.2  41.3  41.3  38.1  41.3
This table shows the transaction headlet in each of the ten metadata blocks, for areas A, B and C. The contents of the three areas are identical. The two numbers in each box represent the transaction identifier "TxNum", followed, after the dot, by "TxSize" the number of metadata transaction blocks involved in the transaction. One can see six transactions, identified by TxNums 27, 35, 37, 38, 39 and 41. They are all complete. For example, transaction 27 affected two blocks, namely block numbers 1 and 4. The same holds true for the other (five) transactions.
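As an illustrative check of the reading given above, the headlets of one area can be grouped by transaction and each transaction tested for completeness, that is, whether the number of blocks found equals its "TxSize". The helper name `group_by_transaction` and the "TxNum.TxSize" string encoding are assumptions made for this sketch only.

```python
from collections import defaultdict

# Headlets of blocks 1 through 10 in one area, written as "TxNum.TxSize".
area_a = "27.2 39.1 37.2 27.2 35.1 37.2 41.3 41.3 38.1 41.3".split()


def group_by_transaction(headlets):
    """Map TxNum -> (TxSize, [block numbers]) for one metadata area."""
    groups = defaultdict(lambda: [0, []])
    for block_no, text in enumerate(headlets, start=1):
        tx_num, tx_size = (int(part) for part in text.split("."))
        groups[tx_num][0] = tx_size
        groups[tx_num][1].append(block_no)
    return {tx: (size, blocks) for tx, (size, blocks) in groups.items()}


transactions = group_by_transaction(area_a)
# A transaction is complete when all TxSize of its blocks are present.
complete = {tx for tx, (size, blocks) in transactions.items()
            if len(blocks) == size}
```

Here `transactions[27]` is `(2, [1, 4])`, and all six transactions appear in `complete`.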
Consider a new AddRecord transaction. It is allocated the next unused transaction identification number 42, and it updates blocks 4 and 8, for example. The transaction headlet in these blocks must therefore be 42.2.
Now suppose that the system fails just after writing block 4 to area A but before writing block 8. Recovery will then see the situation below.
Block   1     2     3     4     5     6     7     8     9     10
A       27.2  39.1  37.2  42.2  35.1  37.2  41.3  41.3  38.1  41.3
B       27.2  39.1  37.2  27.2  35.1  37.2  41.3  41.3  38.1  41.3
C       27.2  39.1  37.2  27.2  35.1  37.2  41.3  41.3  38.1  41.3
Recovery will repair this situation by starting with metadata block 1. The Recovery routine reads the version in the A area into that metadata block's in-memory buffer. It then reads the version in the B area and compares their transaction headlets. If the headlets agree, Recovery overwrites version A from memory onto version C, bringing A, B and C into harmony. Often this overwrite will be unnecessary; however, it is cheaper to overwrite the metadata block in the C area than it is to compare contents in order to avoid a redundant overwrite.
On the other hand, if versions A and B disagree, metadata block 1 will be added to a list of in-doubt transactions. Recovery repeats this process for each metadata block. As it makes additions to the in-doubt list, it checks whether the new addition completes a prior listed in-doubt transaction. If so, the Recovery routine writes the A version of all metadata blocks involved in that transaction to areas B and C, then removes it from the in-doubt list.
Below is an example of the in-doubt list. Here transactions 22, 25 and 27 are currently in doubt. The corresponding "TxSize" for transaction 22 indicates three metadata blocks involved in the transaction. The Recovery routine is required to find all three partners of transaction 22 before it is no longer in-doubt, but so far only two partners have been found (namely, metadata blocks 4 and 7), as illustrated in the table below. Similarly, for each of transactions 25 and 27, only one of the two partners has been found.
TxNum   TxSize   Metadata Blocks
22      3        4, 7
25      2        6
27      2        9
Having read all the metadata blocks, if there are any transactions still left in-doubt, then the Recovery routine rolls back those transactions by copying version B of each metadata block involved into both area A on disk and area C on disk, and into the in-memory buffer. The last action of Recovery is to decide the next TxNum to be issued, as simply one higher than the highest TxNum it encountered during Recovery. It also sets an indication of the last transaction on (i.e., committed to) disk as TxHard = TxNum - 1. That completes ABC Recovery with no disk failures.
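The recovery pass without disk errors can be sketched as follows. This is a simplified model under stated assumptions, not the actual Recovery routine: each block is represented only by its headlet, a (TxNum, TxSize) tuple; the in-doubt list is keyed by the TxNum found in the A version; and "highest TxNum encountered" is taken to mean the highest value read from areas A and B, including any transaction later rolled back. The names `parse` and `recover` are hypothetical.

```python
def parse(row):
    """Turn '27.2 39.1 ...' into a list of (TxNum, TxSize) tuples."""
    return [tuple(int(p) for p in h.split(".")) for h in row.split()]


def recover(areas):
    """Repair areas A, B, C in place; return (next_tx_num, tx_hard)."""
    a, b, c = areas["A"], areas["B"], areas["C"]
    in_doubt = {}   # TxNum -> block indexes found so far
    highest = 0
    for i in range(len(a)):
        highest = max(highest, a[i][0], b[i][0])
        if a[i] == b[i]:
            c[i] = a[i]            # headlets agree: overwrite C from A
            continue
        tx_num, tx_size = a[i]
        found = in_doubt.setdefault(tx_num, [])
        found.append(i)
        if len(found) == tx_size:  # all partners found: roll forward
            for j in found:
                b[j] = c[j] = a[j]
            del in_doubt[tx_num]
    for found in in_doubt.values():  # still in doubt: roll back from B
        for j in found:
            a[j] = c[j] = b[j]
    next_tx_num = highest + 1
    return next_tx_num, next_tx_num - 1


old = "27.2 39.1 37.2 27.2 35.1 37.2 41.3 41.3 38.1 41.3"
crashed = "27.2 39.1 37.2 42.2 35.1 37.2 41.3 41.3 38.1 41.3"
areas = {"A": parse(crashed), "B": parse(old), "C": parse(old)}
next_tx_num, tx_hard = recover(areas)
```

In this run only one of the two blocks of transaction 42 is found in area A, so block 4 is rolled back to 27.2 and the three areas again agree. Had block 8 also reached area A before the failure, the same loop would instead roll transaction 42 forward into areas B and C.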
The C area is needed as well as the A and B areas to enable recovery from a single block write failure in the presence of disk failures, such as spontaneous decay of a metadata block. That is, if the write of any metadata block to area A were to fail midstream due to, say, a system crash, and leave garbage in that block, then after rebooting, the Recovery routine as mentioned above falls back to the old contents of that metadata block held in the B area. If, in addition, the companion metadata block in the B area has spontaneously decayed, then the Recovery routine falls back upon the third copy (in the C area). Thus, ABC recovery in the presence of disk errors considers three possible states of each metadata block:
(i) Old--meaning that the transaction did not get so far as updating the block with the data for the subject transaction, so the block still holds the previous data.
(ii) New--meaning that the transaction successfully updated the block with the data for the subject transaction.
(iii) Parity--meaning that the Recovery routine obtains a parity error when attempting to read the block. This may be caused either by an incomplete write operation or by spontaneous decay of that disk block since it was last successfully written.

(i) Keeping two copies of the metadata. The metadata is stored in two areas called the A area and the B area. The areas are updated in turn one at a time. Initially a metadata update is made to the A area. The next update to the same piece of metadata is made to the B area. This means that at any one time one area contains the last metadata that was written (in other words the current version) and the other area contains the previous version of the data. A transaction may consist of multiple updates to different blocks. Each block is written in such a way that the previous version of the corresponding block is preserved in the other area for that block. If the whole of the transaction is not completed, the old version of the metadata (for each block) that has been updated is still on the disk and is able to be used (in a rollforward or rollback) to reconstruct the full, committed state of the metadata.
(ii) Accounting of transaction. The various blocks that are updated as a single transaction are marked as belonging to the same transaction. By scanning the metadata, it is possible to determine those transactions that have completed and those which need to be rolled back. Each block of metadata is stamped with a transaction number and a part count when it is written to the disk. For example, a block is stamped with a transaction number, say for example 10, and a part number, for example part 1 of 3. Only if all parts of in-doubt transactions are found is the transaction said to be committed. As blocks are overwritten, the old transaction number and part is lost, therefore if block A contains transaction 10, part 1 of 2, and it is overwritten with another transaction, the whole transaction 10 is no longer on disk. Therefore, the transaction accounting method has the concept of TxHard to describe those transactions that are in-doubt.
(iii) Duplicating each block of metadata. Each individual block in the A area and B area is duplicated as a doublet. In other words, there are four copies of each metadata block on the disk. Rather than write a single block of metadata, the present invention writes a doublet. This is a two-block, single I/O, where the two blocks contain duplicate copies of the same metadata block. By writing a doublet, two copies of the current block are written to disk as a single I/O. This allows the metadata to survive a single block failure since such a failure only affects half of a doublet.
The Recovery routine, as mentioned earlier, looks across all the metadata blocks that comprise a transaction in order to decide whether to roll forward or roll back any part-completed transaction. Further, the Recovery routine determines whether a metadata block contains the old or the new version of its data by comparing the transaction headlets of its copies. Recovery of a transaction is then as follows.
If the Recovery routine can read all three versions in areas A, B and C of the metadata block, then the Recovery routine applies the newest version of the data across the A, B and C areas of the block. Thus, if a write of the metadata block to area A was underway at the time the system failed, then upon reboot the recovery routine rolls the data forward by writing the A area onto B and C areas of the metadata block. If the metadata block was being written to area B at the time the system failed, then the Recovery routine rolls forward by writing A (or B) onto C area of the metadata block.
If the Recovery routine can only read two versions of the metadata block, and one version is newer than the other, then the recovery routine applies the newer of the read versions to the other two versions thus effecting a roll forward.
If the Recovery routine can read only two versions of the metadata block and the two versions have the same transaction number, then the Recovery routine applies either one to the third version. The two good versions may be both in the old state or both in the new state of the metadata, and the Recovery routine cannot differentiate these two cases. However, the Recovery routine accomplishes a roll forward throughout the three versions of the metadata block or a roll back through the three versions of the metadata block. In either case, the metadata is made to be transactionally consistent.
If the Recovery routine can read only one version of the metadata block, then the routine applies that version to the other two versions. This effects either a roll back of all three versions of the metadata block or a roll forward in certain circumstances. In either case, the resulting metadata is transactionally consistent.
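The four cases above reduce, for a single metadata block, to applying the newest readable version to all three areas. The following minimal sketch illustrates that per-block rule only; it does not model the transaction-wide decision across blocks. It assumes that a larger TxNum means a newer version and that a parity error is reported as `None`; the name `repair_block` is hypothetical, and the case where no copy is readable is not addressed in the text and is simply raised as an error.

```python
def repair_block(versions):
    """Pick the version to apply across areas A, B and C of one block.

    versions maps area name -> TxNum, or None where the read gave a
    parity error. Returns (repaired versions, source area).
    """
    readable = {area: tx for area, tx in versions.items() if tx is not None}
    if not readable:
        # Not covered by the cases described above.
        raise RuntimeError("no readable copy of this metadata block")
    # Newest readable version wins; on a tie, the first area in
    # alphabetical order is used, since either copy will do.
    source = max(sorted(readable), key=lambda area: readable[area])
    return {area: readable[source] for area in versions}, source
```

For example, `repair_block({"A": 42, "B": 27, "C": 27})` rolls forward from area A, while `repair_block({"A": None, "B": 27, "C": 27})` applies the old version from area B to all three areas.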
Note that the Recovery routine may be interrupted at any point. If, for example, the system fails part way through Recovery, when it reboots, Recovery is able to safely follow the same routine, upon whatever it finds on disk. In other words, Recovery is an idempotent operation.
Although the ABC update approach for persistence is resilient to single block spontaneous decay and single block write failure, and recovery is idempotent, the cost of this robustness is a drop in performance: each update to even a single metadata block requires that the system perform three serial writes (i.e., the next write cannot be issued until the previous one has completed).
Thus there is a need in the art for improved maintenance (persistence and recovery) of disk metadata.