Many client applications and operating system programs use a transactional model to ensure the consistency of a dataset in data storage. Changes to the dataset are captured in transactions. Each transaction is performed in such a way that, in the event of a system failure, it is possible to complete all of the changes of the transaction so that the dataset is restored to a consistent state.
For example, a single transaction in an accounting application transfers a certain amount of money from a first account to a second account. This transaction debits the first account by the certain amount and credits the second account by the same amount. If a system failure occurs during the transfer, the dataset of the accounts can be left in an inconsistent state in which the accounts do not balance because the sum of the money in the two accounts has changed by the certain amount. In this case, the transactional model permits a recovery program to restore the dataset to a consistent state upon reboot of the system after the system failure.
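The transfer example above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the function and variable names are invented for this sketch, not taken from any actual accounting application): both changes of the transfer are captured in one transaction record, so that a recovery step can reapply the debit and the credit together rather than leaving only one of them applied after a crash.

```python
# Hypothetical sketch: a transfer captured as a single transaction,
# expressed as a list of (account, delta) changes.

def make_transfer_txn(src, dst, amount):
    """Capture both changes of the transfer in one transaction record."""
    return [(src, -amount), (dst, +amount)]

def apply_txn(accounts, txn):
    """Apply every change of the transaction to the dataset."""
    for account, delta in txn:
        accounts[account] += delta

accounts = {"first": 100, "second": 50}
txn = make_transfer_txn("first", "second", 30)
apply_txn(accounts, txn)
# After the transfer, the sum of the two accounts is unchanged (150),
# which is the consistency property the transactional model preserves.
```

If a failure occurred between the debit and the credit, the dataset would be left with the sum changed by the transfer amount; treating the two changes as one transaction is what makes recovery to a balanced state possible.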
Operating system programs such as file systems and database managers typically use the transactional model to restore a file system or a database to a consistent state upon reboot of a data processor after a system failure. In the case of a server, transaction logging is one widely used way of applying the transactional model. Transaction logging involves writing a record for each transaction to a transaction log in data storage before writing the changes of the transaction to the dataset in data storage, so that the transaction log can be used to restore the dataset to a consistent state after a system failure.
For example, a client application sends a transaction request to an operating system program. The operating system program responds by writing a corresponding transaction record to the transaction log, then returning an acknowledgement of completion of the transaction to the client application, and then beginning the task of writing the changes of the transaction to the dataset in storage. In this fashion, the transaction log permits processing of the next transaction to begin before the changes of the previous transaction have been written to the dataset. Latency of responding to the transaction request is reduced because the transaction record can be written to the transaction log faster than the corresponding changes can be written to the dataset in data storage.
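The log-then-acknowledge ordering described above can be sketched as follows. All class and method names here are illustrative inventions for this sketch, not an actual file-server API: the transaction record is appended to the log and acknowledged before the slower task of updating the dataset begins.

```python
# Sketch (hypothetical names) of write-ahead transaction logging:
# the record reaches the log and the client is acknowledged before
# the dataset itself is updated.

import json

class LoggingStore:
    def __init__(self):
        self.log = []        # stands in for the on-disk transaction log
        self.dataset = {}    # stands in for the dataset in data storage
        self.pending = []    # acknowledged transactions not yet applied

    def submit(self, txn_id, changes):
        # 1. Write the transaction record to the log (fast append).
        self.log.append(json.dumps({"id": txn_id, "changes": changes}))
        # 2. Acknowledge completion to the client immediately.
        self.pending.append((txn_id, changes))
        return "ack %d" % txn_id

    def flush_one(self):
        # 3. Later, write the changes of one transaction to the dataset.
        txn_id, changes = self.pending.pop(0)
        self.dataset.update(changes)

store = LoggingStore()
store.submit(1, {"a": 10})
store.submit(2, {"b": 20})   # accepted before txn 1 reaches the dataset
store.flush_one()            # only now do txn 1's changes land
```

The key point the sketch shows is that `submit` returns (and the next transaction can be accepted) while earlier transactions are still only in the log, which is where the latency reduction comes from.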
Upon reboot of the data processor after a system failure, the transaction log may include many records of transactions that were not yet completed at the time of the failure. In this case, a recovery program replays all of these not-yet-completed transactions so that all of their changes are applied to the dataset. In this fashion, the dataset is restored to the consistent state requested by the last transaction request that was acknowledged as completed. Further details of the logging and replay process are described in Uresh Vahalia et al., Metadata Logging in an NFS Server, USENIX 1995, Jan. 16-20, 1995, New Orleans, La., 12 pages, the USENIX Association, Berkeley, Calif.
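The replay step can be sketched as follows, assuming the same illustrative JSON record format as in the earlier logging sketch (again, invented names, not an actual recovery program): each logged, not-yet-completed transaction is reapplied to the dataset in log order.

```python
# Sketch of log replay after a failure: every not-yet-completed
# transaction record is reapplied to the dataset in log order,
# restoring the state of the last acknowledged transaction.

import json

def replay(log_records, dataset):
    """Reapply each logged transaction's changes, oldest first."""
    for record in log_records:
        txn = json.loads(record)
        dataset.update(txn["changes"])
    return dataset

# Two transactions that were acknowledged but whose changes never
# reached the dataset before the failure:
log = [json.dumps({"id": 1, "changes": {"a": 10}}),
       json.dumps({"id": 2, "changes": {"a": 15, "b": 20}})]
recovered = replay(log, {})
# "a" ends at 15 because replay in log order lets later
# transactions overwrite earlier ones, as they would have in
# normal operation.
```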
Presently, a typical file server has a data processor including multiple central processing unit (CPU) cores sharing a high-speed data cache. Such a data processor is capable of concurrent multi-threaded data processing in which portions of multiple program code threads are executed simultaneously by different CPU cores. To speed up the replay of a file system transaction log in such a file server, multi-threaded replay has been performed upon segments of the active portion of the transaction log.
For concurrent multi-threaded replay of a transaction log from an EMC Corporation brand of Common Block File System (CBFS), EMC Corporation has used the following procedure in its file servers. First, the head and the tail of the transaction log are located. The head is the oldest not-yet-completed transaction in the log, and the tail is the newest not-yet-completed transaction in the log. Second, the portion of the log between the head and the tail is read into memory. This is the active portion of the log, containing the not-yet-completed transactions to be replayed. Third, the transactions in each log segment are sorted according to their file system block numbers. Each log segment is a 64-kilobyte region of contiguous storage locations in the log. Each transaction modifies one or more file system blocks, so each transaction from the transaction log has one respective record in memory for each of the file system blocks modified by the transaction. Therefore the sorting of the transactions in each segment by file system block number creates, for each segment of the log, a group of lists of modifications upon file system blocks. Each list is a list of the modifications upon a particular file system block made by one or more transactions in the segment. Fourth, the sorted transactions are processed segment-by-segment by multiple threads. For each segment, and for each file system block that is modified by any transaction in the segment, a particular thread is assigned the task of replaying each and every modification upon that file system block. The thread reads the file system block to obtain a data block, modifies the data block with each and every modification from the respective list of transaction records for the file system block, and then writes the modified data block back to the file system block.
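The third and fourth steps above can be sketched in Python. This is a hedged illustration of the technique, not the actual CBFS implementation: the record format, the use of a dictionary for block storage, and the thread-pool dispatch are all assumptions made for the sketch. Records in a segment are grouped by file system block number, and one thread then replays every modification to a given block in log order, so a single read-modify-write per block suffices.

```python
# Illustrative sketch of sorting a segment's transaction records by
# file system block number and replaying each block's modifications
# on its own thread (data structures invented for this sketch).

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_by_block(segment_records):
    """Sort a segment's records into one list per modified block.
    Python's sort is stable, so each block's modifications keep
    their original (log) order within the list."""
    per_block = defaultdict(list)
    for rec in sorted(segment_records, key=lambda r: r["block"]):
        per_block[rec["block"]].append(rec)
    return per_block

def replay_block(storage, block, records):
    data = storage.get(block, 0)      # read the file system block once
    for rec in records:               # apply each modification in order
        data = rec["modify"](data)
    storage[block] = data             # write the modified block back once

def replay_segment(storage, segment_records):
    per_block = group_by_block(segment_records)
    # Each block is handled by exactly one task, so no two threads
    # ever write the same block.
    with ThreadPoolExecutor() as pool:
        for block, records in per_block.items():
            pool.submit(replay_block, storage, block, records)

storage = {7: 100, 9: 5}
segment = [{"block": 9, "modify": lambda d: d + 1},
           {"block": 7, "modify": lambda d: d * 2},
           {"block": 7, "modify": lambda d: d + 3}]
replay_segment(storage, segment)
# Block 7 receives both of its modifications in log order (100*2+3),
# block 9 receives its single modification (5+1).
```

The design point the sketch captures is that grouping by block both parallelizes the work safely (one writer per block) and reduces I/O (one read and one write per block, regardless of how many transactions touched it).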
Once this has been done for all of the transaction records in all of the segments, the replay is complete, and in a final step the recovered file system is mounted for client access.