The present invention relates generally to logging in a transaction processing system, and more particularly relates to a transactional log structure.
In many information processing applications, a server application running on a host or server computer in a distributed network provides processing services or methods for client applications running on terminal or workstation computers of the network which are operated by a multitude of users. Common examples of such server applications include software for processing class registrations at a university, travel reservations, money transfers and other services at a bank, and sales at a business. In these examples, the processing services provided by the server application typically maintain persistent data or "state" of class schedules, hotel reservations, account balances, order shipments, payments, or inventory for actions initiated by the individual users at their respective stations, such as in a database or other proprietary format data store.
Often, server applications require coordinating processing activities of multiple separate programs (which possibly reside on different computers or in separate processes) that may modify or otherwise affect separately stored persistent data, such as database records on different computers or in separate database tables. For example, a money transfer operation in a banking application may involve updates to account information held in separate databases that reside on separate computers that may be geographically remote. Desirably, groups of these processing activities that form parts of an operation are coordinated so as to take effect as a single indivisible unit of work, commonly referred to as a transaction. In many applications, performing sets of activities as a transaction becomes a business necessity. For example, if only one account is updated in a money transfer operation due to a system failure, the bank in effect creates or destroys money.
A transaction is a collection of actions that conform to a set of properties (referred to as the "ACID" properties) which include atomicity, consistency, isolation, and durability. Atomicity means that all activities in a transaction either take effect together as a unit, or all fail. Consistency means that after a transaction executes, the system is left in a stable or correct state (i.e., if giving effect to the activities in a transaction would not result in a correct stable state, the system is returned to its initial pre-transaction state). Isolation means the transaction is not affected by any other concurrently executing transactions (accesses by transactions to shared resources are serialized, and changes to shared resources are not visible outside the transaction until the transaction completes). Durability means that the effects of a transaction are permanent and survive system failures. For additional background information on transaction processing, see, inter alia, Jim Gray and Andreas Reuter, Transaction Processing Concepts and Techniques, Morgan Kaufmann, 1993; and Philip Bernstein and Eric Newcomer, Principles of Transaction Processing, Morgan Kaufmann, 1997.
The durability property is important in transaction processing applications because each transaction is usually providing a service that amounts to a contract between its users and the enterprise that is providing the service. For example, if a user is moving money from one account to another, once the user gets a reply from the transaction processing system that the transaction executed, the user really expects that the result is permanent. This expectation may amount to a legal agreement between the user and the enterprise that the money has been moved between the user's accounts. It is therefore essential that the transaction processing system stores the updates to account data on some non-volatile data storage device, typically a hard disk drive, to ensure that the updates from a completed transaction cannot be lost.
Transaction processing systems typically obtain the durability property via a log-based recovery mechanism, which starts with the transaction processing system writing a copy of all the transaction's updates of a durable resource (e.g., a database) into a log file while the transaction is executing. On a request to commit the transaction, the transaction processing system first ensures that all records written to the log file have been transferred to the hard disk (and are not merely held in volatile cache memory), and only then decides to commit the transaction. The updates to the database can then be written out to disk at any time after the decision to commit. If the system fails after the transaction commits and before the updates are made to the database, the updates can still be made during a recovery process using the persisted log records. During recovery, the system rereads the log and checks that each update by a committed transaction actually was made to the database. If not, the system applies the update to the database. When recovery is complete, the results of all committed transactions will be reflected in the database and the transaction system can resume normal operation.
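The log-based recovery sequence described above can be illustrated with a minimal in-memory sketch. It is an illustration under stated assumptions, not an implementation of any particular system: the class `TinyTxSystem`, its attribute names, and the tuple record format are all hypothetical, and "disk" is modeled simply as the state that survives a call to `crash()`.

```python
class TinyTxSystem:
    """Toy model (hypothetical): disk_log and database survive a crash; the buffers do not."""

    def __init__(self):
        self.database = {}       # durable resource as it exists on disk
        self.disk_log = []       # log records already forced to disk
        self.log_buffer = []     # log records still in volatile memory
        self.pending = {}        # database updates not yet written out to disk

    def update(self, txid, key, value):
        # While the transaction executes, a copy of each update goes into the log.
        self.log_buffer.append(("update", txid, key, value))
        self.pending[key] = value

    def commit(self, txid):
        # First force every buffered log record to disk ...
        self.disk_log.extend(self.log_buffer)
        self.log_buffer.clear()
        # ... and only then record the decision to commit.
        self.disk_log.append(("commit", txid))

    def crash(self):
        # A system failure loses all volatile state.
        self.log_buffer.clear()
        self.pending.clear()

    def recover(self):
        # Reread the log and make sure every committed update is in the database.
        committed = {rec[1] for rec in self.disk_log if rec[0] == "commit"}
        for rec in self.disk_log:
            if rec[0] == "update" and rec[1] in committed:
                _, _, key, value = rec
                if self.database.get(key) != value:
                    self.database[key] = value
```

In this sketch a transaction that committed before the failure has its updates reapplied by `recover()`, while a transaction whose log records never left the volatile buffer leaves no trace.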
In general, the log is kept as a sequential file on disk, and contains a sequence of records describing updates to a durable resource (e.g., database). These records must contain sufficient information (such as a before-image, an after-image, and a pointer to the portion of the database affected by the update) to correctly recover the state of the durable resource, including all committed transactions, in the event of a failure. This part of a log record is termed the log record's "body," and typically is provided by a component or subsystem of the transaction processing system referred to as a "resource manager." During recovery, the information in the log record body is read back (such as by a recovery manager) to the resource manager. The resource manager uses the after-image to effect updates for transactions that committed before the failure, and uses the before-image to reverse updates for transactions that aborted or were not committed before the failure.
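The use of before-images and after-images during recovery can be sketched as follows. The record type `UpdateRecord`, its field names, and the functions `redo`, `undo`, and `recover` are hypothetical names chosen for illustration. The sketch also assumes that isolation prevented two concurrent transactions from updating the same key, so that redoing committed updates in log order and then undoing uncommitted updates in reverse order yields the correct state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpdateRecord:
    """Hypothetical log record body for one update."""
    txid: str
    key: str                  # stands in for the pointer to the affected data
    before: Optional[int]     # before-image (None if the key did not exist)
    after: Optional[int]      # after-image (None if the update deleted the key)


def _install(db, key, image):
    # Put the given image in place, treating None as "key absent".
    if image is None:
        db.pop(key, None)
    else:
        db[key] = image


def redo(db, rec):
    _install(db, rec.key, rec.after)      # effect a committed update


def undo(db, rec):
    _install(db, rec.key, rec.before)     # reverse an uncommitted update


def recover(db, update_records, committed_txids):
    # Redo committed transactions in log order.
    for rec in update_records:
        if rec.txid in committed_txids:
            redo(db, rec)
    # Undo everything else in reverse log order.
    for rec in reversed(update_records):
        if rec.txid not in committed_txids:
            undo(db, rec)
```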
In addition to log records for each update in a transaction, the log also contains log records that report when a transaction commits or aborts. Typically, the body of these log records may just contain the identifier of the transaction and an indication whether the transaction committed or aborted.
In many transaction processing systems, the log is managed by a component or subsystem termed a log manager. The log manager provides an interface by which other subsystems, such as the resource manager, a transaction manager and recovery manager, interact with the log. The log manager also provides a portion of each log record referred to as the log record header, which the log manager uses at recovery to identify the sequence of log records in the log, as well as the resource manager and transaction of each log record. For example, the following declaration defines the structure of an example generic log header.
The present invention addresses two problems in log design. A first problem is that of maintaining a persistently identifiable log end (i.e., identification of the last complete log record in sequence) while frequently appending log records to the log. After a failure, the only remaining data from which the log end can be identified is the data already stored out to disk prior to the failure. Accordingly, logs for some prior transaction processing systems have stored a pointer to the end of log at a separate location on disk, such as in a separate disk file or in a header portion of the log file. This end of log pointer, possibly together with other log identifying information, is sometimes referred to as the log "anchor." A drawback to this approach, however, is that it is inefficient to write to two separate locations on the disk (i.e., the log end pointer or anchor as well as the appended log records) each time log records are appended to the log. The movement of the write head of the disk drive between the log end and log anchor locations consumes a significant amount of the time to write log records. Further, since log records are frequently written while processing transactions, the extra time to update the log end pointer with each log write detrimentally affects the transaction throughput of the transaction processing system.
A second problem relates to log writes that span multiple sectors of the log file (herein called the "multi-sector write problem"). According to a general model of the disk file that contains the log, the log consists of a sequence of fixed-length sectors. Writes to the log are made in groups of one or more sectors. Further, a failure can occur at any time during a write to the log file, but will at most cause corruption of data in only one sector (i.e., the sector being written at the time of failure) and also lose data that was to be written in subsequent sectors of the multiple-sector write. The term failure here refers to system failures (such as may be caused by a software error or power outage), and not media failures (such as may be caused by physical damage to the storage media or disk itself). As a result, the initial sector (or prefix of sectors) of a multiple-sector write that was interrupted by a failure may be correct, while some "suffix" of sectors in the multiple-sector write may be corrupt or missing. However, the fact that the suffix of the multiple-sector write was obliterated by the failure may not be detectable during recovery after the failure. In order to avoid this problem, logs for some prior transaction processing systems limit the size of writes to the log to be a single disk sector. This may be much smaller than the size of log record needed to describe a transaction's update to a durable resource. (Common disk sector sizes are 4, 8 or 16 Kbytes, which may be smaller than the size of an individual database record.)
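The failure model described above can be made concrete with a small simulation. This is only a sketch of the model: the function `interrupted_write`, the 512-byte `SECTOR_SIZE`, and the choice to corrupt the in-flight sector by inverting its bits are assumptions made for illustration.

```python
SECTOR_SIZE = 512  # toy value chosen for the sketch, not taken from the text


def interrupted_write(disk: bytearray, start_sector: int, data: bytes, fail_at: int) -> None:
    """Write data sector by sector; a failure strikes while sector `fail_at`
    (0-based within this write) is being written."""
    assert len(data) % SECTOR_SIZE == 0
    for i in range(len(data) // SECTOR_SIZE):
        offset = (start_sector + i) * SECTOR_SIZE
        chunk = data[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]
        if i < fail_at:
            # Prefix of the write: transferred correctly.
            disk[offset:offset + SECTOR_SIZE] = chunk
        elif i == fail_at:
            # Sector in flight at the time of failure: corrupted
            # (modeled here by inverting every bit).
            disk[offset:offset + SECTOR_SIZE] = bytes(b ^ 0xFF for b in chunk)
        else:
            # Suffix of the write: never reaches the disk.
            break
```

After such a write, the first sectors look exactly like a successful write, which is why the loss of the suffix may go unnoticed unless the written data carries its own verification.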
In other words, the present invention addresses the problem of how to append data to a log file, such as for a transaction processing system, so that the act of appending persistently indicates the position of the last correctly written log data.
The present invention provides a technique for appending data in multiple sector writes to a log in non-volatile data storage in such a way that the act of writing the appended data alone indicates the last correctly written log data. The technique uses a cryptographic hash value of log data that is to be written as a block of one or more sectors appended to the log. The cryptographic hash value is written along with the block, such as in a header portion of the block. The cryptographic hash value serves as verification that the entire block was actually transferred into the non-volatile data storage sectors.
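A minimal sketch of this technique follows. The text does not specify a hash algorithm or a block layout, so several choices here are assumptions made for illustration: SHA-256 as the cryptographic hash, a 36-byte header holding the payload length and the digest, a 512-byte toy sector size, and the function names `make_block`, `read_block`, and `find_log_end`.

```python
import hashlib
import struct

SECTOR = 512                        # toy sector size (assumption)
HEADER = struct.Struct("<I32s")     # payload length, SHA-256 digest (assumed layout)


def make_block(payload: bytes) -> bytes:
    """Build a sector-aligned block: header, payload, zero padding."""
    pad = -(HEADER.size + len(payload)) % SECTOR
    body = payload + bytes(pad)
    # The digest covers every byte after the header, padding included.
    return HEADER.pack(len(payload), hashlib.sha256(body).digest()) + body


def read_block(log, offset: int):
    """Return (payload, next_offset) if a complete, valid block starts at
    offset; return None otherwise."""
    if offset + HEADER.size > len(log):
        return None
    length, digest = HEADER.unpack_from(log, offset)
    unpadded = HEADER.size + length
    end = offset + unpadded + (-unpadded % SECTOR)
    if end > len(log):
        return None
    body = bytes(log[offset + HEADER.size:end])
    if hashlib.sha256(body).digest() != digest:
        return None                 # the block was not fully transferred
    return body[:length], end


def find_log_end(log, offset: int = 0) -> int:
    """Scan forward block by block; the first block that fails verification
    marks the end of the correctly written log."""
    while True:
        result = read_block(log, offset)
        if result is None:
            return offset
        offset = result[1]
```

Because each block verifies itself, appending requires only the one write of the block; no separate end-of-log pointer is updated. A multiple-sector write whose suffix was lost fails the hash check, so `find_log_end` stops at the start of that block.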
According to a further aspect of the invention, the technique also uses the cryptographic hash value of the blocks to provide a truncate prefix operation. This operation truncates a block of sectors that forms a prefix of the log and contains stale log records (e.g., log records for completed transactions, or log records that have been copied forward to a later portion of the log in a checkpoint operation), by modifying a part of the log data in the block (e.g., incrementing a byte of a last sector in the block) so as to invalidate a verification check of the block using the cryptographic hash value.
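The truncate prefix operation can be sketched in the same spirit. As before, SHA-256, the 36-byte header layout, the 512-byte toy sector size, and the names `make_block`, `block_is_valid`, and `invalidate_block` are assumptions made for illustration. The sketch shows only the invalidation step on a single block held in memory.

```python
import hashlib
import struct

SECTOR = 512                        # toy sector size (assumption)
HEADER = struct.Struct("<I32s")     # payload length, SHA-256 digest (assumed layout)


def make_block(payload: bytes) -> bytearray:
    """Build a sector-aligned block whose digest covers all bytes after the header."""
    pad = -(HEADER.size + len(payload)) % SECTOR
    body = payload + bytes(pad)
    return bytearray(HEADER.pack(len(payload), hashlib.sha256(body).digest()) + body)


def block_is_valid(block) -> bool:
    """Recompute the hash over the block body and compare it with the stored digest."""
    _length, digest = HEADER.unpack_from(block, 0)
    return hashlib.sha256(bytes(block[HEADER.size:])).digest() == digest


def invalidate_block(block: bytearray) -> None:
    """Truncate a stale prefix block by incrementing one byte of its last
    sector (wrapping at 256), so that the hash check no longer succeeds."""
    block[-1] = (block[-1] + 1) % 256
```

Only one byte changes, so on disk the operation would amount to rewriting a single sector of the stale block rather than the whole block.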