Terminology and some general concepts used herein can be found in "Transaction Processing: Concepts and Technology" by Jim Gray and Andreas Reuter; Morgan Kaufmann Publishers, San Francisco, Calif.; 1993.
In prior art implementations of Database Management Systems (e.g., the DMSII product of the assignee hereof), it was observed that the process of managing the transaction log was a major system throughput bottleneck in SMP computer systems with a large number of processors. This was true for a number of reasons. These issues and their solutions are addressed in the following paragraphs.
1. Semaphore Contention
In prior art implementations there was a single Exclusive Semaphore (herein called the Log Lock) that was used to control all of the processes for managing the database transaction log, including: serializing the insertion of log records into the log buffers, check-summing the log buffers, initiating and terminating the disk write operations to the log file, and opening and closing the transaction log files.
On an SMP computer system, at most one processor's worth of work can be accomplished by any collection of tasks which are contending for the resources controlled by a single exclusive semaphore. This is the saturation effect.
Additionally, there is a high processor overhead cost in managing the FIFO contention for this exclusive semaphore in a manner that avoids the convoy phenomenon (which is known to be even more undesirable). This cost increases exponentially as the semaphore approaches saturation.
Reduction in semaphore contention when generating log images is one step that results in increased throughput to the database transaction log.
The logging process is made up of a number of discrete steps. While the final results of these steps must give the appearance of being performed serially, some of the steps may be performed in parallel or even out of order. Four distinct steps are identified, allowing the use of three exclusive semaphores which increases the semaphore granularity of the log algorithm.
These four steps allow the log algorithm to be implemented as a pipeline, greatly increasing the overall throughput of transaction logging. The steps are: (1) Insertion of log records into the transaction log and insertion of log buffers into the log buffer queue; these operations are protected by the Log Lock. (2) The process that dequeues log buffers for writing to disk, which is guarded by the Log Queue Lock. (3) The process of calculating the checksum value and initiating the asynchronous disk write operation for the log buffer, which needs no mutual exclusion semaphore. (4) The process of waiting for the completion of log buffer disk writes such that the log constraint guarantee can be maintained in a serial fashion; this is protected by the Log IOC Lock.
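The four steps can be illustrated with a minimal sketch. It is not the DMSII implementation: the class name PipelinedLog, its methods, the fixed record count per buffer, and the in-memory dictionary standing in for the log file are all assumptions made for illustration, and the disk write is simulated synchronously rather than issued asynchronously.

```python
import threading
import zlib
from collections import deque


class PipelinedLog:
    """Hypothetical four-step log pipeline guarded by three locks."""

    def __init__(self, buffer_capacity=4):
        self.log_lock = threading.Lock()        # step 1: insert records, enqueue buffers
        self.log_queue_lock = threading.Lock()  # step 2: dequeue buffers for writing
        self.log_ioc_lock = threading.Lock()    # step 4: serial completion
        self.buffer_capacity = buffer_capacity  # records per buffer (assumed unit)
        self.top = []                           # buffer currently being filled
        self.queue = deque()                    # full buffers awaiting a disk write
        self.next_seq = 0
        self.disk = {}                          # seq -> (checksum, payload); stand-in for the log file
        self.durable_seq = -1                   # highest buffer known written with no gaps

    def insert(self, record):
        # Step 1: under the Log Lock, add the record and enqueue a full buffer.
        with self.log_lock:
            self.top.append(record)
            if len(self.top) >= self.buffer_capacity:
                # deque.append is atomic in CPython, so it is safe against popleft below.
                self.queue.append((self.next_seq, b"".join(self.top)))
                self.next_seq += 1
                self.top = []

    def dequeue(self):
        # Step 2: under the Log Queue Lock, take the oldest queued buffer.
        with self.log_queue_lock:
            return self.queue.popleft() if self.queue else None

    def checksum_and_write(self, seq, payload):
        # Step 3: no semaphore held; each task works on its own buffer.
        self.disk[seq] = (zlib.crc32(payload), payload)

    def complete(self):
        # Step 4: under the Log IOC Lock, advance completion strictly in order.
        with self.log_ioc_lock:
            while self.durable_seq + 1 in self.disk:
                self.durable_seq += 1
            return self.durable_seq
```

In this sketch, writing buffer 1 before buffer 0 leaves complete() at -1 until buffer 0 is also written, which mirrors the requirement that completion appear serial even when the writes themselves do not occur in order.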
2. Log Buffer Limitations
The prior art DMSII product utilized two transaction log buffers so that one (the "TOP" buffer) is being filled by the transaction log record images that are being generated by the operation of the database management system, while the second buffer (the "BOTTOM" buffer) is performing asynchronous physical disk write operations to the transaction log file. When the TOP buffer becomes full and the BOTTOM buffer has been written to disk, the two buffers are swapped, and the insertion of log records continues in the new, empty TOP buffer. The use of only two buffers assumes that the write operation of the BOTTOM buffer can always complete before the TOP buffer is filled.
Increases in physical I/O performance have been outpaced by increases in processor speed, the number of processors available on SMP computer systems, and the number of application programs that are now being run on large SMP systems. Consequently, the BOTTOM buffer will seldom be ready for use by the time that the TOP buffer is filled. Logging activity (and therefore transaction processing) can be halted while waiting for log disk write operations to complete.
Two techniques are employed to prevent the rapid filling of the TOP buffer from halting transaction logging activity.
The maximum buffer size is allowed to increase dynamically; the final size may exceed the log buffer specification as declared in the database description.
Using larger log buffers increases the efficiency of the disk write operation. Each disk write operation incurs some fixed amount of processor overhead for initiation and termination of the write operation, plus the I/O costs directly related to the amount of data being written.
Since the total amount of log data to be written is inflexible, decreasing the number of disk write operations requires an increase in the amount of data transferred in each write operation. This is accomplished by increasing the size of the log buffers.
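The trade-off can be made concrete with a small cost model. The function name, its parameters, and the abstract integer cost units are assumptions for illustration only; the model simply adds a fixed overhead per write to a cost proportional to the data written.

```python
def total_write_cost(total_bytes, buffer_bytes, fixed_cost_per_write, cost_per_byte):
    """Hypothetical cost model: fixed overhead per write plus a per-byte transfer cost."""
    # Number of writes needed to move all log data, rounded up (integer ceiling).
    writes = -(-total_bytes // buffer_bytes)
    return writes * fixed_cost_per_write + total_bytes * cost_per_byte


# Same amount of log data, two buffer sizes (all figures are made-up units).
small = total_write_cost(1_000_000, 10_000, 50, 1)    # 100 writes
large = total_write_cost(1_000_000, 100_000, 50, 1)   # 10 writes
```

Under these assumed figures the per-byte term is identical in both cases; only the fixed overhead shrinks, from 100 writes' worth to 10 writes' worth.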
In addition, the number of buffers is increased in conjunction with the implementation of the pipelined transaction log algorithm mentioned above. Increasing the number of log buffers allows the system to queue log write activity during those peak periods when log buffers can be filled faster than they can be written to disk. This technique extends the amount of time that the system can sustain those peak periods before disk write activity begins to negatively impact transaction processing performance. The number of buffers required is a function of the amount of transaction log record data generated during peak processing periods, the size of the log buffers, and the disk write bandwidth of the log file.
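One rough way to relate these three quantities is a backlog estimate. The model below is an assumption, not a formula from the DMSII product: it treats the peak as a steady burst, computes the data that cannot be written during the peak, and adds two buffers for the one being filled and the one being written.

```python
def buffers_needed(peak_bytes, peak_seconds, write_bandwidth, buffer_bytes):
    """Hypothetical estimate of log buffers needed to ride out one peak period.

    peak_bytes      -- log data generated during the peak
    peak_seconds    -- length of the peak
    write_bandwidth -- sustained log file write rate, bytes per second
    buffer_bytes    -- size of one log buffer
    """
    # Data generated during the peak that the disk cannot absorb in that time.
    backlog = max(0, peak_bytes - write_bandwidth * peak_seconds)
    queued = -(-backlog // buffer_bytes)  # integer ceiling
    # Assumed: one buffer is always being filled and one is always being written.
    return queued + 2
```

For example, with 100,000,000 bytes generated over 10 seconds, a 6,000,000 byte-per-second log file, and 1,000,000-byte buffers, this estimate gives 42 buffers; when the disk keeps up with the peak, it falls back to the two buffers of the prior art scheme.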
3. End of Transaction (ETR) Commit Issues
End of Transaction commit guarantees that all data modifications made by the transaction are recoverable once the task has received the "transaction completed" response from the DBMS. This is accomplished by forcing the TOP buffer to be written to disk as soon as an ETR log record is inserted into it, even though there may be space in the TOP buffer for additional log records. This naive solution usually requires log disk writes to occur more frequently than would otherwise be necessary, especially with a large number of application tasks. This imposes a severe throughput performance penalty on the transaction log process.
Prior art Database Management Systems leave the TOP buffer as the TOP buffer until such time as it is actually possible to write that buffer to the log file. After inserting the ETR record into the TOP buffer, the log algorithm waits (with the Log Lock free) until the BOTTOM buffer (the only other log buffer in prior art Database Management Systems) is I/O complete. At that point the TOP buffer can be written. During this wait, the log algorithm allows other tasks to add their log records, including additional ETR records, to the TOP buffer. This is known as the ETR boxcar effect; it maximizes the size of the TOP buffer while minimizing the number of buffers that are written to the log disk file.
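The benefit of the boxcar effect can be shown with a small simulation. The function name, the use of abstract integer time units, and the assumption that the TOP buffer never fills during the wait are all illustrative simplifications.

```python
def boxcar_write_count(etr_arrivals, write_time):
    """Count log writes when ETR records that arrive during a write share the next one.

    etr_arrivals -- sorted arrival times of ETR records (assumed input)
    write_time   -- duration of one log disk write, in the same time units
    """
    writes = 0
    disk_free_at = 0
    i = 0
    n = len(etr_arrivals)
    while i < n:
        # The next write starts when the oldest waiting ETR has arrived
        # and the previous write has completed.
        start = max(etr_arrivals[i], disk_free_at)
        # Every ETR that has arrived by then rides in the same buffer.
        while i < n and etr_arrivals[i] <= start:
            i += 1
        writes += 1
        disk_free_at = start + write_time
    return writes
```

With ten ETR records arriving one time unit apart and a write taking five units, this model issues three writes, compared with ten when every ETR forces its own write.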
4. Disk Write Complete Waiting
Unfortunately, the ETR boxcar effect introduces a serious semaphore contention problem. A large number of tasks (one for each ETR record in the TOP buffer) will all contend for the privilege of queuing and writing that TOP buffer to the log file, all at once, and immediately when the previous log buffer disk write operation completes. All these tasks are contending for the privilege of being the one to actually queue the TOP log buffer and get the next disk write operation started. Of course, only the first task will need to perform the actual work, whereas the other tasks will only find, in a serial fashion, that the buffer with their ETR record is no longer the TOP buffer and is already written.
The solution is to separate the first write waiter task from all subsequent waiter tasks (on the same buffer). This first waiter assumes the responsibility for the complete process of (1) waiting for the previous buffer to be complete, (2) queuing the TOP buffer, (3) starting the disk write on that queued buffer, and (4) waiting for the disk write operation to complete. At that point an event associated with the log buffer is caused that will wake up all the other ETR tasks for the write completion of that buffer. Note that as all the other ETR-waiting tasks wake up, they can immediately continue with their processing without serially contending for the identical semaphore.
While the foregoing describes the separation of first and later waiters in an ETR commit context, the process is actually generalized for all tasks that have the need to wait for the write completion of a particular log buffer.
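The separation of the first waiter from later waiters can be sketched as follows. The names LogBuffer, wait_for_write, and do_write are hypothetical, and the short per-buffer claim lock used to pick the first waiter is an assumption of this sketch rather than a detail taken from the text; do_write stands in for the whole sequence of steps (1) through (4).

```python
import threading


class LogBuffer:
    """Hypothetical log buffer with a write-complete event."""

    def __init__(self):
        self.claim_lock = threading.Lock()  # held only long enough to pick the first waiter
        self.claimed = False
        self.written = threading.Event()    # caused once the disk write has completed


def wait_for_write(buf, do_write):
    """Return True if the caller was the first waiter and performed the write."""
    with buf.claim_lock:
        first = not buf.claimed
        buf.claimed = True
    if first:
        do_write(buf)        # wait for previous buffer, queue, start write, wait for completion
        buf.written.set()    # wake every later waiter at once
    else:
        buf.written.wait()   # later waiters block on the event, not on a semaphore
    return first
```

Because later waiters block on the event, they all resume together when it is set and do not queue up behind one another to discover that the buffer has already been written.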
5. Log Bandwidth to Disk
The throughput improvements described above streamline the transaction log generation process within an SMP processor and allow more processor cycles to be available for other, non-log-related processing. However, overall throughput to the transaction log is still limited by the speed at which log data can be written to the disk system.
This problem of disk write I/O bandpass is addressed with the introduction of Section Log Files. A logical transaction log file can be composed of one or more physical disk files (each called a log section file). The number of log section files to use is determined manually, based upon SMP computer system configuration and application workload characteristics. Assuming that each log section file is allocated on a separate physical disk medium, the maximum write bandpass to the logical transaction file will be the sum of the bandpasses of the individual physical disk media. Log buffers are sequentially distributed, in a round-robin fashion, among the log sections.
The log algorithm can initiate asynchronous write operations to each of the log section files; thus, for S log section files, S simultaneous log disk write operations can be initiated. Before starting a log disk write, it is only necessary to wait for the previous log buffer write to the same log section. Because additional log buffers will have been written to other log sections between any two log buffers that physically reside in the same section, there is an increased possibility that no actual waiting will need to occur prior to initiating the disk write operation for a log buffer.
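The round-robin distribution and the per-section wait rule can be expressed in a few lines. The function names and the zero-based buffer sequence numbering are assumptions for illustration.

```python
def section_for(buffer_seq, num_sections):
    """Log section that receives a buffer under round-robin distribution."""
    return buffer_seq % num_sections


def must_wait_for(buffer_seq, num_sections):
    """Sequence number of the previous buffer written to the same section, or None."""
    prev = buffer_seq - num_sections
    return prev if prev >= 0 else None
```

With S = 3, buffers 0 through 5 land in sections 0, 1, 2, 0, 1, 2. Buffer 4 needs to wait only for buffer 1, and by the time buffer 4 is ready, buffers 2 and 3 have already been directed to the other two sections.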
It is necessary to guarantee that log buffer disk write operations appear to complete in a serial fashion, independent of the parallelism occurring during the disk writes.
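One way to provide this guarantee is to record each physical completion as it happens but release buffers to waiters only in sequence order. The class below is a sketch under that assumption; the name SerialCompletion is hypothetical, and in a multitasking system its method would run under a lock such as the Log IOC Lock.

```python
class SerialCompletion:
    """Hypothetical tracker that releases out-of-order completions in serial order."""

    def __init__(self):
        self.done = set()          # writes finished but not yet released
        self.next_to_release = 0   # lowest buffer sequence number not yet released

    def write_finished(self, seq):
        """Record a finished write; return the buffers that may now be reported complete."""
        self.done.add(seq)
        released = []
        # Release only a gap-free run starting at the oldest outstanding buffer.
        while self.next_to_release in self.done:
            self.done.remove(self.next_to_release)
            released.append(self.next_to_release)
            self.next_to_release += 1
        return released
```

If the writes for buffers 1 and 2 finish before the write for buffer 0, nothing is released until buffer 0 finishes, at which point buffers 0, 1, and 2 are released together and in order.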