1. Technical Field
This invention relates generally to database systems and methods. More specifically, the present invention relates to parallelized redo-only logging and recovery for highly available main-memory database systems.
2. Description of Related Art
A main-memory database management system (MM-DBMS) keeps a database in main memory to take advantage of the continuously improving price/density ratio of available memory chips. This memory-centered database architecture simplifies the DBMS and enables the MM-DBMS to exploit hardware computing power, such as high-speed L2 cache memory, better than a disk-resident DBMS (DR-DBMS), where the database is kept on disk. For database designers and application developers, the simplicity of a DBMS translates into ease of optimizing the performance of the overall database system and its applications.
While the benefit of the MM-DBMS has been well recognized for read-oriented transactions, the MM-DBMS can also achieve higher performance than the DR-DBMS in update transactions, because updates in the MM-DBMS incur only sequential disk accesses: appending the update logs to the end of the log file and occasionally checkpointing the updated database pages to the backup copy resident on disk.
Logging is essential for an MM-DBMS to recover a consistent database state in case of a system failure. Recovery involves first loading the backup database into memory and then replaying the log in the serialization order. Checkpointing allows the old portion of the log file to be discarded and thus shortens the log replay time. Between these two types of run-time disk accesses, the logging performance is more critical than the recovery performance. If an MM-DBMS relies on a single log device for the simplicity of enforcing the serialization order during log replay, its update throughput during logging is bounded by the contention on the single log buffer and the I/O bandwidth of the log device.
To address this bottleneck, multiple log disks have been used for parallel logging. However, a naïve parallel logging scheme pays the cost of merging the log records distributed over multiple log disks into the serialization order during recovery. To overcome this problem, Lee et al. proposed so-called differential logging, which exploits a full degree of parallelism in both logging and recovery. See Juchang Lee, Kihong Kim, and Sang K. Cha, “Differential Logging: A Commutative and Associative Logging Scheme for Highly Parallel Main Memory Database,” Proceedings of ICDE Conference, pp. 173-182, 2001.
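The merge cost of naïve parallel logging can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every log record carries a global log sequence number (`lsn`) and a full after-image; the `LogRecord` type and both function names are hypothetical and are not taken from any cited scheme. Each log disk is sorted by `lsn` on its own, yet recovery must still merge all disks into one ordered stream before replaying.

```python
import heapq
from typing import List, NamedTuple


class LogRecord(NamedTuple):
    lsn: int       # hypothetical global sequence number (serialization order)
    offset: int    # byte offset of the update in the database image
    after: bytes   # after-image written by the update


def merged_replay(image: bytearray, log_disks: List[List[LogRecord]]) -> bytes:
    """Replay after-images in serialization order via a k-way merge on lsn."""
    for rec in heapq.merge(*log_disks, key=lambda r: r.lsn):
        image[rec.offset:rec.offset + len(rec.after)] = rec.after
    return bytes(image)


def replay_disk_by_disk(image: bytearray, log_disks: List[List[LogRecord]]) -> bytes:
    """Replay each disk independently, ignoring the order across disks."""
    for disk in log_disks:
        for rec in disk:
            image[rec.offset:rec.offset + len(rec.after)] = rec.after
    return bytes(image)


disk_a = [LogRecord(1, 0, b"aa"), LogRecord(4, 0, b"dd")]
disk_b = [LogRecord(2, 0, b"bb"), LogRecord(3, 2, b"cc")]

correct = merged_replay(bytearray(b"0000"), [disk_a, disk_b])
wrong = replay_disk_by_disk(bytearray(b"0000"), [disk_a, disk_b])
```

In this example the two disks both update bytes 0 and 1, so replaying them independently leaves the older value from `disk_b` in place, whereas the merged replay ends with the value written by the record with the highest `lsn`. The merge is what forces recovery through a single ordered stream.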
The differential logging scheme uses a bit-wise XOR operation, both associative and commutative, for a redo operation as well as an undo operation so that the log records, each of which contains the bit-wise XOR difference between the after and before images, can be replayed in a manner independent of their serialization order during recovery. Such order independence enables distribution of log records to an arbitrary number of log disks, leading to almost linear scale-up of the update throughput during logging until it is bound by either the CPU power or the I/O bandwidth.
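The order independence described above can be sketched in a few lines of Python. This is an illustrative model only, assuming a byte-array database and log records of the form `(offset, diff)`; the helper names `xor_bytes`, `apply_update`, and `replay` are hypothetical and do not come from the cited paper.

```python
import random
from typing import List, Tuple

DiffRecord = Tuple[int, bytes]  # (byte offset, before XOR after)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bit-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def apply_update(db: bytearray, offset: int, after: bytes,
                 log: List[DiffRecord]) -> None:
    """Apply an update in memory and log only the XOR difference."""
    before = bytes(db[offset:offset + len(after)])
    log.append((offset, xor_bytes(before, after)))
    db[offset:offset + len(after)] = after


def replay(image: bytearray, log: List[DiffRecord]) -> bytes:
    """XOR every logged difference into the image, in whatever order given."""
    for offset, diff in log:
        for i, d in enumerate(diff):
            image[offset + i] ^= d
    return bytes(image)


backup = b"00000000"           # state at the last checkpoint
db = bytearray(backup)
log: List[DiffRecord] = []
apply_update(db, 0, b"abcd", log)
apply_update(db, 2, b"WXYZ", log)   # overlaps the first update
apply_update(db, 0, b"12", log)     # overlaps it again

shuffled = log[:]
random.shuffle(shuffled)
recovered = replay(bytearray(backup), shuffled)
```

Because XOR is commutative and associative, `recovered` equals the final in-memory state for every permutation of the log, even though the three updates overlap. XOR-ing a record into the image a second time cancels it, which is why the same operation can serve as both redo and undo. Each record stores one difference image rather than separate before and after images.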
Not only the logging time but also the recovery time can be scaled down proportionally by replaying the log records on each log disk independently in a single pass. Even the process of loading the backup database partitioned over multiple disks may proceed in parallel with the process of replaying the logs, once the main-memory database is initialized with zeros. In addition to the benefit of parallel execution, the differential logging scheme also reduces the log volume to almost half of that of conventional redo/undo logging.
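A minimal Python sketch of this parallel recovery follows. It rests on the observation that XOR-ing a backup page into zero-initialized memory is the same as loading it, so backup partitions and log disks can be handled as interchangeable streams. Giving each worker a private zeroed image and XOR-combining the partial results afterwards is an assumption made here to keep the sketch free of data races; it is not stated in the cited work, and `DB_SIZE`, `replay_stream`, and `recover` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

DB_SIZE = 8
Record = Tuple[int, bytes]  # (byte offset, bytes to XOR in at that offset)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def xor_into(image: bytearray, offset: int, data: bytes) -> None:
    for i, b in enumerate(data):
        image[offset + i] ^= b


def replay_stream(stream: List[Record]) -> bytes:
    """Process one disk (backup partition or log disk) in a single pass."""
    partial = bytearray(DB_SIZE)          # private, zero-initialized image
    for offset, data in stream:
        xor_into(partial, offset, data)
    return bytes(partial)


def recover(streams: List[List[Record]]) -> bytes:
    """Process all disks concurrently, then XOR the partial images together."""
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        partials = list(pool.map(replay_stream, streams))
    memory = bytearray(DB_SIZE)           # database initialized with zeros
    for partial in partials:
        xor_into(memory, 0, partial)
    return bytes(memory)


# Backup image "AAAABBBB" partitioned over two backup disks.
backup_disk_0 = [(0, b"AAAA")]
backup_disk_1 = [(4, b"BBBB")]
# Two updates after the checkpoint, logged as XOR differences on separate disks.
log_disk_0 = [(2, xor_bytes(b"AABB", b"xxxx"))]   # bytes 2..5 become "xxxx"
log_disk_1 = [(4, xor_bytes(b"xxBB", b"yyyy"))]   # bytes 4..7 become "yyyy"

recovered = recover([backup_disk_0, backup_disk_1, log_disk_0, log_disk_1])
```

Here `recovered` is `b"AAxxyyyy"`, the state after both updates, regardless of which worker finishes first, because no stream depends on another having been processed before it.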
Similarly, in the area of non-differential logging, there is also a need for an efficient logging scheme that can exploit massive parallelism.