The invention relates to a method and mechanism for implementing and operating upon ordered records or objects. A database system is an example of a type of computing system that creates and operates upon ordered records. In database systems, a “transaction” normally refers to an atomic set of operations performed against a database. The transaction may access, create, modify, or delete database data or database metadata while it is being processed. A “commit” occurs when the transaction has completed its processing and any changes to the database by the transaction are ready to be “permanently” implemented in the database system. Because the transaction is atomic, all actions taken by the transaction must appear to be committed at the same time.
Ordered records, such as transaction log records, can be maintained in a database system, e.g., to allow suitable recovery operations in the event of a system failure or aborted transaction. Some common problems that could cause a system failure or an aborted transaction include hardware failure, network failure, process failure, database instance failure, data access conflicts, user errors, and statement failures in the database access programs (most often written in the Structured Query Language, or SQL).
Different types of transaction log records can be maintained in a database system. A common transaction logging strategy is to maintain “redo” records that log all changes made to the database. Each redo record contains information that can be used to modify a portion of a database, e.g., a database block, from one state to its next changed state. If a failure occurs, then the redo records may be applied in order to restore any changes made to the in-memory copy of the database. “Undo” records can also be maintained for all changes in the database. The undo records contain information that can be used to roll back or reverse a portion of a database from a later state to its next earlier state. In one approach, separate records can be maintained for the redo and undo information.
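By way of illustration only, the relationship between redo and undo records can be sketched as follows. The names `RedoRecord`, `UndoRecord`, `apply_redo`, and `apply_undo` are hypothetical and do not appear in the description above; the sketch assumes that a database block can be modeled as a single string value and that redo and undo information are kept in separate records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedoRecord:
    block_id: int    # database block the change applies to
    new_value: str   # state of the block after the change


@dataclass(frozen=True)
class UndoRecord:
    block_id: int    # database block the change applies to
    old_value: str   # state of the block before the change


def apply_redo(blocks, redo_records):
    """Roll blocks forward by applying redo records in logged order."""
    for rec in redo_records:
        blocks[rec.block_id] = rec.new_value


def apply_undo(blocks, undo_records):
    """Roll blocks back by applying undo records in reverse logged order."""
    for rec in reversed(undo_records):
        blocks[rec.block_id] = rec.old_value


# Three changes: block 1 twice, block 2 once.
redo_log = [RedoRecord(1, "A1"), RedoRecord(2, "B1"), RedoRecord(1, "A2")]
undo_log = [UndoRecord(1, "A0"), UndoRecord(2, "B0"), UndoRecord(1, "A1")]

blocks = {1: "A0", 2: "B0"}
apply_redo(blocks, redo_log)
rolled_forward = dict(blocks)

apply_undo(blocks, undo_log)
rolled_back = dict(blocks)
```

In this sketch, applying the redo records in order moves each block to its latest state, while applying the undo records in reverse order returns each block to its earliest state.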
With “write ahead logging”, the redo records logged for a data item must be recorded to disk before the data item can be written to disk. This protects against the situation when a system failure occurs and the version of the database data that is immediately restored from disk does not accurately reflect the most recent state of the database. This may occur because of changes to the data that have only occurred to the in-memory buffer cache, and have not been recorded to disk before the failure. If the on-disk redo log has been properly maintained for these cache-only changes, then recovery can be performed by applying redo records from the on-disk redo log to roll the database forward until it is consistent with the state that existed just before the system failure. An approach for implementing redo records is disclosed in U.S. Pat. No. 6,647,510, issued on Nov. 11, 2003, which is hereby incorporated by reference in its entirety.
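The write-ahead rule described above can be illustrated with a minimal sketch. The class `TinyDatabase` and its method names are hypothetical, and the "disk" is modeled with ordinary in-memory containers; the sketch is not the approach of the incorporated patent, only an example of the ordering constraint between redo records and data blocks.

```python
class TinyDatabase:
    """Toy model of write-ahead logging (all names are illustrative)."""

    def __init__(self):
        self.disk_data = {}     # data blocks as stored on disk
        self.disk_redo = []     # on-disk redo log
        self.cache = {}         # in-memory buffer cache
        self.redo_buffer = []   # redo records not yet written to disk

    def change(self, block_id, value):
        # Log the change first, then modify only the in-memory copy.
        self.redo_buffer.append((block_id, value))
        self.cache[block_id] = value

    def flush_redo(self):
        # Persist all pending redo records, e.g., at commit time.
        self.disk_redo.extend(self.redo_buffer)
        self.redo_buffer.clear()

    def write_block(self, block_id):
        # Write-ahead rule: redo reaches disk before the data block does.
        self.flush_redo()
        self.disk_data[block_id] = self.cache[block_id]

    def crash_and_recover(self):
        # A failure loses everything that was held only in memory.
        self.cache = {}
        self.redo_buffer = []
        # Roll the on-disk data forward using the on-disk redo log.
        for block_id, value in self.disk_redo:
            self.disk_data[block_id] = value


db = TinyDatabase()
db.change(1, "v1")
db.change(1, "v2")
db.flush_redo()          # redo is on disk; block 1 itself is cache-only
db.change(2, "lost")     # never flushed, so it cannot be recovered
db.crash_and_recover()
```

After recovery in this example, block 1 holds its most recent logged state even though the block itself was never written to disk, while the unflushed change to block 2 is absent.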
In one approach for implementing redo, as each change is made to the database system, a redo record corresponding to the change is written to an in-memory redo buffer. The contents of the in-memory redo buffer are regularly flushed to an on-disk redo log to persistently store the redo records. In this approach, all redo records for the system are stored in a single in-memory redo buffer.
Having a single in-memory redo buffer provides a way of allowing different execution entities that generate redo records in the database (e.g., threads, processes, tasks, etc.) to coordinate the manner in which they allocate space in the on-disk redo log, and thereby coordinate their claims to space in the pre-allocated disk locations for their respective redo records.
However, this approach can suffer from efficiency drawbacks. For example, consider the situation when multiple execution entities are concurrently making changes to the database, and are therefore concurrently generating redo records. This is a common scenario on large multi-threaded/multi-processor systems in which many thousands or tens of thousands of concurrent events may be processed at the same time against a database. A bottleneck may develop as the multiple execution entities contend for space at the head of the in-memory redo buffer to allocate space for their respective redo records. In effect, the requirement to allocate space in the in-memory redo buffer logically causes serialization to occur for the parallel tasks being performed by the multiple execution entities. This serialization can significantly interfere with the performance and scalability of the system.
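The serialization effect described above can be sketched as follows, assuming, for illustration only, that space in the single in-memory redo buffer is claimed under one lock. The names `SingleRedoBuffer` and `worker` are hypothetical.

```python
import threading


class SingleRedoBuffer:
    """One shared redo buffer; every writer contends for the same lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.records = []

    def append(self, payload):
        # Only one execution entity at a time can claim the next position,
        # so concurrent writers are serialized at this point.
        with self._lock:
            position = len(self.records)
            self.records.append((position, payload))
            return position


def worker(buffer, worker_id, count):
    for i in range(count):
        buffer.append(f"worker{worker_id}-change{i}")


shared = SingleRedoBuffer()
threads = [
    threading.Thread(target=worker, args=(shared, w, 1000)) for w in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

positions = [pos for pos, _ in shared.records]
```

The resulting buffer is totally ordered, but that ordering is obtained by forcing every writer through the same critical section, which is the contention point discussed above.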
Accordingly, the present invention provides an improved method, mechanism, and system for implementing, generating, and maintaining ordered (and partially ordered) records, such as, for example, redo records, redo buffers, and redo logs in a database system. In one embodiment, multiple parallel sets of records may be created and combined into a partially ordered (or non-ordered) group of records, which are later collectively ordered or sorted as needed to create an ordered set of records. With respect to database systems, the redo generation bottleneck can be minimized by providing multiple in-memory redo buffers that are available to hold redo records generated by multiple threads of execution. When the in-memory redo buffers are written to a persistent storage medium, no specific ordering needs to be specified with respect to the redo records from the different in-memory redo buffers. While the collective group of records may not be ordered, the written-out redo records may be partially ordered based upon the ordered redo records from within individual in-memory redo buffers. At recovery, ordering and/or merging of redo records may occur to satisfy database consistency requirements. These actions solve the redo generation bottleneck problem because, in addition to the use of multiple in-memory redo buffers, the precise points in the disk-based redo logs do not have to be allocated in advance. Instead, only the range of the on-disk location is identified. Rather than tracking and ordering this information up front, the burden of specifically identifying and ordering the redo records from the on-disk redo log is moved to the time of recovery. This approach therefore significantly reduces the redo generation bottleneck and makes the redo generation process highly scalable.
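A minimal sketch of this arrangement is given below for illustration. It assumes that each redo record carries a sequence value (`seq`) by which records can be ordered at recovery, and that records within any one in-memory buffer are appended in increasing `seq` order; the field name, the class `MultiBufferRedo`, and the modeling of the on-disk log as a list of chunks are all hypothetical and are not taken from the description above.

```python
import heapq
from typing import NamedTuple


class RedoRecord(NamedTuple):
    seq: int         # assumed ordering value carried by each record
    block_id: int    # database block the change applies to
    new_value: str   # state of the block after the change


class MultiBufferRedo:
    """Several in-memory redo buffers flushed without cross-buffer ordering."""

    def __init__(self, num_buffers):
        self.buffers = [[] for _ in range(num_buffers)]
        self.disk_log = []  # each entry is one flushed chunk of records

    def log(self, buffer_index, record):
        # Records within a single buffer are appended in seq order.
        self.buffers[buffer_index].append(record)

    def flush(self):
        # Buffers are written out in an arbitrary order (reversed here to
        # make the point); only the order inside each chunk is preserved.
        for buf in reversed(self.buffers):
            if buf:
                self.disk_log.append(list(buf))
                buf.clear()

    def recover(self, blocks):
        # Ordering is deferred to recovery: merge the internally ordered
        # chunks into one fully ordered stream, then apply it.
        for rec in heapq.merge(*self.disk_log, key=lambda r: r.seq):
            blocks[rec.block_id] = rec.new_value
        return blocks


redo = MultiBufferRedo(num_buffers=2)
redo.log(0, RedoRecord(1, 7, "a"))
redo.log(1, RedoRecord(2, 7, "b"))
redo.log(0, RedoRecord(3, 7, "c"))
redo.flush()

on_disk_order = [rec.seq for chunk in redo.disk_log for rec in chunk]
recovered = redo.recover({})
```

In this example the on-disk log is only partially ordered, yet merging the chunks at recovery applies the three changes to block 7 in their original sequence, so the block ends in its most recent state.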
Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.