The present invention generally relates to computer-systems and, more specifically, relates to a method and mechanism for reducing propagation delays in a database system.
A database server is a computerized entity that is used in managing a database, and a client is an application that requests information from a database server. A database system is a collection of one or more database servers. Each database server in a database system is called a node. A node in a distributed database system can perform the functions of a client, a database server, or both. In certain cases, a particular node may perform the functions of multiple database servers, multiple clients, or both. However, for explanation purposes, a node will be described as functioning as a single database server.
A multi-version database system is a database system that manages concurrency control via versions and snapshots of the same database. In a multi-version parallel database system, transactions executing on a plurality of database servers can concurrently access the same database. For a particular transaction, the versions of the data obtained from the database are typically the versions that were current at a particular point in time. The set of all database data that is current as of a particular point in time is commonly termed a xe2x80x9cdatabase snapshot.xe2x80x9d
To provide transactions with database snapshots, a multi-version parallel database system typically stamps each version of data with a logical timestamp. The logical timestamp assigned to a version of data is typically referred to as a version number. When a transaction requires data from a multi-version database, the database server assigns one or more logical timestamps to the transaction. The logical timestamps that are assigned to transactions are commonly called xe2x80x9csnapshot timesxe2x80x9d as they are used to determine the database snapshot that should be used with particular transactions. Snapshot times are generally provided to a transaction through the use of a logical clock that is maintained by each database server.
When a transaction commits, the transaction is assigned another logical timestamp. The logical timestamp assigned upon commit is referred to as the commit time of the transaction. All changes made by a transaction are considered to have been made at the commit time of the transaction. To supply an ordering to changes within a database, the logical clock of a database server is typically incremented after it is assigned as the commit time of a transaction. Consequently, transactions that commit are assigned commit times that are greater than previously committed transactions.
Thus, logical clocks maintained by database servers are used to provide logical timestamps for a variety of reasons (e.g. snapshot times, version numbers, commit times). Because each database server maintains its own logical clock, a transaction may be assigned a snapshot time that causes information to be read that is slightly older than the most current version. However, because the information obtained for a transaction is from a particular database snapshot, the transaction is guaranteed to receive consistent information that was current as of the particular snapshot time.
FIG. 1 depicts a multi-version parallel database system 100 in which a plurality of database servers (104, 106, and 108) provide snapshot times to transactions for accessing database snapshots from a multi-version database 102. In this example, the multi-version database 102 is represented using a plurality of database snapshots 114. Each database snapshot 114 represents a virtual state of data as it existed in multi-version database 102 as of database logical timestamps T23-T31. In this example, T31 represents the most recent database snapshot of multi-version database 102.
Respectively associated with the database servers 104, 106 and 108 are logical clocks 110, 112 and 113. Currently executing on database servers 104, 106 and 108 are transactions 116-118, 120-122 and 124-128. As depicted, each transaction is executing using data from a particular database snapshot.
When a process executing on a database server begins a transaction, the transaction is assigned a particular snapshot time based on the logical clock that is associated with the database server. The snapshot time guarantees that the version of the data items supplied to the process for the transaction will reflect the state of those data items as of the particular snapshot time. For example, if a transaction TX1 is assigned a transaction snapshot time of T1, the process executing the transaction is guaranteed to be supplied with data items that reflect a state as of time T1, even if the data items have been subsequently modified. However while the process is guaranteed to be provided with a consistent set of data items from the database as of the snapshot time, the process may see information that is slightly older than the most current version of the database.
A transaction is considered committed after its changes are made persistent in the database. In certain systems, to commit a transaction, the transaction requests the database server on which it is executing to assign it a commit time. The database server then obtains a commit time from its logical clock and assigns it to the transaction. After obtaining the commit time, the logical clock is then incremented. A log force is then initiated. The log force writes information (redo records) describing the changes made by the transaction to disk. Once the log force is complete, the changes are made persistent in the database. The database may then report to the user that the transaction is committed.
For example, when transaction 120 began executing on database server 106, the database server 106 accessed its logical clock 112 and assigned transaction 120 a snapshot time of T26. Based on a snapshot time of T26, the transaction 120 is provided with a database snapshot 114 as of the logical timestamp T26. Thus, transaction 120 is guaranteed that it will see a consistent set of data from the multi-version database 102 as of snapshot time T26. When transaction 120 completes and is ready to commit, the database server 106 will assign transaction 120 a commit time based on the current value of its logical clock 112. For example, the logical clock 112 of database server 106 may have incremented to a logical time value of T30. Transaction 120 would then be assign a commit time of T30 for committing its changes to the multi-version database 102. These changes can then be seen by transactions that are assigned a snapshot time of T30 or greater.
Because each database server maintains its own logical clock, whether a particular transaction will see a particular change made to the database depends not only on the current logical clock value associated with the database server in which the transaction is executing, but also on the logical clock values that are associated with the other database servers of the database system. Thus, by having each database server maintain its own logical clock, a time dependency is formed between the transactions executing on different database servers of the database system. This time dependency causes a propagation delay to be introduced into the database system that can cause updates that occurred prior to reading the database not to be seen.
A maximum propagation delay period is the maximum interval of time for which changes committed by transaction executing on one database server are propagated to the transactions executing on other database servers of the database system. This maximum propagation delay period guarantees that if a transaction commits changes from one database server, that the committed changes will be seen by all other database servers of the system if a query is performed after the maximum propagation delay period. Thus, if the maximum propagation delay period is greater than zero, then the changes caused by a transaction committing on one database server are not guaranteed to be seen by other database servers if a query is performed after the transaction commits but before the maximum propagation delay period expires.
For example, a husband may make a deposit into an account A using transaction 118. If when the transaction 118 completes it is assigned a commit time of T31 by database server 104, when the changes are written into the multi-version database 102 they will have a database logical timestamp of T31. If after transaction 118 commits, the husband""s wife attempts to obtain the balance of account A using transaction 120 on database server 106, because transaction 120 was assigned a snapshot time of T26 by database server 106, the balance of account A will not reflect the prior deposit that was made by the husband. Therefore, as long as the logical clock 112 is less than T31, any transaction that is assigned a snapshot time on database server 106 will not see the changes made to the multi-version database 102 by transaction 118.
One method of reducing the maximum propagation delay period in a multi-version parallel database system is by synchronizing the logical clocks associated with each database server on a periodic basis. For example, the propagation delay period can be reduced to a maximum of three seconds by having a centralized process periodically requesting each database server to send to the centralized process its current logical clock""s time value every three seconds. After receiving the logical clock values from all of the database servers, the centralized process can then determine the greatest time value and send it to each database server of the system. Each database server then uses the delivered time value to cause its own logical clock to represent a time that is at least as great (i.e. recent) as the delivered time value.
Having a central process that synchronizes the logical clocks associated with each database server on a periodic basis can be used to reduce the maximum propagation delay period to a certain point. However, as the synchronizing period is reduced to achieve a smaller maximum propagation period, the overhead associated with constantly sending messages to each database server quickly becomes unreasonable. In addition, in certain systems, it is imperative to maintain a maximum propagation delay of zero to prevent data anomalies. However, using a central process for sending messages to periodically synchronize the logical clocks associated with each database server cannot completely eliminate the propagation delay as there will always be a period of time when the logical clocks are not synchronized.
One approach to implementing a maximum propagation delay of zero in a multi-version parallel database system involves requiring each database server to consult a global logical clock service to obtain a most recent logical time value (timestamp) for its logical clock before it assigns a snapshot time to a transaction. For example, one database server, called a global clock service, can be assigned the task of running a global clock. When other database servers need to assign a transaction a snapshot time, they send a message to the global clock service requesting a timestamp. Upon receipt of such messages, the global clock service would generate a timestamp, either by reading its hardware clock or, more easily, by incrementing a software-based serial number counter, and send the timestamp to the requester.
This approach is known as the xe2x80x9con-demandxe2x80x9d approach and is in common use today. However, the xe2x80x9con-demandxe2x80x9d approach has a substantial drawback. As the system grows, a greater number of database servers must communicate with the global clock server. As a result, more time is spent waiting for the global clock server to handle all the requests. Certain improvements to the global clock server approach can be made. However, the basic design is ultimately limited by the single global point of timestamp generation, which can become a performance bottleneck for the entire system. In addition, and possibly more importantly, each process executing a transaction is required to block itself while waiting for a timestamp to be obtained from the global clock, thus causing a context switch to occur. These context switches typically require several thousand instructions and thus inducing a large amount of overhead into a database system.
Another approach for reducing propagation delays between a plurality of database servers that have access to a common database involves associating each database server with a logical clock. In response to initiating a commit of a transaction executing on a database server, a commit time for the transaction is determined and broadcast to one or more other database servers. The broadcast is overlapped with a transaction log force. Upon receiving the commit time, the database servers compare the transmitted commit time to the time indicated by their logical clock. If the commit time is greater than the time indicated by their logical clock, then the database server sets its logical time to reflect a time that is at least as recent as the time reflected by the transmitted commit time.
While this approach is able to overlap the transaction log force and the broadcast of the commit times, the approach still incurs a significant cost for broadcasting the commit timestamps. If there are N database servers, then Nxe2x88x921 commit messages and Nxe2x88x921 commit acknowledgments must be sent. The transaction is considered committed only after the Nxe2x88x921 commit acknowledgments have been received for the Nxe2x88x921 commit messages. As the number of commit messages sent increases, there is a significant increase in both the share of the CPU consumed and the latency of other messages (e.g. lock requests, query response) that may be sent by the database servers.
Based on the foregoing, it is highly desirable to provide a mechanism for reducing the cost of the immediate propagation of commit times between database servers.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.