In typical database systems, users store, update, retrieve and select information by submitting commands or statements to a database application. To be correctly processed, the statements must comply with the database language that is supported by the database application. One popular database language is known as Structured Query Language (SQL).
A logical unit of work that is comprised of one or more database language statements is referred to as a transaction. Examples of database language statements would be insert, for inserting data into a database, and query, for selecting data in a database.
When multiple users share a database system, it is impractical to allow only one transaction to execute at a time. However, when many transactions are allowed to execute at the same time, the issue of consistency arises. Transactional database management systems provide a special statement—“commit”—which a user or program submits to delineate the statements that make up a transaction. Before a transaction commits, all changes made by the transaction are considered temporary, and will be removed if the transaction fails to commit. At the time a transaction commits, all changes made by the transaction are made permanent, and can thereafter be seen by other transactions.
To ensure predictable and reproducible results, techniques have been developed which protect concurrently executing transactions from interacting with one another in uncontrolled ways. Transactions are designed to make database changes in such a way that the database, as a whole, moves from one consistent state to another.
Not only must transactions leave the database in a consistent state upon completion, but transactions must also see the database in a consistent state. This condition is difficult to satisfy because there may be concurrently executing transactions, each of which may be causing a temporary inconsistency visible only within that single transaction. Consequently, transactions must not be able to see the changes made by concurrently executing transactions until those transactions commit.
One approach to ensure that a transaction, to which a query belongs, does not see changes made by concurrently executing transactions is to prevent transactions from reading a data item version that has been updated until the transaction that updated the data item version commits. However, this approach reduces the concurrency in the system by causing transactions that wish to read data item versions (readers) to wait for transactions that have written to data item versions (writers) to commit.
Another way to prevent a query from seeing the changes made by concurrently execution transactions is to assign a “snapshot time” to the query, and then (1) cause the query to see all changes made by transactions that committed before the snapshot time, and (2) prevent the query from seeing changes made by any transactions that commit after the snapshot time. More specifically, when a query is part of a transaction, the query must see: (1) all the changes that were committed to the database on or before the query's snapshot time and (2) all the changes that the transaction, to which the query belongs, has made (even though that transaction has not yet committed).
It is possible for multiple versions of the same data item to be in cache within a database system. These versions include (1) the current version of a data item (hereinafter referred to as the current version), which includes all changes that have been made to that data item up to the present point in time, and (2) clones of the current version, which are either copies of the current version, or copies of other clones, made at various points in time (hereinafter referred to as clones). The set consisting of the current version and the clones of the current version will be referred to as the “data item versions” of the data item.
When a query specifies retrieval of a data item, the database system must determine which of the various versions of the data item can be provided to the query. With respect to a given query, each version of a data item falls into one of two categories: (1) versions that contain everything that the query must see (and have changes in the future of what the query needs. These future changes may be rolled back but it is hard to move older versions of data items forward in time), and (2) versions that are missing changes that the query must see.
In general, versions of data items that fall into the first category relative to a particular query can be used to answer that particular query after any changes that the query cannot see have been removed from the version of the data items. The process of removing changes from a version of a data item is referred to as “rollback”. Versions of data items that fall into the second category relative to a particular query cannot be used to answer that particular query.
One approach to providing data to a query would be to always start with the current version of a data item. By definition, the current version of the data item has all of the changes that have ever been made to the data item. Therefore, it always belongs to category 1 for all queries. Using the current version of the data item often requires (1) making a clone, and (2) rolling out of the clone all of the changes that the query should not see. However, if the snapshot time of the query is old, then there may be numerous changes that have to be removed from the clone, resulting in a rollback operation that consumes large amounts of time and resources. Furthermore, always rolling back changes from the current version only works well on a single node system because, in a multi-node system, a request for a data item could get caught in a situation of chasing the current version of the data item as current version is moved from node to node in a cluster.
If a system does not always start rollback operations using the current version, then the system must be able to determine which category clones fall into relative to the queries that the system is processing. If they fall into category one with respect to a given query, then they may be used to supply data to the query. To facilitate the categorization of clones, “removed-data information” may be stored with each data item version. The removed-data information generally indicates what changes made to a data item version have subsequently been removed from that version of the data item.
For example, assume that the current version of a data item is cloned to create clone X. Initially, the removed-data information of clone X will indicate that no data has been removed. If a rollback operation is performed on clone X to remove the changes made by transaction T1, then the removed-data information of clone X will be revised to indicate that the changes made by T1 have been removed from clone X. If a query does not need to see the changes made by T1, then clone X may be used to provide data to the query (additional rollbacks may be required). If the query needs to see the changes made by T1, then clone X cannot be used to provide data to the query.
Unfortunately, the removed-data information does not always accurately indicate all of the changes that (1) have been made to the data item, which (2) are not in the clone. Specifically, even when the removed-data information indicates that no data has been removed from a clone, there still remains the possibility that changes, which the query must see, are missing from the clone. Specifically, a query may have to see changes that were made to the current version subsequent to the creation of the clone. Therefore, this removed-data information is not sufficient to determine all data that may be missing from a data item version. In other words, a clone may be missing changes that have never been removed from the clone.
The following is an example of how a clone may end up missing data not reflected in the removed-data information of the clone. A transaction, TXB, is making updates to a current version. The current version is cloned, which results in clone Y. TXB continues to make updates to the current version. In this case, clone Y is now missing the updates that were made to the current version after it was cloned, but the removed-data information for clone Y does not indicate that any changes are missing.
One approach to indicate whether changes, not reflected in the removed-data information, are missing from a clone is to mark all existing clones as “old” when the current version of the data item is updated. The operation of marking clones as “old” is hereinafter referred to as invalidation. However, in a multi-node system (hereinafter referred to as a cluster) where the clones can be spread across many nodes, this is impractical because it would entail broadcasting an invalidation message for every update of every data item to all nodes in the cluster that could possibly include clones of the data item.
Based on the forgoing, what is needed is a mechanism for determining which data item versions can be provided to a query, where clones of the data item may reside in multiple nodes of a cluster.