1. Technical Field
The invention relates to distributed data stores. More particularly, the invention relates to a distributed data store with an orderstamp to ensure progress.
2. Description of the Prior Art
Overview
A data store offers insert, delete, and query operations on a set of data items called a collection. Each data item in the collection is called an entry. The set of all possible entries is called the universal set. Insert operations add entries to the collection. Delete operations remove entries from the collection. A query operation specifies a subset of the universal set, and the data store indicates which elements of that subset are entries within the collection. A query is said to cover an entry if the subset specified by the query contains the entry. A distributed data store is a data store implemented using multiple computers and communication links among the computers.
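The three operations defined above can be illustrated with a minimal single-computer sketch. The class name `DataStore` and its method names are hypothetical, chosen only for illustration; a distributed data store would spread this state across multiple computers.

```python
class DataStore:
    """Hypothetical single-computer data store holding a collection of entries."""

    def __init__(self):
        self.collection = set()

    def insert(self, entry):
        # Add the entry to the collection.
        self.collection.add(entry)

    def delete(self, entry):
        # Remove the entry from the collection if it is present.
        self.collection.discard(entry)

    def query(self, subset):
        # Report which elements of the specified subset are entries
        # within the collection.
        return {e for e in subset if e in self.collection}


store = DataStore()
store.insert("alice")
store.insert("bob")
store.delete("bob")

# The query covers "alice", "bob", and "carol"; only "alice" is an entry.
result = store.query({"alice", "bob", "carol"})
```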
A distributed data store may provide redundancy, meaning that multiple computers may record the same entry. Redundancy can ensure high availability, meaning that the distributed data store can respond quickly to queries from different locations and can respond to many queries at once. Redundancy can also ensure failover, meaning that even when some computers fail, the distributed data store can continue to respond to queries accurately.
Each computer in a distributed data store covers some subset of the universal set, meaning that each computer records any entries in that subset. In a distributed data store providing redundancy, an entry may be covered by multiple computers. Thus, inserted entries are propagated to multiple computers. The times to propagate to different computers may differ, causing an insert to be recorded at some computers before others. Delete operations also propagate to multiple computers. Thus, an entry may be deleted from some computers before others.
Concerns
Inconsistency Due to Settling
Differences in propagation times for inserts and deletes can cause inserts and deletes to arrive at different computers in different orders. While an insert has arrived at some but not all of the computers to be affected by the insert, a query that covers the entry yields a different result depending on which computer the data store uses to answer the query. The same is true while a delete has arrived at some but not all of the computers to be affected by the delete. This is referred to as inconsistency due to settling.
Inconsistency Due to Order of Operations
Differences in propagation times can also cause inconsistencies that remain, even after a set of operations completes. This is referred to as continuing inconsistency. For example, an insert operation for an entry may begin, followed by the start of a delete operation for that entry, followed by the start of another insert operation for the same entry. One computer may receive these operations in the order they started, i.e. inserting the entry, deleting it, and inserting it again. Another computer may receive the operations in a different order, i.e. inserting the entry, inserting the entry again, and deleting the entry. If the data store treats multiple inserts of a common entry as a single insert of the entry then, after these operations, the second computer records that the entry is not in the collection. The first computer, on the other hand, records that the entry is in the collection. A query that covers the entry then gets a different result, depending on which computer the data store uses to answer the query.
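The example above can be reproduced with a short sketch. The function `apply_ops` and the entry name `"x"` are hypothetical; the sketch assumes each computer applies operations in arrival order and treats repeated inserts of an entry as a single insert.

```python
def apply_ops(ops):
    """Apply (kind, entry) operations in arrival order and return the collection."""
    collection = set()
    for kind, entry in ops:
        if kind == "insert":
            collection.add(entry)      # repeated inserts collapse into one
        else:
            collection.discard(entry)  # delete removes the entry if present
    return collection


# Order in which the operations started.
started_order = [("insert", "x"), ("delete", "x"), ("insert", "x")]
# Order in which a second computer happens to receive them.
reordered = [("insert", "x"), ("insert", "x"), ("delete", "x")]

first_computer = apply_ops(started_order)
second_computer = apply_ops(reordered)
```

After both sequences complete, `first_computer` contains the entry and `second_computer` does not, so the inconsistency persists even though no operation is still in transit.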
Inconsistency Due to Duplicate Operations
Within a distributed data store, the same insert or delete operation may arrive at a computer multiple times due to communication errors, changes in communication routes, or redundant routes. Also, propagation delays can cause these duplicates of operations to arrive in different orders at different computers. Thus, computers processing duplicates of operations can create continuing inconsistency.
Inconsistency Due to Synchronization
Failure of computers can lead to failure of some insert or delete operations to arrive at some computers. When functionality is restored after a computer failure, the computer may synchronize with other computers that cover overlapping portions of the universal set to avoid inconsistencies caused by inserts and deletes that occur while the computer is not operating. This synchronization at recovery time, combined with propagation delays, can cause a type of continuing inconsistency referred to as inconsistency due to synchronization. For example, suppose computer A receives an insert of an entry, then a delete of that entry, and then computer A fails. Meanwhile, computer B receives the insert of the entry. While the delete of the entry is still in transit to computer B, computer A restarts and synchronizes with computer B, receiving the information that the entry is in the collection. After the synchronization, the delete arrives at computer B. Now computer A records that the entry is in the collection, and computer B records that the entry is not in the collection.
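The sequence of events in this example can be traced step by step. The sketch below models each computer as a plain set and assumes, purely for illustration, that synchronization means the restarting computer adopts the entries its peer currently records; the source does not specify a synchronization protocol.

```python
computer_a = set()
computer_b = set()

# Computer A receives the insert, then the delete, and then fails.
computer_a.add("x")
computer_a.discard("x")

# Meanwhile, computer B receives the insert; the delete is still in transit.
computer_b.add("x")

# Computer A restarts and synchronizes with computer B.
# Assumption: synchronization copies the peer's current entries.
computer_a = set(computer_b)

# After the synchronization, the delete arrives at computer B.
computer_b.discard("x")

a_has_entry = "x" in computer_a
b_has_entry = "x" in computer_b
```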
Query Ceiling
The data store may take a long time to answer a query, especially if the subset specified by the query includes many entries in the data store. While the query is being answered, inserts and deletes may occur on entries covered by the query. This can lead to undesirable query answers in some cases. For example, suppose there is a query on a database of entries corresponding to people, and the purpose is to determine the relative frequencies of different last names. Suppose the data store handles the query in alphabetical order of last names. Suppose the data store is ingesting many new entries as the query progresses. Then the query results are inaccurate because the relative frequencies of last names early in alphabetical ordering are underestimated and the relative frequencies of last names late in alphabetical ordering are overestimated. Avoiding this kind of problem is called imposing a query ceiling.
Prior Art
Mutual Exclusion
One well-known way to avoid inconsistency due to settling is to impose mutual exclusion, allowing either only queries or only inserts and deletes to be in progress at any time. The start of any query operation is delayed until all insert and delete operations in progress reach all affected computers, and the start of any insert or delete operation is delayed until all query operations in progress have completed. This form of mutual exclusion imposes a query ceiling by explicitly avoiding inserts and deletes during a query. Similarly, one way to avoid inconsistency due to order of operations is to impose mutual exclusion between inserts and deletes, never allowing both inserts and deletes to be in progress at once. A shortcoming of mutual exclusion is that it causes delays in the distributed system, thus slowing performance.
Counting Inserts and Deletes for Each Entry
Another way to avoid inconsistency due to order of operations is to count for each entry how many inserts and deletes have been received. An entry is in the collection only if the number of inserts is greater than the number of deletes. A shortcoming is that this scheme suffers errors if a computer receives and processes duplicates of insert and delete operations. Also, the desired semantics are often such that multiple inserts followed by a single delete should remove an entry from the collection. Counting does not support such semantics.
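A minimal sketch of the counting scheme makes both its benefit and its shortcomings concrete. The class `CountingReplica` and the helper `final_state` are hypothetical names introduced for illustration only.

```python
from collections import Counter


class CountingReplica:
    """Hypothetical computer that counts inserts and deletes per entry."""

    def __init__(self):
        self.inserts = Counter()
        self.deletes = Counter()

    def receive(self, kind, entry):
        if kind == "insert":
            self.inserts[entry] += 1
        else:
            self.deletes[entry] += 1

    def contains(self, entry):
        # An entry is in the collection only if inserts outnumber deletes.
        return self.inserts[entry] > self.deletes[entry]


def final_state(kinds, entry="x"):
    """Deliver the listed operations on one entry and report membership."""
    replica = CountingReplica()
    for kind in kinds:
        replica.receive(kind, entry)
    return replica.contains(entry)


# Arrival order no longer matters: both orders give the same answer.
in_order = final_state(["insert", "delete", "insert"])
reordered = final_state(["insert", "insert", "delete"])

# One insert followed by one delete should leave the entry absent...
intended = final_state(["insert", "delete"])
# ...but if the insert is delivered twice, the entry wrongly remains.
with_duplicate = final_state(["insert", "insert", "delete"])
```

The last case also shows the semantic limitation: two genuine inserts followed by a single delete leave the entry in the collection, whereas the desired behavior is often that the single delete removes it.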
Unique Serial Identifiers
One way to avoid duplicates of insert and delete operations is to issue a unique identifier to each operation. Each computer maintains a list of identifiers of operations processed. If an operation with an identifier in the list arrives, the computer ignores the operation. If the unique identifier is serial, that is, if it increases with each operation, then it can be used to impose a partial query ceiling, as follows:
Label each entry recorded in each computer with the greatest unique serial identifier of any insert operation on the entry.
For queries with unique serial identifiers before that of the label on the entry, ignore the entry.
This prevents an insert after a query starts from being included in the answer to the query. However, this does not prevent a delete after a query starts from affecting the answer to the query. It also introduces a potential error, as follows. Suppose that an entry is in the collection, a query starts, and then the entry is re-inserted. When the query is processed on the entry, the query ignores the entry because the entry label is after the query identifier, even though the entry was in the collection when the query started. Another potential shortcoming of unique serial identifiers is that they may be issued from a single site to ensure they are unique and serial, which causes a bottleneck because each operation must access the single site before proceeding.
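The duplicate filter, the partial query ceiling, and its two weaknesses can be sketched together as follows. The class `Replica`, its method names, and the integer identifiers are hypothetical; the sketch assumes identifiers are issued as increasing integers and that a delete simply removes the entry and its label.

```python
class Replica:
    """Hypothetical computer using unique serial identifiers."""

    def __init__(self):
        self.seen = set()   # identifiers of operations already processed
        self.labels = {}    # entry -> greatest insert identifier seen

    def receive(self, op_id, kind, entry):
        if op_id in self.seen:
            return          # duplicate of a processed operation: ignore
        self.seen.add(op_id)
        if kind == "insert":
            self.labels[entry] = max(op_id, self.labels.get(entry, op_id))
        else:
            self.labels.pop(entry, None)

    def query(self, query_id, subset):
        # Ignore entries whose label is after the query's identifier.
        return {e for e in subset
                if e in self.labels and self.labels[e] < query_id}


r = Replica()
r.receive(1, "insert", "x")
r.receive(1, "insert", "x")   # duplicate delivery, ignored
r.receive(2, "insert", "y")

QUERY_ID = 3                   # the query starts here
covered = {"x", "y", "z"}
at_start = r.query(QUERY_ID, covered)

r.receive(4, "insert", "z")    # insert after the query starts
after_new_insert = r.query(QUERY_ID, covered)

r.receive(5, "insert", "x")    # re-insert of an existing entry
after_reinsert = r.query(QUERY_ID, covered)

r.receive(6, "delete", "y")    # delete after the query starts
after_delete = r.query(QUERY_ID, covered)
```

In this trace the later insert of `"z"` is correctly excluded, but the re-insert of `"x"` causes `"x"` to drop out of the answer, and the later delete of `"y"` also changes the answer.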
Timestamps
The data store may label each operation with a timestamp, indicating the time at which the operation began. Timestamps are non-unique serial identifiers. They can be issued locally, avoiding the problem of all operations having to access a single site. However, they introduce the issue of ties, in which multiple operations begin at the same time according to the computers that issue the timestamps.
Orderstamps
The data store may label each operation with an approximate timestamp that is also a unique serial identifier. Such a label is referred to as an orderstamp. One way to create an orderstamp is to encode the time at which an operation begins in the high-order bits of a label and encode a unique identifier corresponding to the computer at which the operation begins in the low-order bits. The same time zone should be used to produce the time on all computers in the system. Then orderstamps can be compared to determine, up to the accuracy of clock skew among computers, the order in which operations began.
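The bit layout described above can be sketched as follows. The 16-bit width for the computer identifier, the microsecond time unit, and the function names are assumptions made for illustration; the source does not fix any of them. The sketch also assumes a computer does not begin two operations within the same clock tick, since otherwise two labels from that computer would collide.

```python
import time

NODE_BITS = 16                      # assumed width of the computer identifier
NODE_MASK = (1 << NODE_BITS) - 1


def make_orderstamp(time_us, node_id):
    """Pack the time into the high-order bits and the computer id into the low-order bits."""
    if not 0 <= node_id <= NODE_MASK:
        raise ValueError("node_id does not fit in NODE_BITS")
    return (time_us << NODE_BITS) | node_id


def split_orderstamp(stamp):
    """Recover (time_us, node_id) from an orderstamp."""
    return stamp >> NODE_BITS, stamp & NODE_MASK


def now_orderstamp(node_id):
    # time.time_ns() counts from the Unix epoch, so every computer
    # uses the same time reference regardless of local time zone.
    return make_orderstamp(time.time_ns() // 1000, node_id)


# Two operations that begin at the same time on different computers tie as
# timestamps, but their orderstamps differ and still compare consistently.
a = make_orderstamp(1_000, node_id=1)
b = make_orderstamp(1_000, node_id=2)
c = make_orderstamp(1_001, node_id=0)
ordered = sorted([c, b, a])
```

Because the time occupies the high-order bits, ordinary integer comparison orders labels by time first and breaks ties by computer identifier.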
Pseudo-Time
D. Reed, Naming and Synchronization in a Decentralized Computer System, MIT/LCS/TR-205, MIT (1978) refers to orderstamps as pseudo-time. The thesis teaches methods to use pseudo-time to maintain consistency in a distributed data store. Those methods are very conservative, aborting operations that might interfere with each other. A drawback of those methods is the possibility of what the thesis calls dynamic thrashing, in which operations may be delayed indefinitely by having other operations cause aborts each time the operations are retried.