1. Technical Field
The invention relates to distributed data stores. More particularly, the invention relates to a distributed data store with a designated master.
2. Description of the Prior Art
A data store offers insert and delete operations on a set of data items called a collection. Each data item in the collection is called an entry. The set of all possible entries is called the universal set. A tile is a subset of the universal set that exists on a physical machine. Insert operations add entries to the collection. Delete operations remove entries from the collection. A query operation specifies a subset of the universal set, and the data store indicates which elements of that subset are entries within the collection. A query is said to cover an entry if the subset specified by the query contains the entry.
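By way of illustration only, the operations defined above may be sketched as follows. The class name `DataStore` and the function name `covers` are hypothetical names chosen for this sketch, and a Python set is assumed to stand in for the collection; no particular prior system is represented.

```python
from typing import Hashable, Iterable, Set


class DataStore:
    """Illustrative single-machine data store (hypothetical name)."""

    def __init__(self) -> None:
        # The collection: the set of entries currently stored.
        self._collection: Set[Hashable] = set()

    def insert(self, entry: Hashable) -> None:
        """Insert operation: add an entry to the collection."""
        self._collection.add(entry)

    def delete(self, entry: Hashable) -> None:
        """Delete operation: remove an entry from the collection, if present."""
        self._collection.discard(entry)

    def query(self, subset: Iterable[Hashable]) -> Set[Hashable]:
        """Return the elements of the queried subset that are entries."""
        return {element for element in subset if element in self._collection}


def covers(query_subset: Set[Hashable], entry: Hashable) -> bool:
    """A query covers an entry if its specified subset contains the entry."""
    return entry in query_subset


store = DataStore()
store.insert("a")
store.insert("b")
store.delete("b")
answer = store.query({"a", "b", "c"})  # only "a" is still an entry
```

In this sketch the query subset is passed as an explicit set of elements; a practical data store would more likely accept a predicate or a range.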
Data is frequently maintained in several locations, i.e., on several machines or computers. A distributed data store is a data store implemented using multiple computers and communication links among the computers.
A distributed data store may provide redundancy, meaning that multiple computers may record the same entry. Redundancy can ensure high availability, meaning that the distributed data store can respond quickly to queries from different locations and can respond to many queries at once. Redundancy can also ensure failover, meaning that even when some computers fail, the distributed data store can continue to respond to queries accurately.
Each computer in a distributed data store covers some subset of the universal set, meaning that each computer records any entries in that subset. In a distributed data store providing redundancy, an entry may be covered by multiple computers. Thus, inserted entries are propagated to multiple computers. The times to propagate to different computers may differ, causing an insert to be recorded at some computers before others. Delete operations also propagate to multiple computers. Thus, an entry may be deleted from some computers before others.
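The effect of differing propagation times may be sketched as follows. This is a minimal simulation under stated assumptions: the names `Replica` and `state_at`, the integer arrival times, and the event tuple layout are all hypothetical and are used only to show how two computers covering the same entry can disagree for a period of time.

```python
class Replica:
    """One computer covering a subset of the universal set (hypothetical)."""

    def __init__(self, covered):
        self.covered = set(covered)  # subset of the universal set covered
        self.entries = set()         # entries currently recorded

    def apply(self, op, entry):
        # A computer records only entries in the subset it covers.
        if entry not in self.covered:
            return
        if op == "insert":
            self.entries.add(entry)
        elif op == "delete":
            self.entries.discard(entry)


def state_at(coverage, events, now):
    """Return the entries recorded at each computer at time `now`.

    coverage: dict mapping computer name -> covered subset
    events:   list of (arrival_time, computer_name, op, entry) tuples
    """
    replicas = {name: Replica(covered) for name, covered in coverage.items()}
    for arrival, name, op, entry in sorted(events):
        if arrival <= now:
            replicas[name].apply(op, entry)
    return {name: replica.entries for name, replica in replicas.items()}


coverage = {"A": {"x", "y"}, "B": {"x"}}
events = [
    (1, "A", "insert", "x"),  # the insert reaches computer A first
    (5, "B", "insert", "x"),  # and reaches computer B later
    (6, "A", "delete", "x"),  # the delete reaches A first as well
    (9, "B", "delete", "x"),
]
```

With these events, computer A records entry `x` at time 3 while computer B does not, and at time 7 the situation is reversed because the delete has reached A but not B.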
Consistency of data between machines is a major issue, because accurate transactions depend on agreement of data throughout the network. When data updates arrive at different machines variously before and after a transaction, conflicts and erroneous results may arise.
For example, if two withdrawals are made from the same bank account but from different locations, it is important that each location have the correct balance to avoid withdrawing more money than is in the account. This is an example of a transactional operation.
Up-to-date data is less important for non-transactional operations. Continuing the above example, deposits may safely be made to the same account from two different unconnected locations, as long as the balance is updated and consistent before any transactional operations, such as withdrawals, need to take place.
One concept used to ensure consistency is referred to as settling time. Under this rule, once a tile of data has settled to a time, the data in that tile is immutable and may not be altered. For example, if the settling time is one minute, then once a transaction has taken place on a volume of data, one minute must elapse before another operation on that data can take place.
In some prior systems, answers to queries may change over time, as transactions from various sources propagate through a network to different members of a distributed data store. Although such prior systems may maintain a record of all transactions and may notify the user if the answer to their query has changed, the changed information may already have been acted on and affected further queries. This violates the principle of immutability of data after the settling time has elapsed.
A variety of policies can be used to determine what the settling time should be and enforce it, but the concept suffers from some fundamental flaws.
One issue is that transactional inserts are slowed by having to wait for the end of the settling time, which may be relatively long. Additionally, establishing a reasonable settling time may be essentially impractical in a partitioned network for the following reasons:
In a partitioned network, each part of the network is independent and out of communication with the other; therefore, data may diverge over time. When the partitions are reconnected and the data synchronized, data older than the settling time may be altered. This violates the principle that data should be immutable once the settling time has elapsed.
If the settling time is stopped at the last connected time, no transactional inserts can be done and no transactional queries get any new data.
If the settling time is maintained on one side of the partition, no transactional inserts are possible on the other side of the partition.
Each computer is in its own frame of reference. Pure, classic synchronization between machines is just a commonly agreed upon frame of reference. The settling time approach is an effort to create a “fuzzy” frame of reference, but as soon as it is partitioned or separated from the network, it fails.
Unique Serial Identifiers.
One way to avoid duplicates of insert and delete operations is to issue a unique identifier to each operation. Each computer maintains a list of identifiers of operations processed. If an operation with an identifier in the list arrives, the computer ignores the operation. If the unique identifier is serial, that is, if it increases with each operation, then it can be used to impose a partial query ceiling, as follows:
Label each entry recorded in each computer with the greatest unique serial identifier of any insert operation on the entry.
For queries with unique serial identifiers before that of the label on the entry, ignore the entry.
This prevents an insert after a query starts from being included in the answer to the query. However, this does not prevent a delete after a query starts from affecting the answer to the query. It also introduces a potential error, as follows. Suppose that an entry is in the collection, a query starts, and then the entry is re-inserted. When the query is processed on the entry, the query ignores the entry because the entry label is after the query identifier, even though the entry was in the collection when the query started. Another potential shortcoming of unique serial identifiers is that they may need to be issued from a single site to ensure that they are unique and serial, which causes a bottleneck because each operation must access the single site before proceeding.
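The duplicate suppression, the partial query ceiling, and the two shortcomings just described may be sketched as follows. This is a minimal sketch under stated assumptions: the class name `Computer`, the two scenario functions, and the use of small integers as unique serial identifiers are hypothetical and are chosen only for illustration.

```python
class Computer:
    """One computer applying the unique serial identifier scheme (hypothetical)."""

    def __init__(self):
        self.processed = set()  # identifiers of operations already processed
        self.labels = {}        # entry -> greatest identifier of any insert on it

    def insert(self, op_id, entry):
        if op_id in self.processed:
            return False        # duplicate operation: ignore it
        self.processed.add(op_id)
        self.labels[entry] = max(op_id, self.labels.get(entry, op_id))
        return True

    def delete(self, op_id, entry):
        if op_id in self.processed:
            return False        # duplicate operation: ignore it
        self.processed.add(op_id)
        self.labels.pop(entry, None)
        return True

    def query(self, query_id, subset):
        # Partial query ceiling: ignore entries labeled after the query began.
        return {e for e in subset
                if e in self.labels and self.labels[e] <= query_id}


def reinsert_scenario():
    """Entry is present, a query starts, then the entry is re-inserted."""
    c = Computer()
    c.insert(1, "x")             # entry x is in the collection
    query_id = 2                 # the query starts here
    c.insert(3, "x")             # re-insert raises the label on x to 3
    return c.query(query_id, {"x"})


def late_delete_scenario():
    """Entry is present, a query starts, then the entry is deleted."""
    c = Computer()
    c.insert(1, "x")
    query_id = 2
    c.delete(3, "x")             # the delete is not held back by the ceiling
    return c.query(query_id, {"x"})
```

Both scenario functions return an empty set, although entry `x` was in the collection when the query with identifier 2 started. The first is the re-insertion error and the second is the unprevented delete.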
Timestamps.
The data store may label each operation with a timestamp, indicating the time at which the operation began. Timestamps are non-unique serial identifiers. They can be issued locally, avoiding the problem of all operations having to access a single site. However, they introduce the issue of ties, in which multiple operations begin at the same time according to different computers that issue the timestamps.
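The issue of ties may be illustrated with a short sketch. The function name `find_ties` and the tuple layout of an operation are hypothetical; the timestamps are assumed to be integer clock readings taken locally at each computer.

```python
from collections import defaultdict


def find_ties(operations):
    """Group operations by timestamp and return the groups that tie.

    operations: iterable of (timestamp, computer, description) tuples
    """
    by_time = defaultdict(list)
    for timestamp, computer, description in operations:
        by_time[timestamp].append((computer, description))
    # A tie exists wherever more than one operation carries the same timestamp.
    return {t: ops for t, ops in by_time.items() if len(ops) > 1}


operations = [
    (100, "A", "insert x"),  # computers A and B both read time 100 locally
    (100, "B", "delete x"),
    (101, "A", "insert y"),
]
ties = find_ties(operations)
```

Here the insert and the delete of `x` carry the same timestamp, so the timestamps alone do not determine which operation began first.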
Orderstamps.
The data store may label each operation with an approximate timestamp that is also a unique serial identifier. Such a label is referred to as an orderstamp. One way to create an orderstamp is to encode the time at which an operation begins in the high-order bits of a label and encode a unique identifier corresponding to the computer at which the operation begins in the low-order bits. The same time zone should be used to produce the time on all computers in the system. Then orderstamps can be compared to determine, up to the accuracy of clock skew among processors, the order in which operations began.
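One possible encoding of an orderstamp may be sketched as follows. The field width `NODE_BITS`, the function names, and the class `OrderstampIssuer` are assumptions made for this sketch only. The sketch also assumes that a computer never reuses a time value for two of its own operations, which the issuer enforces by advancing past its last issued time; this detail is not stated above.

```python
import time

NODE_BITS = 16  # assumed width of the computer identifier field


def make_orderstamp(time_value: int, node_id: int) -> int:
    """Place the time in the high-order bits and the computer id in the low-order bits."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id does not fit in NODE_BITS")
    return (time_value << NODE_BITS) | node_id


def split_orderstamp(stamp: int):
    """Recover (time_value, node_id) from an orderstamp."""
    return stamp >> NODE_BITS, stamp & ((1 << NODE_BITS) - 1)


class OrderstampIssuer:
    """Issues orderstamps locally at one computer (hypothetical name)."""

    def __init__(self, node_id: int, clock=time.time_ns):
        # time.time_ns() counts from the Unix epoch, so it does not depend
        # on the local time zone of the computer.
        self.node_id = node_id
        self._clock = clock
        self._last = -1

    def next(self) -> int:
        # Never reuse a time value on this computer, so stamps stay unique.
        now = max(self._clock(), self._last + 1)
        self._last = now
        return make_orderstamp(now, self.node_id)


a = make_orderstamp(1000, 1)
b = make_orderstamp(1000, 2)  # same time as a, different computer
c = make_orderstamp(1001, 0)  # later time, so greater than both a and b
```

Because the time occupies the high-order bits, a plain integer comparison orders stamps by time first, and the computer identifier breaks ties between computers that read the same time.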
It would be advantageous to provide a system and method for providing consistency and immutability across a distributed data network. It would also be advantageous to provide a system and method for providing consistency and immutability for transactional operations across a distributed data network.
In addition, it would be advantageous to provide a system and method for providing consistency and immutability for transactional operations across a distributed data store having a plurality of partitions. Furthermore, it would be advantageous to provide a system and method for providing consistency and immutability for transactional operations, while allowing non-transactional operations to proceed at a local computer.