The present invention relates to distributed transactions on key-value stores.
Scalable distributed data stores are increasingly used for storing large datasets in diverse applications. Scale-out data stores based on key-value abstractions are commonly used as backend or cloud storage for various applications. Systems like CloudDB, Megastore and Spire support OLTP applications on such stores.
The need for transactional support in these applications has motivated several recent efforts. A common theme underlying these efforts is the creation of disjoint groups of objects (entity-groups) on which efficient local transactional support is provided using multi-version concurrency control. A lock-based protocol is used to support distributed transactions across entity-groups. A significant drawback of this scheme is that the latency of a distributed transaction increases with the number of entity-groups it operates on. This is due to the commit overhead of the local transactions and the network overhead of the distributed locks.
Key-value stores provide atomic updates of any key. However, many applications that use key-value stores would like to execute transactions that involve multiple keys, rather than just update one key. Distributed transactions on key-value stores exhibit different bottlenecks depending on the degree of lock contention between transactions. Lock acquisition overhead dominates the transaction execution time when the contention for locks is low. Lock-waiting time dominates the transaction execution time when the contention for locks is high.
Locks are typically implemented by augmenting data objects with an “isLocked” field. Using the atomic read-modify-write operation that is natively supported by the key-value stores, a data object can be locked by changing its “isLocked” field value. A transaction acquires the lock on the object before updating it. Although this approach is simple and efficient for modifying a single object, it is not suitable for multi-object distributed transactions. In a scale-out key-value store, data is routinely re-partitioned and re-assigned to other nodes to balance load across the cluster. Such movement of data scatters the locks across different nodes (lock dispersion), significantly increasing lock acquisition overhead. Moreover, re-partitioning of a set of data objects by the underlying key-value store is heavily influenced by their total size, which includes the size of all previous versions of the objects, the size of their indices, and the size of their caches. Because the locks are stored together with the data objects, the locks are moved unnecessarily whenever the data is moved, even though the locks do not contribute significantly to the size of the data objects.
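The “isLocked” approach described above can be illustrated with a minimal sketch. The `KVStore` class, its `check_and_set` method, and the helper names `try_lock`, `unlock` and `update_with_lock` are hypothetical and stand in for whatever atomic read-modify-write primitive a particular key-value store exposes; an in-process mutex merely emulates the store's per-key atomicity.

```python
import threading


class KVStore:
    """Toy in-memory stand-in for a key-value store (hypothetical API)."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()  # emulates the store's per-key atomicity

    def put(self, key, value):
        with self._mutex:
            self._data[key] = dict(value)

    def get(self, key):
        with self._mutex:
            return dict(self._data[key])

    def check_and_set(self, key, field, expected, new):
        """Atomically set obj[field] = new only if it currently equals expected."""
        with self._mutex:
            obj = self._data[key]
            if obj.get(field) != expected:
                return False
            obj[field] = new
            return True


def try_lock(store, key):
    # Lock the object by flipping its "isLocked" field from False to True.
    return store.check_and_set(key, "isLocked", False, True)


def unlock(store, key):
    # Release the lock by flipping "isLocked" back to False.
    return store.check_and_set(key, "isLocked", True, False)


def update_with_lock(store, key, field, value):
    """Acquire the object's lock, update one field, then release the lock."""
    if not try_lock(store, key):
        return False  # another transaction holds the lock
    try:
        obj = store.get(key)
        obj[field] = value
        store.put(key, obj)
    finally:
        unlock(store, key)
    return True
```

Because the lock in this sketch is a field of the data object itself, it necessarily resides on whichever node holds that object, which is why re-partitioning the data also relocates the lock.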
Furthermore, lock acquisition is sequential and synchronous (locks must be acquired one by one, in a pre-defined order, to prevent deadlocks), unlike data updates and lock releases, which can happen in parallel and asynchronously.
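The asymmetry between acquisition and release can be sketched as follows. The `LockTable` class and the functions `acquire_all` and `release_all` are hypothetical names introduced only for illustration, and sorted key order is assumed as the pre-defined order. As a simplification, this sketch backs off and returns when a lock is already held rather than waiting for it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockTable:
    """Toy table of per-key lock flags with an atomic check-and-set (hypothetical)."""

    def __init__(self, keys):
        self._locked = {k: False for k in keys}
        self._mutex = threading.Lock()  # emulates the store's atomic operation

    def check_and_set(self, key, expected, new):
        with self._mutex:
            if self._locked[key] != expected:
                return False
            self._locked[key] = new
            return True


def release_all(table, keys):
    """Release locks in parallel; the order of release does not matter."""
    if not keys:
        return
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        list(pool.map(lambda k: table.check_and_set(k, True, False), keys))


def acquire_all(table, keys):
    """Acquire locks one by one in sorted key order.

    Returns the list of held keys on success, or None if any lock was
    already taken (in which case the locks acquired so far are released).
    """
    held = []
    for key in sorted(set(keys)):  # a fixed global order prevents deadlocks
        if not table.check_and_set(key, False, True):
            release_all(table, held)
            return None
        held.append(key)
    return held
```

Each acquisition step in this sketch must complete before the next begins, so the acquisition cost grows with the number of locks, whereas the releases are issued concurrently.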