Computers are very powerful tools for accessing and storing vast amounts of information. Computer databases are a common mechanism for storing information on computer systems. A typical database is a collection of “tables” having “rows” with “columns” of information. For example, a database table of employees may have a row for each employee where each row contains columns designating specifics about the employee, such as the employee's name, address, salary, etc.
A database management system (DBMS) is typically provided as a software “layer” on top of the database itself (i.e., the data actually stored on a non-volatile storage device(s)). The DBMS controls and coordinates access to the database by other “client” software applications. Typically, all requests from clients to retrieve and store data in the database are processed by the DBMS. Thus, the client software applications may be viewed as a software layer on top of the DBMS with the DBMS being an intermediary software layer between the client applications and the database. A DBMS and the database it manages are often referred to collectively as just a “database system”.
In recent years, the need for client applications to be able to operate on very large database datasets has spurred the development of large-scale distributed database systems. A large-scale distributed database system typically is a database system in which the DBMS and/or the database is/are distributed among multiple computer systems. Large-scale distributed database systems often support highly-parallel database data processing computation. Today, some large-scale distributed database systems manage anywhere from hundreds of gigabytes to multiple petabytes of database data and are distributed over tens, hundreds, even thousands of computer systems.
Large-scale distributed database systems typically support only basic database functionality and may not support a full relational database model as a trade-off for being able to scale up to support highly-parallel client applications such as those that can be found in some cloud computing environments. For example, some large-scale distributed database systems support only simple query syntax and do not provide full Structured Query Language (SQL) or join support. In addition, some of these systems provide only single atomic writes based on row locks and provide only limited transactional support as a trade-off for reduced overhead in supporting strongly consistent distributed transactions. Many of these systems include a distributed, column-oriented database. One example of a distributed, column-oriented database is Google's Bigtable. See F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data”, OSDI, 205-218, USENIX Association, 2006. An open-source example of a large-scale distributed database system is Apache HBase currently available from the Apache Software Foundation at the Internet domain hbase.apache.org.
Recently, in an effort to make it easier for developers of client applications to reason about the state of the large-scale distributed databases that the client applications read from and write to, solutions have been developed to provide support for multi-row ACID (Atomic, Consistent, Isolated, and Durable)-compliant transactions with snapshot isolation semantics (or just “multi-row transactions” for short). With snapshot isolation, typically all row reads from the database within a transaction “see” a consistent snapshot of the database that remains unaffected by any other concurrent transactions. Further, any row writes to the database within the transaction typically are committed to the database only if none of the row writes conflict with any concurrent write committed to the database since that snapshot. To provide snapshot isolation, some of these solutions store in the database multiple time-stamped versions of each data item, a technique known as Multi-Version Concurrency Control (MVCC). A potential benefit of MVCC is more efficient row reads because reading a data item from a row typically does not require acquiring a lock on the row. Further, MVCC may protect against write-write conflicts. For example, if multiple transactions running concurrently write to the same cell (e.g., row/column pair), at most one of the transactions will be allowed to commit its write to the cell. Google's Percolator system built on top of its Bigtable distributed database is one example of a large-scale distributed database system that provides support for multi-row transactions. See “Large-scale Incremental Processing Using Distributed Transactions and Notifications”, Daniel Peng, Frank Dabek, Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010, a PDF copy of which is currently available via HTTP at /research/pubs/archive/36726.pdf in the www.google.com Internet domain.
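The snapshot isolation behavior described above can be illustrated with a minimal, single-process sketch. It is not the implementation of Percolator or of any particular system; the class names (MVCCStore, Transaction), the in-memory dictionary, and the integer counter used as a timestamp source are all assumptions made for illustration. Each cell keeps a list of time-stamped versions, reads return the newest version at or before the transaction's start timestamp, and a commit is refused if any written cell has a version committed after that timestamp.

```python
import itertools


class MVCCStore:
    """Toy in-memory multi-version store (illustrative only, not thread-safe)."""

    def __init__(self):
        # cell -> list of (commit_ts, value), in ascending commit_ts order
        self._versions = {}
        # Hypothetical timestamp source: a simple increasing counter
        self._clock = itertools.count(1)

    def next_ts(self):
        return next(self._clock)

    def begin(self):
        return Transaction(self, self.next_ts())

    def read_at(self, cell, snapshot_ts):
        # Newest version committed at or before the snapshot; no lock is taken
        for commit_ts, value in reversed(self._versions.get(cell, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def latest_commit_ts(self, cell):
        versions = self._versions.get(cell)
        return versions[-1][0] if versions else 0

    def append_version(self, cell, commit_ts, value):
        self._versions.setdefault(cell, []).append((commit_ts, value))


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered writes: cell -> value

    def read(self, cell):
        if cell in self.writes:  # a transaction sees its own writes
            return self.writes[cell]
        return self.store.read_at(cell, self.start_ts)

    def write(self, cell, value):
        self.writes[cell] = value

    def commit(self):
        # Write-write conflict check: refuse to commit if any written cell
        # has a version committed after this transaction's snapshot.
        for cell in self.writes:
            if self.store.latest_commit_ts(cell) > self.start_ts:
                return False
        commit_ts = self.store.next_ts()
        for cell, value in self.writes.items():
            self.store.append_version(cell, commit_ts, value)
        return True
```

In this sketch, if two transactions that started from the same snapshot both write the cell ("employee1", "salary"), the first call to commit() succeeds and the second returns False, while a reader that began before either commit continues to see the older value.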
Some current solutions implement multi-row transactions with an additional software layer (transaction service) that executes on top of an existing large-scale distributed database system (e.g., HBase, Bigtable, etc.). In some cases, a design goal of such solutions is to avoid requiring modifications to the existing systems. As a result, these solutions generally do not integrate locking functionality for implementing multi-row transactions into the underlying database system. Nor do these solutions typically employ a centralized global deadlock detection process as that may hinder horizontal scaling of the system. Consequently, locks for implementing multi-row transactions may be explicitly maintained by the transaction service itself.
Current multi-row transaction services for large-scale distributed databases may implement multi-row transactions with a two-phase commit transaction protocol. During a transaction initiated by a client application, row writes within the transaction may be buffered until the client commits the transaction, at which point the transaction service initiates the two-phase commit process. In the first commit phase of the transaction, the buffered row writes and associated lock metadata are atomically written to the database using row-level transactions provided by the underlying database system (e.g., HBase, Bigtable, etc.). The lock metadata is generated and used by the transaction service for detecting conflicts (e.g., write-write conflicts) between different transactions. In the second commit phase, assuming no other transactions conflict with the current transaction, the transaction service commits the current transaction by atomically modifying the lock metadata in the database for the current transaction using a row-level transaction provided by the underlying database system.
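The two commit phases described above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not the protocol of any specific transaction service: ToyTable stands in for the underlying database, each per-row dictionary update stands in for a row-level atomic write, and the names ToyTable, TransactionService, prewrite, and finalize are hypothetical. Conflict detection is reduced to checking whether another transaction already holds a lock on the row.

```python
class ToyTable:
    """Stand-in for the underlying database: one dict per row (illustrative)."""

    def __init__(self):
        self.rows = {}

    def row(self, key):
        # Each row holds the committed value, a pending value, and lock metadata
        return self.rows.setdefault(
            key, {"committed": None, "pending": None, "lock": None}
        )


class TransactionService:
    """Hypothetical transaction layer running on top of ToyTable."""

    def __init__(self, table):
        self.table = table
        self._next_id = 1

    def begin(self):
        txn = {"id": self._next_id, "buffer": {}, "locked": []}
        self._next_id += 1
        return txn

    def write(self, txn, key, value):
        # Row writes are buffered until the client commits
        txn["buffer"][key] = value

    def prewrite(self, txn):
        # Phase 1: write each buffered value together with its lock metadata.
        for key, value in txn["buffer"].items():
            row = self.table.row(key)
            if row["lock"] is not None:
                # Another transaction holds the lock: undo and report conflict
                self._rollback(txn)
                return False
            row["pending"] = value
            row["lock"] = txn["id"]
            txn["locked"].append(key)
        return True

    def finalize(self, txn):
        # Phase 2: commit by modifying the lock metadata written in phase 1.
        for key in txn["locked"]:
            row = self.table.row(key)
            row["committed"] = row["pending"]
            row["pending"] = None
            row["lock"] = None
        txn["locked"] = []

    def commit(self, txn):
        if not self.prewrite(txn):
            return False
        self.finalize(txn)
        return True

    def _rollback(self, txn):
        # Release any locks this transaction acquired during phase 1
        for key in txn["locked"]:
            row = self.table.row(key)
            row["pending"] = None
            row["lock"] = None
        txn["locked"] = []
```

Because the lock written in phase 1 is the only record that a transaction is in flight, losing it between prewrite and finalize would let a second transaction pass the conflict check on the same row. That is the reason such services persist lock metadata.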
Lock metadata of current transaction services may be stored in the database in non-volatile memories where it can persist in the case of a system failure (e.g., power outage). If lock metadata were to disappear between the two phases of commit, the transaction service might mistakenly commit two transactions that should have conflicted. In current systems, row writes during the first commit phase typically require a volatile-memory to non-volatile-memory synchronization operation to ensure that associated lock metadata is actually persisted (i.e., stored in non-volatile memory) rather than just being stored in a volatile-memory-based write cache or other volatile memory where the metadata could be lost in the event of a failure. Volatile-memory to non-volatile-memory synchronization operations often require physical movement of mechanical components (e.g., disk platters, read/write heads, etc.) of non-volatile storage device(s), making these synchronization operations much slower than volatile-memory-only synchronization operations. As a result, the requirement of current transaction services that lock metadata be persisted in the database, in addition to increasing the size of the database, can increase the latency of transaction commit operations, perhaps to the point that it is intolerable for some types of database tasks such as, for example, some online transaction processing tasks. This increased latency can be mitigated by increasing parallelism of the system at the expense of additional computer systems and associated management overhead. However, some users of large-scale distributed database systems may want support for multi-row transactions without having to incur additional expenses for scaling current systems to provide lower-latency commits.
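As a rough illustration of the synchronization cost discussed above, the following sketch writes lock metadata to a file and optionally forces it to the storage device. The function name persist_lock_metadata, the JSON file layout, and the durable flag are assumptions made for illustration; real database systems typically append to a write-ahead log rather than write one file per lock. The calls used (file flush and os.fsync) are standard Python, although the exact durability guarantee of os.fsync depends on the operating system and the storage device.

```python
import json
import os


def persist_lock_metadata(path, metadata, durable=True):
    """Write lock metadata to `path` (hypothetical layout: one JSON file).

    With durable=False the data may sit only in volatile caches and could
    be lost on power failure; with durable=True the call blocks until the
    operating system reports the data as written to the storage device.
    """
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f)
        # Move data from the process's buffer to the OS (still volatile)
        f.flush()
        if durable:
            # Volatile-to-non-volatile synchronization: typically the slow step
            os.fsync(f.fileno())
```

Timing this function with durable=True against durable=False on a given machine would give a rough sense of the per-write latency that the synchronization step adds to the first commit phase.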
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.