Each item (e.g., column) in a multiversion database is versioned and stored at a server-assigned timestamp. Old versions of an item can be read, but are typically subject to garbage-collection limits. Timestamps are causally consistent so that reads of data at old timestamps are guaranteed to reflect a causally consistent view of the database. For example, if transaction T1 completes before transaction T2 starts, then the timestamp of T1 must be less than the timestamp of T2, even if the transactions are on separate machines and do not overlap in terms of the data they access. Moreover, transaction T2 is guaranteed to “see” the effects of T1, and any transaction that “sees” T2 will also see T1.
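The versioned storage described above can be illustrated with a minimal sketch. The class name `VersionedCell` and its methods are hypothetical and are not taken from any disclosed embodiment. The sketch assumes that the server assigns strictly increasing timestamps to the versions of a single item, and that garbage collection retains the newest version at or before a given horizon.

```python
import bisect


class VersionedCell:
    """Keeps every version of one item, ordered by commit timestamp (hypothetical)."""

    def __init__(self):
        self._timestamps = []  # commit timestamps, kept sorted
        self._values = []      # values, parallel to _timestamps

    def write(self, timestamp, value):
        # Assumption: server-assigned timestamps strictly increase per item.
        if self._timestamps and timestamp <= self._timestamps[-1]:
            raise ValueError("timestamps must strictly increase")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def read(self, timestamp):
        # Return the newest version whose commit timestamp is <= the read timestamp.
        i = bisect.bisect_right(self._timestamps, timestamp)
        if i == 0:
            return None  # no version existed yet, or it was garbage collected
        return self._values[i - 1]

    def garbage_collect(self, horizon):
        # Keep the newest version at or before the horizon and everything after it,
        # so reads at or after the horizon still return the correct value.
        i = bisect.bisect_right(self._timestamps, horizon)
        keep_from = max(i - 1, 0)
        del self._timestamps[:keep_from]
        del self._values[:keep_from]
```

In this sketch, a read at an old timestamp returns the value that was current at that time, and a read below the garbage-collection horizon may find no version at all.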
When a client reads data from a multiversion database, the read can either specify a timestamp or allow the database management system to select the read timestamp within a specified bound on staleness. Selecting a timestamp within a staleness bound requires locking and/or blocking in order to prevent ambiguous staleness calculations.
Previously, multiversion databases enabled the calculation of read timestamps by tracking the last time any change was made to a row of data. However, when the database tracks only the last time each row was modified, the algorithm for selecting a read timestamp must select a timestamp that is greater than the last time that any column of the row was changed, even if the read does not access the column that changed. This artificially high lower bound on the read timestamp can slow down or block reads unnecessarily, and therefore limits concurrent access to a single row.
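The effect of row-level tracking on the lower bound can be shown with a small comparison. The function names and the dictionary of per-column last-write timestamps are hypothetical. The column-level variant is an assumption made for contrast, since this passage describes only the row-level limitation.

```python
def read_lower_bound_row_level(last_write_by_column):
    # Row-level tracking remembers only the latest write to ANY column,
    # so every read of the row must use a timestamp greater than this value.
    return max(last_write_by_column.values())


def read_lower_bound_column_level(last_write_by_column, columns_read):
    # Assumed alternative: track the last write per column, so only the
    # columns the read actually touches constrain the read timestamp.
    return max(last_write_by_column[column] for column in columns_read)


# Example: "balance" was written at 250, "name" was last written at 100.
last_write = {"name": 100, "balance": 250}
row_bound = read_lower_bound_row_level(last_write)                  # 250
col_bound = read_lower_bound_column_level(last_write, ["name"])     # 100
```

Under these assumptions, a read of `name` alone is held above timestamp 250 by row-level tracking, even though no write to `name` has occurred since timestamp 100.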
In some embodiments, a multiversion database comprises a single replica, which is typically stored at a single geographical location. In other embodiments, a multiversion database is distributed to two or more replicas, which are typically stored at two or more distinct geographic locations. When a database is distributed across multiple machines, disclosed embodiments generally utilize a distributed time system. For databases that are distributed, it is important for the timestamps to be consistent across multiple servers at multiple locations in order to provide unique event ordering. However, clocks in a distributed system are not always synchronized. One way of obtaining synchronized timestamps is with a network time protocol (NTP) service. By using an NTP service, clients may attempt to synchronize themselves by periodically querying a time master, which may respond with a timestamp packet. The time queries and responses from the time master may allow a client to estimate its clock phase and frequency error, and to adjust its clock rate accordingly.
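The phase estimate obtained from one query and response can be sketched with the conventional NTP offset and delay calculation. The function name and the choice of integer milliseconds are illustrative. The calculation assumes that the network delay is the same in both directions, an assumption that often fails in practice.

```python
def estimate_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one query/response.

    t0: client clock when the query is sent
    t1: master clock when the query is received
    t2: master clock when the response is sent
    t3: client clock when the response is received
    """
    # Offset of the master clock relative to the client clock,
    # assuming equal delay on the outbound and return paths.
    offset = ((t1 - t0) + (t2 - t3)) / 2
    # Round-trip network delay, excluding the master's processing time.
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Example in milliseconds: the client clock is 5000 ms behind the master,
# each network leg takes 100 ms, and the master spends 100 ms processing.
offset, delay = estimate_offset_and_delay(100_000, 105_100, 105_200, 100_300)
```

In the example, the estimated offset is 5000 ms and the round-trip delay is 200 ms. If the two network legs were unequal, the offset estimate would be wrong by half the difference between them.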
Several forms of uncertainty may afflict such query-based measurements. For example, delays induced by scheduling, network congestion, interrupt coalescing, routing asymmetry, system overloads, and other causes can prove to be as unpredictable as they are asymmetric. Moreover, the typical NTP notion of time synchronization may be fundamentally flawed, because no single master can be entirely trustworthy. Thus, it may be prudent for clients not to make time adjustments based solely on a single master's response.
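One way a client might avoid relying on a single master is to query several masters and discard responses that disagree with the majority. The following sketch is only an illustration of that idea and is not the method of any disclosed embodiment. The function name `robust_offset` and the `tolerance` parameter are hypothetical.

```python
from statistics import median_low


def robust_offset(offsets, tolerance):
    """Combine offset estimates from several masters (hypothetical helper).

    offsets:   one offset estimate per master, all in the same unit
    tolerance: maximum distance from the median for an estimate to be trusted
    """
    if not offsets:
        raise ValueError("need at least one master response")
    # median_low always returns an actual element of the input, so at least
    # one estimate is guaranteed to fall within the tolerance.
    mid = median_low(offsets)
    agreeing = [o for o in offsets if abs(o - mid) <= tolerance]
    return sum(agreeing) / len(agreeing)


# Example: three masters agree to within 10 ms; a fourth is badly wrong.
combined = robust_offset([5000, 5010, 4990, 90_000], tolerance=50)
```

In the example, the outlying estimate of 90,000 ms is rejected and the combined offset is 5000 ms. A median-based filter of this kind tolerates a minority of faulty masters, but not a faulty majority.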