A parallel database is a database that runs on more than one CPU (central processing unit). There are two kinds of parallel database systems. One kind is a single-node parallel database, which runs on a single symmetric multiprocessor (SMP). On an SMP, all CPUs share memory and disk. The other kind is a multi-node parallel database, which runs on multiple nodes that do not share memory. Each node in a multi-node parallel database system can be an SMP or a single CPU. Unlike distributed databases, both single-node and multi-node parallel databases provide a single database image.
Single-node parallel databases typically scale to the number of CPUs supported by a single SMP machine. Today, these kinds of databases are widely used and supported by many vendors. SMPs generally can support up to a few dozen CPUs because of the limited capacity on a single SMP bus.
Multi-node parallel databases are more fault-tolerant than single-node parallel databases. If one node dies, surviving nodes can keep the database available. Multi-node parallel databases are also more scalable, because the number of CPUs is not subject to the limitation of a single SMP bus. Achieving better scalability in multi-node parallel database systems is a challenge, however, because sending messages between nodes is more expensive than referencing shared memory on an SMP machine.
One of the difficulties is generating timestamps to order events within a multi-node parallel database. A multi-node parallel database uses timestamps to track the sequence of changes made by different nodes to shared resources. For example, if two transactions change a common dictionary, the transaction with the more recent timestamp is the one whose change occurred after that of the transaction with the less recent timestamp. During normal processing, the ordering of events is used to maintain consistency. During recovery, the ordering is used to order redo records in a recovery log.
The problem of tracking the sequence of completed transactions is generally solved by marking every completed transaction with a consistently increasing serial number at the time the transaction completes. Later transactions will have serial numbers greater than those of earlier transactions, allowing transactions to be well-ordered. These serial numbers are often called timestamps, because they indicate when an event occurred within a computer system relative to other events.
Using a separate hardware clock on each node to generate these timestamps, however, is problematic because physical devices are imperfect. Each local clock may be set to a different time, and some clocks may run faster than others. It is quite possible for the clock of one node to be running fast, and, as a result, its timestamps would have a greater value than those of another node, even though the timestamps of the other node were generated at the same physical time.
One way to avoid the problem of synchronizing the local clocks is to use a single global hardware clock connected to every node in the multi-node parallel database system. However, in a database with many nodes, a single hardware clock requires custom-built hardware, adds cost, and limits the scalability of the entire system. Furthermore, many hardware systems today do not have such a global clock.
Another approach is to recognize that all the nodes in a multi-node parallel database system already communicate with one another by sending messages over the network. Thus, one node, called a global clock service, can be assigned the task of running a clock. When another node needs a timestamp, it sends a message to the global clock service requesting one. Upon receipt of such a message, the global clock service generates a timestamp, either by reading its hardware clock or, more easily, by incrementing a software-based serial number counter, and sends the timestamp to the requester.
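The counter-based variant of such a service can be sketched as follows. This is only an illustrative sketch, not an implementation described in the text: the class name `GlobalClockService` and the method `next_timestamp` are hypothetical, and a direct method call stands in for the request and reply messages that would travel over the network.

```python
import threading


class GlobalClockService:
    """Hypothetical timestamp service backed by a software serial number counter."""

    def __init__(self):
        self._counter = 0
        # The lock serializes concurrent requests, mirroring the single
        # point of timestamp generation.
        self._lock = threading.Lock()

    def next_timestamp(self):
        # In a real system each call would be one network round trip
        # from the requesting node (an assumption of this sketch).
        with self._lock:
            self._counter += 1
            return self._counter


service = GlobalClockService()
first = service.next_timestamp()
second = service.next_timestamp()
```

Because every request passes through the same counter, the timestamps handed out are strictly increasing regardless of which node asked for them.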
This approach works and is in common use, but it has a substantial drawback. As the system gets larger, more nodes must communicate with the global clock service. As a result, more time is spent waiting for the global clock service to handle all the requests. Certain improvements to the global clock service approach can be made. However, the basic design is ultimately limited by the single global point of timestamp generation, which can become a performance bottleneck for the entire system.
A method to improve the performance of timestamp generation by avoiding the single point of timestamp generation has been discussed in a classic article by L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System", 21 COMMUNICATIONS OF THE A.C.M. 558 (July 1978), incorporated herein by reference. In general, Lamport discloses a way to generate timestamps using local clocks, such as local software-based counters, that nevertheless remain synchronized. According to Lamport's technique, every message sent between nodes bears a timestamp that indicates the current time of the sender's local clock. When a node receives a "piggybacked" timestamp from another node whose clock is running fast, the receiving node resynchronizes its local clock forward to the faster time. This procedure ensures a partial ordering of events in the distributed system. That is, all causes will have a lower timestamp than their effects. This is true because each transaction carries with it the most recent timestamp it has seen so far. By the time the timestamp is generated for the completed transaction, the timestamp will have a greater value than those of any of the prior transactions in the chain of messages.
In a distributed system that uses Lamport's method of synchronizing the clocks associated with its nodes, each node must piggyback a timestamp on every message it sends to another node. With reference to FIG. 2, when a node is about to send a message to another node, the sending node reads a timestamp from its associated clock (step 210) and piggybacks the timestamp onto the message (step 220). At this point, the message may be sent to the other node (step 230).
When a message containing a piggybacked timestamp TS₂ is received by a node (step 310), the node performs the steps shown in FIG. 3. First, the node inspects the clock associated with the node to determine a local time TS₁ (step 320). Then, the node compares TS₁ and TS₂ (step 330). If TS₂ indicates a more recent time than TS₁ (step 340), then execution proceeds to step 350; otherwise the process terminates. In step 350, the node sets the time of its clock to be at least that of the timestamp. A simple way is to set the clock to a time equal to the timestamp.
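The send and receive steps of FIG. 2 and FIG. 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the class `LamportNode`, its method names, and the dictionary used as a message are all hypothetical, and the `local_event` method, which advances the software counter by one, is an assumption added so that the clock moves between events.

```python
class LamportNode:
    """Hypothetical node holding a software-based local clock."""

    def __init__(self, name, start=0):
        self.name = name
        self.clock = start

    def local_event(self):
        # Assumption of this sketch: the counter advances by one per event.
        self.clock += 1
        return self.clock

    def send(self, payload):
        ts = self.clock                               # step 210: read the clock
        return {"payload": payload, "timestamp": ts}  # steps 220-230: piggyback and send

    def receive(self, message):
        ts2 = message["timestamp"]  # step 310: piggybacked timestamp
        ts1 = self.clock            # step 320: local time
        if ts2 > ts1:               # steps 330-340: compare
            self.clock = ts2        # step 350: move the clock forward


fast = LamportNode("A", start=10)
slow = LamportNode("B", start=3)

slow.receive(fast.send("update"))
clock_after_sync = slow.clock            # 10: resynchronized forward
next_local = slow.local_event()          # 11: later than the sender's timestamp

slow.receive({"payload": "old", "timestamp": 2})
clock_after_stale = slow.clock           # 11: an older timestamp changes nothing
```

The receiving node only ever moves its clock forward, so a message bearing an older timestamp leaves the local clock unchanged.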
This Lamport approach does not, however, indicate which of two causally unconnected events happened before the other. For example, if event A on node A did not cause event B on node B, and event B on node B did not cause event A on node A, then the timestamps assigned to events A and B by nodes A and B, respectively, will not necessarily reflect the actual sequence of events A and B.
To reflect the sequence of causally unrelated events, a total ordering is necessary. Any one of a set of total orderings can easily be derived from a partial ordering through simple arbitrary rules, such as granting certain nodes automatic priority for causally unconnected events. Although any derivable total ordering is sufficient to maintain the consistency of the concurrent database, users may have their own ideas about which causally unconnected event occurs before another. When the users and the database disagree, anomalous behavior results. This is a problem for multi-version databases.
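One such arbitrary rule can be sketched as follows: events are ordered by timestamp, and ties are broken by a fixed node priority. The event records, the field names, and the choice of the node number as the priority are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical events, each stamped by the local clock of the node where it occurred.
events = [
    {"name": "A", "node": 2, "timestamp": 5},
    {"name": "B", "node": 1, "timestamp": 5},
    {"name": "C", "node": 2, "timestamp": 3},
]


def total_order(events):
    # Order by timestamp first; when timestamps are equal, the node with
    # the lower number is arbitrarily given priority.
    return sorted(events, key=lambda e: (e["timestamp"], e["node"]))


ordered_names = [e["name"] for e in total_order(events)]
# ordered_names == ["C", "B", "A"]
```

The resulting order is consistent, but the placement of B before A reflects only the tie-breaking rule and not the real-time sequence of the two events.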
A multi-version database is a database that manages concurrency control via versions and snapshots. A multi-version database stamps each version of data with a logical timestamp. When a process initiates a transaction to read an item of data from a multi-version database, the process generally does not obtain a lock to prevent other processes from concurrently modifying the data. Instead, the process reads a snapshot of the data at a particular point in time, determined by a timestamp generated at the beginning of the transaction. Consequently, the process might read information that is slightly older than the most current version, but the information is guaranteed to be consistent.
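A snapshot read of this kind can be sketched as follows. This is an illustrative sketch only: the class `MultiVersionStore` and its `write` and `read` methods are hypothetical, and real multi-version databases manage versions in far more elaborate ways.

```python
class MultiVersionStore:
    """Hypothetical store that keeps every version of each item with its timestamp."""

    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value) pairs

    def write(self, key, value, timestamp):
        # Each write adds a new version instead of overwriting the old one.
        self._versions.setdefault(key, []).append((timestamp, value))

    def read(self, key, snapshot_ts):
        # Return the newest version stamped at or before the snapshot time;
        # no lock is taken, so writers are never blocked by readers.
        visible = [(ts, v) for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        if not visible:
            return None
        return max(visible, key=lambda pair: pair[0])[1]


store = MultiVersionStore()
store.write("balance", 100, timestamp=5)
store.write("balance", 150, timestamp=9)

older_view = store.read("balance", snapshot_ts=7)    # 100
current_view = store.read("balance", snapshot_ts=9)  # 150
```

A reader whose snapshot time is 7 sees the version written at time 5, which is slightly out of date but internally consistent.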
For example, consider a distributed database using Lamport's technique that implements a checking account. Suppose a husband makes a deposit in the checking account at his node and telephones his wife that the money is there. She then queries the checking account at her node to see how much money is there. As far as the database is concerned, these events are causally unconnected, and it has no way of knowing that the snapshot time for the wife's transaction should be more recent than the timestamp for the husband's transaction. Technically, the husband's telephone call violated a specification of Lamport's approach, because the call did not piggyback a timestamp to the wife's node. If the snapshot time of the wife's balance inquiry is less recent than the timestamp of the husband's deposit, she would not see the money deposited into the account, even though her husband had deposited it earlier in real time and told her about it. It is clearly desirable to reduce the amount of this kind of anomalous behavior in a database system.
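The anomaly can be illustrated with a small sketch. All of the values here are hypothetical: the clock readings, the balances, and the helper `balance_at` are chosen only to show how a snapshot time that lags the deposit's timestamp hides the deposit.

```python
# Versions of the account balance as (timestamp, balance) pairs.
versions = [(5, 100)]

husband_clock = 20  # the husband's node is running ahead
wife_clock = 12     # the wife's node is running behind

# The husband deposits 50; his node stamps the new version with its local time.
husband_clock += 1
versions.append((husband_clock, 150))


def balance_at(snapshot_ts):
    # Newest balance stamped at or before the snapshot time.
    visible = [(ts, bal) for ts, bal in versions if ts <= snapshot_ts]
    return max(visible, key=lambda pair: pair[0])[1]


# The telephone call carries no timestamp, so the wife's clock is not advanced.
wife_sees = balance_at(wife_clock)  # 100: the deposit is invisible

# Had the call piggybacked the husband's timestamp, her clock would have moved forward.
synced_clock = max(wife_clock, husband_clock)
wife_would_see = balance_at(synced_clock)  # 150
```

The deposit is missed solely because the communication between the two users happened outside the database and therefore carried no timestamp.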
Lamport recognized this problem and proposed to address it by mandating that every node keep a sufficiently accurate physical clock. This scheme is difficult to implement for database systems, because physical clocks are not reliable. Physical clocks run at different rates, they may be changed by an external user, and they require periodic resynchronization.
Another drawback of Lamport's method is that it is not fault-tolerant. In a multi-node parallel database, different nodes may share data stored on a non-volatile memory, such as a disk. Thus the disk becomes another medium through which anomalous behavior may occur. Typically, multiple nodes synchronize their write operations to disk with distributed locks. A node writing a block of data to disk obtains an exclusive lock for the disk block, while a node reading a disk block obtains a shared lock. Thus one can respect the causalities propagated via the disk by piggybacking timestamps in the lock messages according to the Lamport technique. However, this scheme only works if all nodes are alive. When a node dies, it may have advanced its local time far ahead of the others and written that timestamp to disk. But this high local timestamp cannot be piggybacked to other nodes to propagate the causality. Consequently, when the surviving nodes read the data on disk (e.g., as part of recovery), they may unexpectedly encounter data in the future of their local time, violating the causality that the Lamport technique guarantees.