One of the long-standing challenges in computing is the detection of deadlocks. A deadlock occurs if a set of entities exists such that each entity in the set is waiting for the release of at least one resource owned by another entity in the set. Entities capable of owning a resource are referred to herein as possessory entities. In the context of a database system, for example, possessory entities include processes and transactions. A transaction is an atomic unit of work.
For example, a transaction T1 may seek exclusive ownership of resources R1 and R2. If R1 is available and R2 is currently exclusively owned by another transaction T2, transaction T1 may acquire exclusive ownership of R1 but must wait for R2 to become available. A deadlock will occur if transaction T2 seeks ownership of R1, and T2 is suspended to wait for R1 without releasing R2. Because both T1 and T2 are waiting for each other, they are deadlocked.
Computer systems employ a variety of deadlock handling mechanisms (deadlock handlers) that detect deadlocks. Many deadlock handlers employ the “cycle” technique to detect deadlocks. In the cycle technique, after a process waits a threshold period of time for a resource, a wait-for graph is generated and examined for any cycles. If any cycles are identified, then the deadlock handler has identified a potential deadlock.
A wait-for graph is a graph that includes vertices that represent resources (“resource vertices”) and vertices that represent possessory entities (“entity vertices”). An arc from an entity vertex to a resource vertex represents that the possessory entity represented by the entity vertex is waiting for ownership of the resource. An arc from a resource vertex to an entity vertex represents that the resource represented by the resource vertex is owned by the possessory entity represented by the entity vertex. A cycle is detected when a chain of arcs leads both to and from the same vertex.
FIG. 1 shows an exemplary wait-for graph 100, which includes entity vertices 111 and 112 and resource vertices 121 and 122. Wait-for graph 100 was generated when a deadlock handler detected that the process represented by entity vertex 111 had waited a threshold period of time for the resource represented by resource vertex 121. Arc 131 represents that the process represented by entity vertex 111 is requesting ownership of the resource represented by resource vertex 121. Arc 132 represents that the resource represented by resource vertex 121 is owned by the process represented by entity vertex 112. Arc 133 represents that the process represented by entity vertex 112 has requested the resource represented by resource vertex 122. Arc 134 represents that the resource represented by resource vertex 122 is owned by the process represented by entity vertex 111. Arcs 131, 132, 133, and 134 form a loop that both extends from and leads to entity vertex 111, and thus represent a cycle. The processes represented by the entity vertices of wait-for graph 100 are therefore potentially deadlocked.
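The cycle check described for wait-for graph 100 can be illustrated with a minimal Python sketch. It is not taken from any particular deadlock handler: the adjacency-list representation, the "E"/"R" vertex prefixes, and the function name find_cycle are assumptions made only for illustration. The arcs are those described for FIG. 1.

```python
from typing import Dict, List, Optional

# Wait-for graph 100 of FIG. 1 as an adjacency list. The "E" (entity) and
# "R" (resource) prefixes are an illustrative naming convention only.
WAIT_FOR_GRAPH_100: Dict[str, List[str]] = {
    "E111": ["R121"],  # arc 131: entity 111 waits for resource 121
    "R121": ["E112"],  # arc 132: resource 121 is owned by entity 112
    "E112": ["R122"],  # arc 133: entity 112 waits for resource 122
    "R122": ["E111"],  # arc 134: resource 122 is owned by entity 111
}


def find_cycle(graph: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Return a chain of vertices that leads from `start` back to `start`,
    or None if no such chain exists (depth-first search)."""
    path: List[str] = []
    visited = set()

    def visit(vertex: str) -> bool:
        path.append(vertex)
        for successor in graph.get(vertex, []):
            if successor == start:
                path.append(start)  # the chain closes: a cycle
                return True
            if successor not in visited:
                visited.add(successor)
                if visit(successor):
                    return True
        path.pop()  # dead end: this vertex cannot reach `start`
        return False

    return path if visit(start) else None


if __name__ == "__main__":
    print(find_cycle(WAIT_FOR_GRAPH_100, "E111"))
```

Starting the search at entity vertex 111, the waiting entity that triggered generation of the graph, the sketch returns the chain E111, R121, E112, R122, E111, which corresponds to arcs 131 through 134. If arc 134 is removed, no chain leads back to entity vertex 111 and the function returns None.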
In a distributed computer system, the resources and entities involved in deadlocks may be distributed among many nodes. Thus, in the example given above, transaction T1 may reside on one node, while transaction T2 resides on another node. Detecting deadlocks on distributed computer systems may involve generating “distributed wait-for graphs”. Distributed wait-for graphs are wait-for graphs that include entity vertices for entities that may be from many nodes.
Typically, the nodes that are executing the entities that may be involved in a deadlock cooperate with each other to generate the distributed wait-for graph, each node producing the portion of the distributed wait-for graph that covers the node's respective entities. A process such as the deadlock handler is responsible for dividing the task of generating the distributed wait-for graph among the nodes of the set. Thus, generating distributed wait-for graphs involves identifying which nodes may be executing entities involved in a possible deadlock.
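As a rough illustration of how per-node portions combine into a distributed wait-for graph, the following Python sketch merges two partial graphs for the T1/T2 example given earlier. It rests on stated assumptions rather than on any described implementation: each node is assumed to report the arcs for the entities it executes (what each entity waits for and what it owns), and the names merge_partial_graphs, has_cycle, NODE_A_PORTION, and NODE_B_PORTION are hypothetical.

```python
from typing import Dict, Iterable, List

Graph = Dict[str, List[str]]

# Assumed portions: node A executes T1 (owns R1, waits for R2);
# node B executes T2 (owns R2, waits for R1).
NODE_A_PORTION: Graph = {"T1": ["R2"], "R1": ["T1"]}
NODE_B_PORTION: Graph = {"T2": ["R1"], "R2": ["T2"]}


def merge_partial_graphs(portions: Iterable[Graph]) -> Graph:
    """Union the arcs reported by each node into one wait-for graph."""
    merged: Graph = {}
    for portion in portions:
        for vertex, successors in portion.items():
            targets = merged.setdefault(vertex, [])
            for successor in successors:
                if successor not in targets:
                    targets.append(successor)
    return merged


def has_cycle(graph: Graph) -> bool:
    """Depth-first search; an arc back to a vertex still on the stack is a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the stack, finished
    color: Dict[str, int] = {}

    def visit(vertex: str) -> bool:
        color[vertex] = GRAY
        for successor in graph.get(vertex, []):
            state = color.get(successor, WHITE)
            if state == GRAY:
                return True
            if state == WHITE and visit(successor):
                return True
        color[vertex] = BLACK
        return False

    return any(color.get(v, WHITE) == WHITE and visit(v) for v in list(graph))
```

Under these assumptions, neither portion contains a cycle on its own, while the merged graph contains the cycle T1, R2, T2, R1, T1. This is why the portions from every node that may be involved must be gathered before the graph is examined.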
To determine which nodes may be involved in a possible deadlock on a distributed computer system, a deadlock handler may query all the nodes in the distributed computer system for information that indicates whether they may be involved in a deadlock. For example, assume that a deadlock handler in a distributed database system has detected that a distributed transaction has been waiting for a resource for a threshold period of time. A distributed transaction is a transaction executed by database servers, which may reside on multiple nodes.
When a deadlock handler detects that the distributed transaction has been waiting a threshold period of time for a distributed resource, it may identify the multiple database servers involved in the distributed transaction through the broadcast query technique. In the broadcast query technique, the deadlock handler broadcasts a query to each database server in the distributed database system. The query requests information about whether the database server is involved in the distributed transaction.
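The broadcast query technique can be sketched in Python as a simple in-memory simulation. This is a hedged illustration only: the DatabaseServer class, its is_involved_in method, and the assumption of one request message and one reply message per server are hypothetical and do not reflect any particular database system's protocol.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class DatabaseServer:
    """Hypothetical stand-in for a database server in the distributed system."""
    name: str
    active_transactions: Set[str] = field(default_factory=set)

    def is_involved_in(self, transaction_id: str) -> bool:
        return transaction_id in self.active_transactions


def broadcast_query(
    servers: Iterable[DatabaseServer], transaction_id: str
) -> Tuple[List[str], int]:
    """Ask every server whether it is involved in the transaction.

    Returns the names of the involved servers and the number of messages
    exchanged, assuming one request and one reply per server.
    """
    involved: List[str] = []
    messages = 0
    for server in servers:
        messages += 2  # request to the server plus its reply
        if server.is_involved_in(transaction_id):
            involved.append(server.name)
    return involved, messages
```

With five simulated servers of which only two participate in the transaction, the sketch still exchanges ten messages, because every server is queried regardless of whether it is involved. The message count grows with the size of the system, not with the number of participants.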
Communication between database servers, especially those residing on different nodes, can involve a relatively large amount of overhead, and may substantially delay receipt by the deadlock handler of the information required to build the wait-for graph. Often, the costs in overhead and delay are so great that deadlock handlers are configured to forgo the cycle technique when attempting to detect deadlocks that may involve distributed resources. Instead, other deadlock detection techniques are used.
One common alternative to the cycle technique for detecting deadlocks is the time-out technique. Under the time-out technique, a possessory entity is presumed to be involved in a deadlock once the possessory entity waits a threshold period of time to obtain ownership of a resource. The time-out technique is less accurate in detecting deadlocks, since delays in obtaining ownership of a resource may result from many causes other than deadlock.
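The time-out technique reduces to a single comparison, as the following minimal Python sketch shows. The Wait record, its field names, and the function presumed_deadlocked are hypothetical and are introduced only for illustration; times are assumed to be in seconds.

```python
from dataclasses import dataclass


@dataclass
class Wait:
    """Hypothetical record of a possessory entity waiting for a resource."""
    entity: str
    resource: str
    started_at: float  # time the wait began, in seconds


def presumed_deadlocked(wait: Wait, now: float, threshold_seconds: float) -> bool:
    """Presume a deadlock once the wait has lasted the threshold period.

    No wait-for graph is examined, so a long wait that has some other
    cause (for example, a slow but progressing owner) is also flagged.
    """
    return now - wait.started_at >= threshold_seconds
```

Because the test considers only elapsed time, a transaction that waits 30 seconds behind a long-running but otherwise healthy owner is treated the same as one that is in a true cycle. This is the source of the inaccuracy noted above.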
Based on the foregoing, it is desirable to provide a more efficient method of generating information about which nodes may have resources that are involved in a deadlock and, more generally, about which participants may be involved in a distributed operation, such as a distributed transaction.