The invention relates to computer systems, and more particularly to a method and mechanism for deadlock resolution.
Entities often need to access one or more resources to perform work in a computing system. Examples of such entities include processes, tasks, and threads. In modern computing and database systems, at any moment in time, there may be an extremely large number of concurrent entities that seek to access the known resources in a system. However, conflicts may arise if multiple entities are permitted to perform incompatible accesses to the same resources at the same time. For example, if two entities are permitted to write to the same piece of data at the same time, then possible errors or coherency uncertainties could arise with respect to the status or content of that piece of data. The issue of coherency and access conflicts becomes even more complex in a multi-instance database system that allows common access to a single database across multiple networked nodes, such as occurs with the Real Application Clusters (RAC) product available from Oracle Corporation of Redwood Shores, Calif.
To prevent such conflicts from occurring in a computing system, various mechanisms may be implemented to manage the type, number, and/or ordering of accesses that are permitted to resources in the system. A common mechanism that is used to synchronize and manage access to resources in computing and database systems is referred to as a “lock”. A lock is a data structure that indicates whether or which particular entities have been granted rights to a resource. An entity must acquire a lock to a resource before the entity is permitted to access the resource.
The scope of possessory or access rights granted to an entity for a particular resource is often related to the type of work that the entity intends to perform upon that resource. For example, an “exclusive lock” could be granted to an entity that seeks to access a data item in a way that is incompatible with concurrent access by other entities, e.g., to modify, write or delete the data item. The exclusive lock therefore grants exclusive access to the data item, which prevents other entities from concurrently accessing the same data item. This type of lock essentially serializes access to its corresponding resource. A “shared lock” could be granted if an entity wishes to perform activities upon a resource that can be performed concurrently with activities by other entities upon the same resource without introducing conflicts or inconsistencies to the data, e.g., to read a data item. Therefore, the shared lock can be granted concurrently to multiple entities for the same resource. Depending upon the exact configuration of the computing or database system, other types of locks and lock scopes can be implemented to manage access to data.
The combination of locks granted for a resource is generally managed to avoid allowing incompatible activities upon that resource. For example, if an exclusive lock has been granted to a first entity for a data item, then no other lock requests are normally granted to that same data item until the first entity has completed its work and released the exclusive lock. All other lock requests, and their corresponding data access activities, are placed on hold until the lock requests are granted. If a shared lock has been granted to one or more entities for a data item, then subsequent requests for a shared lock upon the same data item can be concurrently granted. However, a subsequent request for an exclusive lock will be placed on hold until the previously granted shared locks have been released.
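The grant rules described above for shared and exclusive locks can be illustrated with a minimal sketch. The names LockMode and is_compatible are hypothetical and chosen only for illustration; the sketch assumes just the two lock types discussed here and ignores any other lock types or scopes that a particular system may implement.

```python
from enum import Enum


class LockMode(Enum):
    """Illustrative lock modes; only the two discussed in the text."""
    SHARED = "S"
    EXCLUSIVE = "X"


def is_compatible(requested, granted_modes):
    """Return True if `requested` may be granted alongside `granted_modes`.

    Assumed rules, taken from the description above:
      - any request is grantable when no lock is currently granted;
      - a shared request is grantable only if every granted lock is shared;
      - an exclusive request is never grantable while any lock is granted.
    """
    if not granted_modes:
        return True
    if requested is LockMode.EXCLUSIVE:
        return False
    return all(mode is LockMode.SHARED for mode in granted_modes)
```

Under these assumptions, a second shared request on a data item that already has shared owners is grantable, while an exclusive request on the same item is placed on hold until the granted locks are released.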
The occurrence of a “deadlock” is a problem that could significantly affect the orderly granting and releasing of locks, and therefore the orderly access of resources, within a computing system. A deadlock occurs within a set of entities when each entity in the set is waiting for the release of at least one resource owned by another entity in the set.
For an example of a deadlock, consider the resource management situation shown in FIG. 1a. This figure shows an example approach for implementing locks in a computing system, in which every resource (e.g., in a database cache) is associated with a lock structure having both a request queue and a grant list to identify “waiters” and “owners” for that resource. As shown in FIG. 1a, a first lock structure 100 is associated with a first resource R1. A second lock structure 101 is associated with a second resource R2. Each lock structure corresponds to a granted lock list and a lock request queue. Thus, lock structure 100 is associated with a lock grant list 102 that identifies that an entity P1 presently owns a shared lock 102a to resource R1. Lock structure 100 is also associated with a lock request queue 104 that contains a first request 104a for an exclusive lock for an entity P2 and a second lock request 104b for a shared lock for entity P3. For resource R2, lock structure 101 is associated with a lock grant list 106 that identifies that an exclusive lock 106a has already been granted to entity P3. Lock structure 101 is also associated with a lock request queue 108 containing a lock request 108a from entity P1 for a shared lock to resource R2.
Entity P1 already holds a shared lock 102a to resource R1, but needs to acquire a shared lock to resource R2 before it can complete its work. In this situation, it is assumed that P1 will not normally release its lock to R1 until it has completed its work (e.g., until P1 has been able to also access resource R2). However, P1 is unable to immediately acquire a shared lock to R2 since entity P3 already holds an exclusive lock 106a to resource R2. Therefore, P1 needs to wait until P3 releases its exclusive lock 106a to R2 before P1 can acquire its desired lock to R2.
To complete its work and release its exclusive lock 106a to resource R2, P3 needs to access resource R1, as indicated by its request 104b to acquire a shared lock. The lock request queue 104 contains a prior lock request 104a from entity P2 to acquire an exclusive lock to resource R1. The prior lock request 104a for an exclusive lock cannot be granted since entity P1 already holds a shared lock 102a to R1. However, P1 will not release its shared lock 102a until it has been granted its lock request 108a and given access to R2.
A deadlock situation exists since lock request 108a cannot be granted until P3 releases its exclusive lock 106a to R2. However, P3 will not release its exclusive lock 106a until it completes its work, which requires lock request 104b to be granted. Lock request 104b cannot be granted since it is blocked behind lock request 104a in lock request queue 104, and lock request 104a cannot be granted until P1 releases its lock 102a to R1. Coming back to the beginning of this circular deadlock, P1 cannot release its lock 102a to R1 until lock request 108a has been granted. Because each of P1, P2, and P3 is waiting for a lock request to be granted before it can complete its work, and none of those requests can be granted until another entity in the set releases a resource or moves out of the way, the entities are deadlocked. This deadlock is symbolically shown in FIG. 1b, in which “P1->P3” means that entity P1 is being blocked by entity P3. In this deadlock situation, P1 is being blocked by the exclusive lock 106a held by P3. P3 is being blocked by the lock request 104a for an exclusive lock by P2. P2 is blocked by the shared lock 102a owned by P1.
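As a minimal sketch, the lock structures of FIG. 1a and the blocking relationships of FIG. 1b could be modeled as follows. The class LockStructure and the functions conflicts, blocked_by, and fig_1a are hypothetical names introduced only for illustration. The sketch assumes a strictly first-in-first-out request queue, so that a request that is compatible with the current owners is still blocked by the request immediately ahead of it.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LockStructure:
    """Illustrative lock structure: a grant list plus a FIFO request queue."""
    resource: str
    grant_list: list = field(default_factory=list)        # [(entity, mode)]
    request_queue: deque = field(default_factory=deque)   # [(entity, mode)]


def conflicts(mode_a, mode_b):
    """Two modes conflict if either is exclusive ("X"); "S" is shared."""
    return mode_a == "X" or mode_b == "X"


def blocked_by(lock):
    """Return (waiter, blocker) pairs for the waiters of one lock structure."""
    edges = []
    for i, (waiter, mode) in enumerate(lock.request_queue):
        owners = [e for e, held in lock.grant_list if conflicts(mode, held)]
        if owners:
            # Blocked by owners holding an incompatible lock.
            edges.extend((waiter, owner) for owner in owners)
        elif i > 0:
            # Compatible with the owners, but stuck behind an earlier request.
            edges.append((waiter, lock.request_queue[i - 1][0]))
    return edges


def fig_1a():
    """Build the two lock structures described for FIG. 1a."""
    r1 = LockStructure("R1", [("P1", "S")], deque([("P2", "X"), ("P3", "S")]))
    r2 = LockStructure("R2", [("P3", "X")], deque([("P1", "S")]))
    return r1, r2
```

Applying blocked_by to both structures yields the pairs (P2, P1), (P3, P2), and (P1, P3), which correspond to the three blocking relationships described for FIG. 1b.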
Various detection and resolution techniques have been developed to address deadlock situations. For example, many deadlock handlers employ the “cycle” or “time out” techniques to detect deadlocks. In these approaches, after a process has waited a threshold period of time for a resource, either a deadlock is presumed to exist (the “time out” technique) or a wait-for graph is generated and examined for cycles (the “cycle” technique). If any cycles are identified or if the threshold time is exceeded, then a possible deadlock has been detected. At this point, a deadlock resolution technique could be applied to eliminate the deadlock, e.g., by timing out or “resetting” some or all of the resources, locks, and/or entities in the system.
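The “cycle” technique can be illustrated with a minimal sketch that searches a wait-for graph for a cycle using a depth-first traversal. The function name find_cycle and the dictionary representation of the graph are assumptions made for illustration; they are not drawn from any particular deadlock handler.

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for graph as a list of entities, or None.

    `wait_for` maps each entity to the entities it is waiting on, e.g.
    {"P1": ["P3"]} means P1 is blocked by P3.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []  # entities on the current depth-first path

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in wait_for.get(node, ()):
            nxt_state = state.get(nxt, UNVISITED)
            if nxt_state == IN_PROGRESS:
                # Reached an entity already on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if nxt_state == UNVISITED:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(wait_for):
        if state.get(node, UNVISITED) == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For the wait-for relationships of FIG. 1b, find_cycle({"P1": ["P3"], "P3": ["P2"], "P2": ["P1"]}) returns ["P1", "P3", "P2", "P1"], while a graph with no circular wait returns None.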
However, existing deadlock detection and resolution techniques cannot adequately resolve deadlocks that occur across different classes of locks/resources and/or in a clustered data environment for database systems. In these situations, access to the different classes of resources may be managed by unconnected or orthogonal lock spaces/lock management structures. Consider a database system that has different classes of resources. A first example class of resources (referred to herein as “row cache” data) may be system/database metadata, which is data that describes, defines, or manages the fundamental structures and data types used to store and access data in the database, e.g., definitional data that defines the configuration of tables in a database. A first set of lock structures may be used to manage access to row cache data. A second example class (referred to herein as “buffer cache” data) may be the actual data that is stored in the structures of the database, e.g., data stored in database tables. A second set of lock structures/lock spaces may be used to manage access to the buffer cache data. Since the two lock spaces are generally unrelated, a conventional deadlock handling mechanism does not have the background knowledge of the locks and/or resources in the different lock spaces to even detect the deadlock, much less coordinate the locks across the different lock spaces to resolve the deadlock. The problem is further exacerbated in clustered environments in which a single database can be “virtually” spread across multiple nodes that are networked together. In this environment, lock management structures on the distributed nodes may be employed to manage the resource locks. Spreading data and lock management structures across multiple nodes makes it even more difficult for conventional deadlock resolution techniques to identify and resolve deadlocks.
Accordingly, the present invention provides a method and system for using a requeueing procedure to resolve deadlocks in a computing system. In one embodiment of the invention, a request for a resource may be requeued after a designated period of time or number of wait cycles if it is blocked from being granted. For example, in one embodiment, a request for exclusive ownership of a resource could be requeued if it cannot be granted within an appropriate period of time. With lock requeueing, the requests for locks associated with the resources are requeued to allow other requests for the same resource to move ahead in the wait queue. This allows other grantable requests behind the blocked request to be immediately granted. Using this approach, it is possible that allowing the other requests behind the timed-out request to move ahead in the queue will set off a chain reaction of accesses to resources that clears the deadlock situation that initially caused the requeued request(s) to be blocked. Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.
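To illustrate how requeueing could set off such a chain reaction, the following minimal sketch applies it to the example of FIG. 1a. It is a simplified, single-node simulation and not the claimed implementation. The class Lock, its methods, and the function resolve_fig_1a are hypothetical names, and the sketch assumes that the timed-out request is simply moved to the tail of its queue and that an entity releases all of its locks once it holds every lock it requested.

```python
from collections import deque


def conflicts(mode_a, mode_b):
    """Two modes conflict if either is exclusive ("X"); "S" is shared."""
    return mode_a == "X" or mode_b == "X"


class Lock:
    """Illustrative lock structure with a grant list and a FIFO request queue."""

    def __init__(self, name):
        self.name = name
        self.granted = []     # [(entity, mode)]
        self.queue = deque()  # [(entity, mode)]

    def grant_from_head(self):
        """Grant requests in FIFO order, stopping at the first blocked one."""
        newly_granted = []
        while self.queue:
            entity, mode = self.queue[0]
            if any(conflicts(mode, held) for _, held in self.granted):
                break
            self.granted.append(self.queue.popleft())
            newly_granted.append(entity)
        return newly_granted

    def requeue_head(self):
        """Move the blocked head request to the tail of the queue."""
        self.queue.rotate(-1)

    def release(self, entity):
        """Drop every lock that `entity` holds on this resource."""
        self.granted = [(e, m) for e, m in self.granted if e != entity]


def resolve_fig_1a():
    """Replay FIG. 1a and return the order in which waiters are granted."""
    r1, r2 = Lock("R1"), Lock("R2")
    r1.granted.append(("P1", "S"))
    r1.queue.extend([("P2", "X"), ("P3", "S")])
    r2.granted.append(("P3", "X"))
    r2.queue.append(("P1", "S"))

    # Deadlocked: no queued request can be granted on either resource.
    assert not r1.grant_from_head() and not r2.grant_from_head()

    order = []
    r1.requeue_head()               # P2's exclusive request is requeued
    order += r1.grant_from_head()   # P3's shared request is now grantable
    for lock in (r1, r2):
        lock.release("P3")          # P3 completes its work
    order += r2.grant_from_head()   # P1 acquires its shared lock to R2
    for lock in (r1, r2):
        lock.release("P1")          # P1 completes its work
    order += r1.grant_from_head()   # P2 finally acquires its exclusive lock
    return order
```

In this simulation, moving P2's request behind P3's request lets P3 acquire its shared lock to R1, after which P3, P1, and P2 are granted their locks in that order.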