A. Field of the Invention
This invention generally relates to data processing systems and, more particularly, to leasing for failure detection and recovery in data processing systems.
B. Description of the Related Art
Proper resource management is an important aspect of the efficient and effective use of computers. In general, resource management involves allocating resources (e.g., memory) in response to requests as well as deallocating resources at appropriate times, for example, when the requesters no longer require the resources. In general, the resources contain data referenced by computational entities (e.g., applications, programs, applets, etc.) executing in the computers.
In practice, when applications executing on computers seek to refer to resources, the computers must first allocate or designate resources so that the applications can properly refer to them. When the applications no longer refer to a resource, the computers can deallocate or reclaim the resource for reuse. In computers, each resource has a unique "handle" by which the resource can be referenced. The handle may be implemented in various ways, such as an address, array index, unique value, pointer, etc.
Resource management is relatively simple for a single computer because the events indicating when resources can be reclaimed, such as when applications no longer refer to them or after a power failure, are easy to determine. Resource management for distributed systems connecting multiple computers is more difficult because applications in several different computers may be using the same resource.
Disconnects in distributed systems can lead to the improper and premature reclamation of resources or to the failure to reclaim resources. For example, multiple applications operating on different computers in a distributed system may refer to resources located on other machines. If connections between the computers on which resources are located and the applications referring to those resources are interrupted, then the computers may reclaim the resources prematurely. Alternatively, the computers may maintain the resources in perpetuity, even though the applications have failed to access the resources for an extended period of time.
These difficulties have led to the development of systems to manage network resources, one of which is known as "distributed garbage collection." That term describes a facility provided by a language or runtime system for distributed systems that automatically manages resources used by an application or group of applications running on different computers in a network.
In general, garbage collection uses the notion that resources can be freed for future use when they are no longer referenced by any part of an application. Distributed garbage collection extends this notion to the realm of distributed computing, reclaiming resources when no application on any computer refers to them.
Distributed garbage collection must maintain integrity between allocated resources and the references to those resources. In other words, the system must not be permitted to deallocate or free a resource when an application running on any computer in the network continues to refer to that resource. This reference-to-resource binding, referred to as "referential integrity," does not guarantee that the reference will always grant access to the resource to which it refers. For example, network failures can make such access impossible. The integrity, however, guarantees that if the reference can be used to gain access to any resource, it will be the same resource to which the reference was first given.
Distributed systems using garbage collection must also reclaim resources no longer being referenced at some time in the finite future. In other words, the system must provide a guarantee against "memory leaks." A memory leak can occur when all applications drop references to a resource, but the system fails to reclaim the resource for reuse because, for example, it incorrectly determines that some application still refers to the resource.
Referential integrity failures and memory leaks often result from disconnections between applications referencing the resources and the garbage collection system managing the allocation and deallocation of those resources. For example, a disconnection in a network connection between an application referring to a resource and a garbage collection system managing that resource may prevent the garbage collection system from determining whether and when to reclaim the resource. Alternatively, the garbage collection system might mistakenly determine that, since an application has not accessed a resource within a predetermined time, it may collect that resource.
A number of techniques have been used to improve the distributed garbage collection mechanism by attempting to ensure that such mechanisms maintain referential integrity without memory leaks. One conventional approach uses a form of reference counting, in which a count is maintained of the number of applications referring to each resource. When a resource's count goes to zero, the garbage collection system may reclaim the resource. Such a reference counting scheme only works, however, if the resource is created with a corresponding reference counter. The garbage collection system in this case increments the resource's reference count as additional applications refer to the resource, and decrements the count when an application no longer refers to the resource.
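The reference counting approach described above can be illustrated with a minimal sketch. The class and method names (RefCountedResource, ReferenceCountingCollector, add_reference, drop_reference) are hypothetical and chosen only for illustration; the sketch runs in a single process and ignores the network entirely.

```python
class RefCountedResource:
    """A resource created together with its reference counter (hypothetical type)."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 0


class ReferenceCountingCollector:
    """Illustrative collector: reclaims a resource when its count reaches zero."""

    def __init__(self):
        self.resources = {}  # handle (here, a name) -> resource

    def allocate(self, name):
        self.resources[name] = RefCountedResource(name)

    def add_reference(self, name):
        # Called when an additional application refers to the resource.
        self.resources[name].ref_count += 1

    def drop_reference(self, name):
        # Called when an application no longer refers to the resource.
        # Returns True if the resource was reclaimed as a result.
        resource = self.resources[name]
        resource.ref_count -= 1
        if resource.ref_count == 0:
            del self.resources[name]
            return True
        return False
```

In this sketch the resource is reclaimed only when drop_reference is actually called for the last reference, which is the dependency on message delivery that becomes a problem in a distributed setting.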
Reference counting schemes, however, are especially prone to problems in the face of the failures that can occur in distributed systems. Such failures can take the form of a computer or application failure, or a network failure that prevents the delivery of messages notifying the garbage collection system that a resource is no longer being referenced. If messages go undelivered because of a network disconnect, the garbage collection system does not know when to reclaim the resource.
To prevent such failures, some conventional reference counting schemes include "keep-alive" messages, which are also referred to as "ping back." According to this scheme, applications in the network send messages to the garbage collection system overseeing resources and indicate that the applications can still communicate. These messages prevent the garbage collection system from dropping references to resources. Failure to receive such a "keep-alive" message indicates that the garbage collection system can decrement the reference count for a resource and, thus, when the count reaches zero, the garbage collection system may reclaim the resource. This, however, can still result in the premature reclamation of resources when reference counts reach zero because network failures prevented the delivery of "keep-alive" messages. Such reclamation violates the referential integrity requirement.
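The weakness of the keep-alive scheme can be seen in a small simulation. This is a sketch under stated assumptions: the KeepAliveCollector class, its timeout parameter, and the use of an explicit logical clock (the now argument) are all hypothetical, and the collector tracks which application sent each ping only so that it knows whose reference to drop.

```python
class KeepAliveCollector:
    """Illustrative collector that drops references whose pings stop arriving."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_ping = {}  # resource -> {application: time of last ping}

    def add_reference(self, resource, app, now):
        self.last_ping.setdefault(resource, {})[app] = now

    def ping(self, resource, app, now):
        # A delivered keep-alive message refreshes the application's reference.
        if app in self.last_ping.get(resource, {}):
            self.last_ping[resource][app] = now

    def sweep(self, now):
        # Drop silent references, then reclaim resources with none left.
        reclaimed = []
        for resource, apps in list(self.last_ping.items()):
            for app, seen in list(apps.items()):
                if now - seen > self.timeout:
                    del apps[app]  # equivalent to decrementing the count
            if not apps:
                del self.last_ping[resource]
                reclaimed.append(resource)
        return reclaimed
```

If the application still holds its reference but a network failure stops its pings from arriving, sweep reclaims the resource anyway. The collector cannot distinguish a dropped reference from a dropped message, which is the referential integrity violation described above.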
Another proposed method for resolving referential integrity problems in garbage collection systems is to maintain not only a reference count but also an identifier corresponding to each computational entity referring to a resource. See A. Birrell, et al., "Distributed Garbage Collection for Network Objects," No. 116, Digital Systems Research Center, Dec. 15, 1993. This method suffers from the same problems as the reference counting schemes. Further, this method requires the addition of unique identifiers for each computational entity referring to each resource, adding overhead that would unnecessarily increase communication within distributed systems and add storage requirements (i.e., the list of identifiers corresponding to applications referring to each resource).
In accordance with the present invention, referential integrity is guaranteed without costly memory leaks by leasing resources for a period of time during which the parties in a distributed system, for example, an application holding a reference to a resource and the garbage collection system managing that resource, agree that the resource and a reference to that resource will be guaranteed. At the end of the lease period, the guarantee that the reference to the resource will continue lapses, allowing the garbage collection system to reclaim the resource. Because the application holding the reference to the resource and the garbage collection system managing the resource agree to a finite guaranteed lease period, both can know when the lease and, therefore, the guarantee, expires. This guarantees referential integrity for the duration of a reference lease and avoids the concern of failing to free the resource because of network errors.
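As a rough illustration of the leasing idea, the following sketch grants each reference holder a finite lease and reclaims a resource only after every lease on it has lapsed. The names (LeasingCollector, request_lease, max_duration) and the logical clock passed as now are assumptions made for this example and are not taken from the invention's actual interfaces.

```python
class LeasingCollector:
    """Illustrative collector that guarantees a reference only while a lease runs."""

    def __init__(self, max_duration):
        self.max_duration = max_duration
        self.expirations = {}  # resource -> {holder: lease expiration time}

    def request_lease(self, resource, holder, requested, now):
        # The grantor may shorten the requested period; the returned expiration
        # is the period both parties agree on. Calling again renews the lease.
        expiry = now + min(requested, self.max_duration)
        self.expirations.setdefault(resource, {})[holder] = expiry
        return expiry

    def is_guaranteed(self, resource, now):
        holders = self.expirations.get(resource, {})
        return any(expiry > now for expiry in holders.values())

    def collect(self, now):
        # Reclaim every resource whose leases have all expired.
        reclaimed = []
        for resource, holders in list(self.expirations.items()):
            live = {h: e for h, e in holders.items() if e > now}
            if live:
                self.expirations[resource] = live
            else:
                del self.expirations[resource]
                reclaimed.append(resource)
        return reclaimed
```

Because the expiration time is returned to the holder, neither side needs a further message to know when the guarantee ends: a lost renewal delays nothing on the collector's side, and the holder knows not to rely on the reference past the agreed time.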
In an alternative embodiment of the present invention, the leasing technique is used for failure detection and recovery. When using a lease for failure detection, a client requests a lease from a server, and after the lease is granted, the client performs various processing with respect to a resource managed by the server. When the lease is about to expire, the client renews the lease. If for any reason this renewal fails, it is because either the server experienced an error or the communication mechanism transferring data between the client and the server experienced an error. In either case, the client has detected an error. Additionally, if the lease expires without the client renewing the lease or explicitly requesting a cancellation of the lease, the server knows that either the client or the communication mechanism experienced an error. In this case, the server has detected an error.
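The two detection paths can be sketched as follows. Everything here is a simplified assumption: LeaseServer, client_renew, and the reachable flag (which stands in for a server or communication failure) are hypothetical, and time is an explicit logical clock.

```python
class LeaseServer:
    """Illustrative server that grants, renews, and cancels leases."""

    def __init__(self):
        self.leases = {}       # client -> lease expiration time
        self.reachable = True  # False simulates a server or network failure

    def grant(self, client, duration, now):
        # Used both for the initial request and for renewals.
        if not self.reachable:
            raise ConnectionError("server or communication mechanism failed")
        self.leases[client] = now + duration
        return self.leases[client]

    def cancel(self, client):
        # Explicit cancellation is a normal end of the lease, not a failure.
        self.leases.pop(client, None)

    def detect_failed_clients(self, now):
        # A lease that expired without renewal or cancellation signals an error.
        return [c for c, expiry in self.leases.items() if expiry <= now]


def client_renew(server, client, duration, now):
    """Return True on success; False means the client has detected an error."""
    try:
        server.grant(client, duration, now)
        return True
    except ConnectionError:
        return False
```

In this sketch a failed renewal tells the client something is wrong with the server or the link, while an expired, uncancelled lease tells the server the same about the client or the link. Neither side needs to learn which of the two causes applies in order to detect the failure.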
In addition to failure detection, the alternative embodiment also provides for failure recovery. During the establishment of the lease, the client provides the server with a failure recovery routine, and likewise, the server provides the client with a failure recovery routine. Thus, upon detection of a failure, the client and the server each invoke the failure recovery routine of the other to perform failure recovery for each other. After performing failure recovery, both the client and the server then go to a prenegotiated state. That is, the client and the server, through a negotiation beforehand, have decided upon a state that they will go to upon experiencing an error, such as rolling back all changes made to the resource. As a result, both the client and the server know the state of the system after a failure and can continue processing accordingly.
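The exchange of recovery routines and the move to a prenegotiated state might look like the following sketch. The Party class, the establish_lease function, and the "rolled_back" state label are hypothetical, and the recovery routines are trivial placeholders that only return a description of themselves.

```python
class Party:
    """Illustrative lease participant (either the client or the server)."""

    def __init__(self, name):
        self.name = name
        self.state = "working"
        self.peer_recovery = None   # routine supplied by the other party
        self.fallback_state = None  # state agreed on before any failure
        self.recovery_log = []

    def make_recovery_routine(self):
        # The routine this party hands to its peer during lease establishment.
        def routine():
            return "cleanup supplied by " + self.name
        return routine

    def on_failure_detected(self):
        # Invoke the other party's routine, then enter the agreed state.
        self.recovery_log.append(self.peer_recovery())
        self.state = self.fallback_state


def establish_lease(client, server, fallback_state):
    """Exchange recovery routines and record the prenegotiated state."""
    client.peer_recovery = server.make_recovery_routine()
    server.peer_recovery = client.make_recovery_routine()
    client.fallback_state = fallback_state
    server.fallback_state = fallback_state
```

Because the fallback state is fixed when the lease is established, each party can call on_failure_detected independently and still end up in a state the other can predict, even though they cannot communicate at that moment.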