1. Field of the Invention
The present invention relates generally to distributed systems, and more specifically, to recovering allocated resources when a service using those resources fails.
2. Related Art
In a distributed system, such as an interactive television system, a distributed object combines the features of remote procedure calls and object-oriented programming. Interaction between components in the distributed system is implemented via one or more distributed objects. The distributed object ("object") typically represents a service that is provided by a component in the distributed system. The object is defined by an interface which allows access to the service. The object hides the representation of its state and the methods used to implement its operations. The operations that are available on a particular object are defined in a specification for the particular object.
For example, a service might be a file located on a hard drive in a computer in the distributed system. An object representing the service has an interface that includes the operations of read and write. In order to access the service, a requestor of the service need not know the physical address of the file nor even the computer where it exists in the distributed system.
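The file example can be sketched in Python as follows. This is an illustrative sketch only: the names FileService, InMemoryFile, and copy are hypothetical, and an in-memory byte string stands in for the file on a remote hard drive.

```python
from abc import ABC, abstractmethod


class FileService(ABC):
    """Interface of a hypothetical file object: only the operations are visible."""

    @abstractmethod
    def read(self) -> bytes: ...

    @abstractmethod
    def write(self, data: bytes) -> None: ...


class InMemoryFile(FileService):
    """One possible implementation; its state stays hidden behind the interface."""

    def __init__(self) -> None:
        self._data = b""  # assumption: stands in for a file on some remote disk

    def read(self) -> bytes:
        return self._data

    def write(self, data: bytes) -> None:
        self._data = data


def copy(src: FileService, dst: FileService) -> None:
    # The requestor uses only read and write; it never learns where the data lives.
    dst.write(src.read())
```

Because copy depends only on the interface, the implementation behind either object could be replaced without changing the requestor.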
A requestor of a service is referred to as a client of the object providing the service. The object providing the service is referred to as a server of the requestor. Typically, an object has both client and server relationships with other objects.
A server that provides a service makes the service available to a client by first creating an object reference that denotes a particular object representing an instance of the service. The object reference identifies the same object each time the object reference is used. A server accomplishes this by binding the object reference into a name service.
The name service makes the service available to a client when the client requests the service. The name service may do this using replicated contexts. Replicated contexts are discussed in greater detail in the above-referenced copending application entitled "System and Method for Transparently Exporting Replicated Services."
When a client desires to gain access to a particular service, the client requests that the name service provide the client with a specific object reference. This process is referred to as resolving a reference. The client requests a particular service and the name service provides the client with an object reference representing an instance of that service.
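The binding and resolving steps described above can be sketched as a simple lookup table. The class names, the service name "movies", and the fields of ObjectRef are assumptions made for illustration; a real name service would be a remote component, possibly using replicated contexts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRef:
    """Hypothetical object reference: denotes one object on one host."""
    host: str
    object_id: int


class NameService:
    """Minimal in-process stand-in for a name service."""

    def __init__(self):
        self._bindings = {}  # service name -> object reference

    def bind(self, name, ref):
        # Called by a server to make its service available.
        self._bindings[name] = ref

    def resolve(self, name):
        # Called by a client; returns the reference bound under the name.
        return self._bindings[name]


names = NameService()
names.bind("movies", ObjectRef(host="server-1", object_id=7))  # server side
ref = names.resolve("movies")                                   # client side
```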
The client requests aspects of the service by invoking operations on the object reference obtained from the name service or some other service. Object references may be passed as parameters to operations and returned as the result of operations.
Object references are only valid as long as the implementor of the object reference (i.e., the particular component in the distributed system on which the server is running) is alive. If the implementor crashes or halts, the object reference becomes invalid. The client will detect this on the next attempt to use the object reference. The client will then be able to obtain another object reference to another object providing the same service.
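The recovery behavior described above, in which a client detects an invalid object reference on use and then obtains another, might look like the following sketch. All names are hypothetical, and the exclude parameter is an assumption about how a client could avoid being handed the same failed reference again.

```python
class ImplementorDown(Exception):
    """Raised when a reference is used after its implementor has failed."""


class ObjectRef:
    def __init__(self, host):
        self.host = host
        self.alive = True  # assumption: models whether the implementor is running

    def invoke(self, request):
        if not self.alive:
            raise ImplementorDown(self.host)
        return f"{self.host} handled {request}"


class NameService:
    def __init__(self):
        self._refs = {}  # service name -> list of references

    def bind(self, name, ref):
        self._refs.setdefault(name, []).append(ref)

    def resolve(self, name, exclude=()):
        for ref in self._refs.get(name, []):
            if ref not in exclude:
                return ref
        raise LookupError(name)


def call_with_failover(names, service, request):
    failed = []
    while True:
        # Raises LookupError once every bound reference has been tried.
        ref = names.resolve(service, exclude=failed)
        try:
            return ref.invoke(request)
        except ImplementorDown:
            failed.append(ref)  # detected only on use; resolve another reference
```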
Services normally represent resources (e.g., network connections) that are allocated to clients as objects. When a service allocates a resource to a client, the service creates an object that represents the resource and provides the object to the client. Under normal circumstances, when the client finishes using the resource, the client releases the object reference. This allows other clients to subsequently use the same resource. However, if the client fails (i.e., halts, crashes, etc.), the object reference remains valid and the resource remains tied to the failed client. Thus, the resource is not being utilized and yet remains unavailable to other clients desiring the resource. This is referred to as resource leakage.
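Resource leakage can be illustrated with a small sketch of a service that hands out a fixed pool of connections. The class ConnectionService and the client names are hypothetical; the point is only that a resource held by a client that never releases it stays unavailable.

```python
class ConnectionService:
    """Hypothetical service that allocates a fixed pool of connections."""

    def __init__(self, capacity):
        self._free = list(range(capacity))
        self._held = {}  # resource id -> client holding it

    def allocate(self, client):
        if not self._free:
            return None  # nothing left to hand out
        rid = self._free.pop()
        self._held[rid] = client
        return rid

    def release(self, rid):
        # Normal case: the client gives the resource back when finished.
        del self._held[rid]
        self._free.append(rid)

    def available(self):
        return len(self._free)


service = ConnectionService(capacity=1)
leaked = service.allocate("client-a")
# Suppose client-a crashes here and never calls release().
denied = service.allocate("client-b")  # None: the idle resource is still held
```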
Conventional systems use timeouts to recover resources allocated to failed clients. One type of conventional system sets a timeout as the expected length of time the resource would be utilized by the client. When this length of time has expired, the server reclaims the resource. In most cases, this length of time is set conservatively so that the resource is not reclaimed while the client is still utilizing it. In systems where client failure is frequent, this approach is inadequate to stop resource leakage, because each failed client ties up its resources until the conservative timeout expires.
Another type of conventional system sets the timeout for a short period of time. In this system, the client must periodically reallocate the resource. In this type of system, reallocation requests may unnecessarily consume a large amount of network bandwidth. This approach is inadequate in a large distributed network having thousands of clients each utilizing multiple resources.
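The short-timeout approach and its bandwidth cost can be sketched as follows. The LeaseTable class, the explicit now argument (used instead of a real clock so the behavior is deterministic), and the figures in the final line are illustrative assumptions rather than values from any particular system.

```python
class LeaseTable:
    """Hypothetical server-side table of short-lived allocations."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._expiry = {}  # (client, resource) -> time at which the lease lapses

    def allocate(self, client, resource, now):
        self._expiry[(client, resource)] = now + self.ttl

    def renew(self, client, resource, now):
        # The periodic reallocation request a live client must send.
        self._expiry[(client, resource)] = now + self.ttl

    def reap(self, now):
        # Reclaim every resource whose holder has stopped renewing.
        expired = [key for key, t in self._expiry.items() if t <= now]
        for key in expired:
            del self._expiry[key]
        return expired


def renewals_per_second(clients, resources_per_client, renew_interval):
    # Each held resource costs one renewal message per interval.
    return clients * resources_per_client / renew_interval


# Assumed figures: 5,000 clients, 4 resources each, renewing every 5 seconds.
load = renewals_per_second(5000, 4, 5.0)
```

The renewal load grows with the number of clients multiplied by the number of resources each holds, which is why a short timeout scales poorly.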
A third type of conventional system does not use a timeout. In this type of system, the server that allocates resources tracks the clients to which it has allocated resources. When a client exits or fails, the server recovers the resources allocated to that client. This particular approach, referred to as distributed garbage collection, also requires a significant amount of network bandwidth, because each server must periodically check the status of the clients to which it has allocated resources. Thus, this approach is also inadequate in a large distributed network.
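The tracking approach can be sketched as a server that probes each of its clients on every sweep. TrackingServer is a hypothetical name, and the is_alive callable is an assumption that stands in for a network status check; the probes_sent counter shows that the cost is one probe per tracked client per sweep.

```python
class TrackingServer:
    """Hypothetical server that tracks which clients hold its resources."""

    def __init__(self, is_alive):
        self._is_alive = is_alive  # assumption: callable standing in for a network probe
        self._allocations = {}     # client -> set of resources held
        self.probes_sent = 0

    def allocate(self, client, resource):
        self._allocations.setdefault(client, set()).add(resource)

    def sweep(self):
        # Check every tracked client and recover resources from dead ones.
        recovered = set()
        for client in list(self._allocations):
            self.probes_sent += 1
            if not self._is_alive(client):
                recovered |= self._allocations.pop(client)
        return recovered
```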
What is needed is a system and method for recovering resources in a distributed network that reduces resource leakage without requiring significant amounts of network bandwidth.