1. Field of the Invention
The present invention relates generally to distributed systems, and more specifically, to recovering allocated resources when a service using those resources fails.
2. Related Art
In a distributed system, such as an interactive television system, a distributed object combines the features of remote procedure calls and object-oriented programming. Interaction between components in the distributed system is implemented via one or more distributed objects. The distributed object ("object") typically represents a service that is provided by a component in the distributed system. The object is defined by an interface which allows access to the service. The object hides the representation of its state and the methods used to implement its operations. The operations that are available on a particular object are defined in a specification for the particular object.
For example, a service might be a file located on a hard drive in a computer in the distributed system. An object representing the service has an interface that includes the operations of read and write. In order to access the service, a requestor of the service need not know the physical address of the file nor even the computer where it exists in the distributed system.
A requestor of a service is referred to as a client of the object providing the service. The object providing the service is referred to as a server of the requestor. Typically, an object has both client and server relationships with other objects.
A server that provides a service makes the service available to a client by first creating an object reference that denotes a particular object representing an instance of the service. The object reference identifies the same object each time the object reference is used. A server accomplishes this by binding the object reference into a name service.
The name service makes the service available to a client when the client requests the service. The name service may do this using replicated contexts. Replicated contexts are discussed in greater detail in the above-referenced copending application entitled "System and Method for Transparently Exporting Replicated Services."
When a client desires to gain access to a particular service, the client requests that the name service provide the client with a specific object reference. This process is referred to as resolving a reference. The client requests a particular service and the name service provides the client with an object reference representing an instance of that service.
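The binding and resolving steps described above can be illustrated with a minimal in-process sketch. It is not the name service of the invention: the class names `NameService` and `FileObject`, the methods `bind` and `resolve`, and the name "services/file" are all hypothetical, and the sketch ignores distribution and replicated contexts entirely.

```python
class NameService:
    """Toy name service: maps service names to object references."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, object_ref):
        # A server publishes an object reference under a service name.
        self._bindings[name] = object_ref

    def resolve(self, name):
        # A client asks for the object reference bound to a service name.
        try:
            return self._bindings[name]
        except KeyError:
            raise LookupError(f"no object bound to {name!r}") from None


class FileObject:
    """Hypothetical object whose interface offers read and write operations."""

    def __init__(self):
        self._data = ""

    def write(self, text):
        self._data += text

    def read(self):
        return self._data


names = NameService()
names.bind("services/file", FileObject())  # server side: bind the reference
ref = names.resolve("services/file")       # client side: resolve the reference
ref.write("hello")                         # invoke an operation on the reference
```

The client in this sketch never learns where the file is stored; it holds only the reference returned by `resolve`, and resolving the same name again yields the same object.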
The client requests aspects of the service by invoking operations on the object reference obtained from the name service or some other service. Object references may be passed as parameters to operations and returned as the result of operations.
Object references are only valid as long as the implementor of the object reference (i.e., the particular component in the distributed system on which the server is running) is alive. If the implementor crashes or halts, the object reference becomes invalid. The client will detect this on the next attempt to use the object reference. The client will then be able to obtain another object reference to another object providing the same service.
Services normally represent resources (e.g. network connections) that are allocated to clients as objects. When a service allocates a resource to a client, the service creates an object that represents the resource and provides the object to the client. Under normal circumstances, when the client finishes using the resource, the client releases the object reference. This allows other clients to subsequently use the same resource. However, if the client fails (i.e., halts, crashes, etc.), the object reference remains valid and the resource remains tied to the failed client. Thus, the resource is not being utilized, and yet, remains unavailable to other clients desiring the resource. This is referred to as resource leakage.
Conventional systems use timeouts to recover resources allocated to failed clients. One type of conventional system sets a timeout as the expected length of time the resource would be utilized by the client. When this length of time has expired, the server reclaims the resource. In most cases, this length of time is set conservatively so that the resource is not reclaimed while the client is still utilizing it. In systems where client failure is frequent, this system is inadequate to stop resource leakage.
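The conservative-timeout approach can be sketched as follows, assuming a hypothetical `LeaseTable` class and a one-hour timeout chosen only for illustration. Time is passed in explicitly so that the behavior is easy to follow.

```python
class LeaseTable:
    """Tracks allocated resources and reclaims them after a fixed timeout."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._expiry = {}  # resource id -> time at which it may be reclaimed

    def allocate(self, resource_id, now):
        # The timeout is the expected length of time the client needs.
        self._expiry[resource_id] = now + self.lease_seconds

    def reclaim_expired(self, now):
        # The server reclaims only resources whose timeout has elapsed.
        expired = [r for r, t in self._expiry.items() if t <= now]
        for r in expired:
            del self._expiry[r]
        return expired


table = LeaseTable(lease_seconds=3600)  # hypothetical conservative timeout
table.allocate("conn-1", now=0)

# Suppose the client crashes at t=60. At t=600 the resource is still held.
reclaimed_at_10_min = table.reclaim_expired(now=600)
# Only when the full timeout elapses does the server get the resource back.
reclaimed_after_timeout = table.reclaim_expired(now=3600)
```

In this sketch the resource sits idle from the crash until the timeout expires, which is the leakage described above.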
Another type of conventional system sets the timeout for a short period of time. In this system, the client must periodically reallocate the resource. In this type of system, reallocation requests may unnecessarily consume a large amount of network bandwidth. This approach is inadequate in a large distributed network having thousands of clients each utilizing multiple resources.
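A rough estimate shows why reallocation traffic grows quickly. The figures below (5,000 clients, 4 resources per client, a 10-second timeout) are assumptions chosen only for illustration and do not come from any measured system; the function name is likewise hypothetical.

```python
def renewal_messages_per_second(clients, resources_per_client, renewal_interval_s):
    """Steady-state reallocation requests per second across the network.

    Assumes every client reallocates every resource once per interval.
    """
    return clients * resources_per_client / renewal_interval_s


# Hypothetical workload: 5,000 clients, 4 resources each, 10-second timeout.
rate = renewal_messages_per_second(5000, 4, 10)
```

Under these assumptions the network carries 2,000 reallocation requests per second even when no client has failed.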
A third type of conventional system does not use a timeout. In this type of system, the server that allocates resources tracks the clients to which it has allocated resources. When the client exits or fails, the server recovers the resources. This particular approach, referred to as distributed garbage collection, also requires a significant amount of network bandwidth as each server must periodically check the status of the clients to which it has allocated resources. Thus, this approach is also inadequate in a large distributed network.
What is needed is a system and method for recovering resources in a distributed network that reduces resource leakage without requiring significant amounts of network bandwidth.
The present invention is a system and method for recovering resources in a distributed network. The present invention uses a resource audit service that establishes and maintains a status of one or more clients that have received resources from one or more services that allocate those resources. The allocating service is able to determine the status of the client through the resource audit service rather than by monitoring the client itself.
The resource audit service, together with a service controller, implements a callback operation associated with the client. When the allocating service allocates a resource to the client, the allocating service registers a callback with the resource audit service identifying the client as a recipient of the resource. The resource audit service subsequently monitors the client. Upon failure of the client, the resource audit service performs the callback to the allocating service notifying it of the failure of the client. After receiving the callback, the allocating service can recover the resource from the client.
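The callback flow described above might be sketched as follows. All class and method names are hypothetical, and the sketch makes a simplifying assumption: client failure is reported by calling `client_failed` directly, whereas in the invention the resource audit service works with a service controller to detect it.

```python
class ResourceAuditService:
    """Holds callbacks per client and fires them when the client fails."""

    def __init__(self):
        self._callbacks = {}  # client id -> list of callbacks

    def register_callback(self, client_id, callback):
        self._callbacks.setdefault(client_id, []).append(callback)

    def client_failed(self, client_id):
        # Assumption: failure detection happens elsewhere and ends up here.
        for callback in self._callbacks.pop(client_id, []):
            callback(client_id)


class AllocatingService:
    """Allocates resources and recovers them on a failure callback."""

    def __init__(self, audit, resources):
        self._audit = audit
        self._free = list(resources)
        self._in_use = {}  # client id -> resources held by that client

    def allocate(self, client_id):
        resource = self._free.pop()
        self._in_use.setdefault(client_id, []).append(resource)
        # Register a callback naming the client as recipient of the resource.
        self._audit.register_callback(client_id, self.on_client_failed)
        return resource

    def on_client_failed(self, client_id):
        # Recover every resource the failed client held; safe to call twice.
        self._free.extend(self._in_use.pop(client_id, []))

    def free_count(self):
        return len(self._free)


audit = ResourceAuditService()
service = AllocatingService(audit, ["conn-1", "conn-2"])
service.allocate("client-A")
free_before = service.free_count()  # one resource is held by client-A
audit.client_failed("client-A")     # audit service performs the callback
free_after = service.free_count()   # the resource has been recovered
```

Note that the allocating service in this sketch never contacts the client after allocation; it learns of the failure only through the callback.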
One of the features of the present invention is that the allocating service does not have to directly monitor the status of each client to which it has allocated resources. The resource audit service provides a central location from which the status of each client can be determined.
Another feature of the present invention is that an allocating service is not required to poll or ping the client. For clients that otherwise may have long response times to polling or pinging, the resource audit service eliminates false determinations that the client has failed.
A further feature of the present invention is that the allocating service is released immediately after initiating a status check of the client with the resource audit service. After the status check is initiated, the resource audit service determines the status and notifies the allocating service regarding the failure of the client via a callback.
Yet another feature of the present invention is that the allocating service is able to determine the status of a non-local client through the resource audit service. In this case, the instance of the resource audit service that is local to the allocating service polls the instance of the resource audit service that is local to the non-local client. That remote instance returns the status of the non-local client to the instance local to the allocating service. Thus, only the various instances of the resource audit service utilize network bandwidth to determine the status of various clients operating in the system. Ultimately, this reduces the overall flow of messages occurring in the network devoted to determining client status.
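The forwarding between instances can be sketched as follows, assuming one hypothetical `AuditInstance` per node and a direct method call standing in for the network message between instances. The node and client names are illustrative only.

```python
class AuditInstance:
    """One resource audit service instance, local to a single node."""

    def __init__(self, node):
        self.node = node
        self.local_clients = {}  # client id -> True if the client is alive
        self.peers = {}          # node name -> AuditInstance on that node
        self.remote_queries = 0  # messages sent to other instances

    def connect(self, other):
        self.peers[other.node] = other
        other.peers[self.node] = self

    def is_alive(self, client_node, client_id):
        if client_node == self.node:
            # Local client: answer from local knowledge, no network traffic.
            return self.local_clients.get(client_id, False)
        # Non-local client: ask the instance that is local to that client.
        self.remote_queries += 1
        return self.peers[client_node].is_alive(client_node, client_id)


node_a = AuditInstance("node-a")  # local to the allocating service
node_b = AuditInstance("node-b")  # local to the client
node_a.connect(node_b)

node_b.local_clients["client-7"] = True
status_up = node_a.is_alive("node-b", "client-7")

node_b.local_clients["client-7"] = False  # the client fails
status_down = node_a.is_alive("node-b", "client-7")
```

Allocating services on node-a would all ask their local instance, so in this sketch only the audit instances exchange messages across nodes.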
Still another feature of the present invention is the simple start up and recovery mechanism used by the resource audit service. This mechanism allows the resource audit service to start up and recover without remembering or being aware of the state of any clients or callbacks associated with the clients. This reduces the overall complexity of the entire distributed system.
Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.