All of the material in this patent application is subject to copyright protection under the copyright laws of the United States and of other countries. As of the first effective filing date of the present application, this material is protected as unpublished material. However, permission to copy this material is hereby granted to the extent that the copyright owner has no objection to the facsimile reproduction by anyone of the patent documentation or patent disclosure, as it appears in the United States Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Not Applicable
This invention relates to distributed computer systems and more particularly to an improved system and method for movement of services among nodes.
Distributed computer systems can store enormous amounts of information that can be accessed by users for identification and retrieval of valuable documents that contain data, text, audio and video information. A typical example of a distributed system (100) is shown in FIG. 1. This invention applies to distributed applications (104a to 104n and 112a to 112x) running on a distributed computer system. A distributed computer system consists of computer nodes (102a to 102n and 106a to 106x) and a communication network (114) that allows the exchange of messages between computer nodes. A distributed application (104a to 104n and 112a to 112x) is a program running on multiple nodes in the distributed system, which work together to achieve a common goal (e.g., a parallel scientific computation, a distributed database, or a parallel file system). In addition, shared disk storage (108a to 108z) may be available for the storage of data by the computer nodes.
Certain operations of a distributed application may require coordination between all of the participating nodes. A common technique for implementing such operations is to appoint a single xe2x80x9ccoordinatorxe2x80x9d node (110) that performs all operations that require such coordination. If another node needs to execute one of these operations, it does so by sending a message to the coordinator node, which will perform the operation on behalf of the other node and then send a reply with the result of the operation to the requesting node.
In order to make a distributed application fault tolerant, another node must be able to take over the service provided by the coordinator node in case the coordinator fails. In order to take over the service, the new coordinator node may need to rebuild state that was maintained by the old coordinator. Well-known techniques for rebuilding the necessary state include the use of stable storage (e.g., dual-ported disks or network attached, shared disks), and the collection of information from the other nodes in the system as well as disk striping of rebuild information, RAID arrays or the equivalent.
Often it is desirable to move coordinator services even when there are no failures. For example, when using a primary/backup scheme for fault tolerance (see FIG. 3), if the primary had failed (302) and the backup node is acting as coordinator (304), then when the primary begins to recover (306) and then becomes available again (308), it is desirable to move coordinator services from the backup node back to the primary node (310). Another example is the migration of coordinator services between nodes in order to balance the CPU load or other resource usage among the available nodes. Although it would be possible to use fail-over code to force a coordinator function to move from a node N1 to a node N2, this may be disruptive to the distributed application for a variety of reasons; the application may even be forced to shut down and restart on N1 and/or on other nodes.
Previously known methods for migrating coordinator services only work under certain restrictive assumptions about the distributed application and/or the coordinator services it uses namely:
1. If it is possible to interrupt and xe2x80x9ccleanlyxe2x80x9d abort pending coordinator operations without disrupting other parts of the distributed application on the same node, then it is possible to migrate services from one node to another in very much the same way as in case of a node failure. However, this assumes that coordinator operations do not share any data structures with other parts of the distributed application, or at least that it is possible to undo the effect of a partially completed operation so that the shared data structures can be restored to a consistent state. Furthermore, this approach only works if such undo actions do not require invoking additional coordinator operations.
2. If there are no dependencies between coordinator operations, then it is possible to suspend all new operations, wait for all pending operations to complete, and then migrate coordinator services to another node without disruption. For example, the IBM(copyright) Recoverable Virtual Shared Disk (RVSD) product for the RS/6000(copyright) SP allows migrating a virtual disk server from a primary server node to a backup server node and back. It does so by suspending all VSD requests (requests to read and write a disk block) prior to migrating a VSD server to the other node. This approach works because each disk I/O request is independent (completing one disk I/O request does not require the VSD server to issue additional I/O request or any other requests to other services). If operations are not independent, this approach can deadlock.
3. If there are dependencies between coordinator operations, it may be possible to group these operations into a set of distinct services such that operations belonging to the same service are independent of each other. In this case it may be possible to use traditional methods to migrate each of these services independently, one at a time.
Although these prior art migration services are useful, all of the above methods have their shortcomings. A shortcoming of method two described above is that it cannot operate for services that are interdependent. Method one cannot be applied when the application does not allow pending operations to simply be aborted. Method three cannot be used when these services cannot be migrated one at a time, for example, because all of the services depend on some common infrastructure that cannot easily be split along service boundaries. Accordingly, a need exists for migration services that can operate when the services are interdependent.
Another shortcoming with known methods for migrating services is the inability not only to handle services that are directly interdependent but indirectly interdependent as well. For example in order to process an operation OP1, a service may need to invoke another service operation OP2; hence OP1 directly depends on OP2. And an example of an indirect dependency would be if operation OP1 required a shared resource, e.g., a lock, that might be held by an unrelated operation OP3 on the same or another node; OP3 might need to invoke a service operation OP4 before it can release the resource it is holding. In this example, OP1 would indirectly depend on OP4. The prior art method of simply suspending all new operations would lead to deadlock since OP1 cannot compete until OP4 completes. Accordingly, a need exists for migration services that can operate when the services are not only directly interdependent but indirectly interdependent as well.
This invention allows a related set of coordinator services to migrate from one node to another without disrupting applications on any of the nodes in the system. Unlike other methods, this invention allows a set of interdependent services to be quiesced and migrated together. Since service operations depend upon the results and/or data of other service operations, any particular operation can only complete properly when those other operations return data necessary for the completion of the dependent operation. Therefore, this invention permits the non-disruptive migration by phasing the xe2x80x9cquiescencexe2x80x9d of the services. Operations that are most dependent upon other operations are suspended before those other operations; then the process waits for any current operations to complete. Once the first phase of dependent operations have completed, the next phase of dependent operational services are suspended, the process waits for completion of those operations and so on until there are no more phases to the xe2x80x9cquiescencexe2x80x9d process. Hence the invention is applicable to more complex distributed applications than previous methods.