Complex procedures are the norm in today's interconnected computing world. A “complex” procedure is one whose successful completion requires the successful completion of a number of separate steps. If any one of these separate, constituent steps fails, then the procedure as a whole also fails.
Advances in communications allow the various steps of a complex procedure to be performed on different computing devices. In some cases, the use of multiple computing devices is inherent in the procedure itself, as when a client requests a resource from a resource server. This common scenario becomes even more complex if the resource server asks an authentication server to verify the client's credentials before fulfilling the request. The client's request will fail if any one of the three devices (the client, the resource server, or the authentication server) fails to perform its part of the transaction. The client's request can also fail because of a communications failure in the networks connecting these three devices.
In other cases, the complex procedure can be performed on a single computing device, but multiple devices are invoked to speed the procedure. For example, an intensive mathematical computation is broken into steps, and the steps are distributed to individual servers. As one hedge against possible failure, the same computational step can be distributed to a number of servers. In any case, a failure of one step causes the entire computation to fail or, in the case of redundant servers, can slow down the production of the final result.
The potential for trouble in a multi-step procedure increases when the procedure involves multiple databases. Here, a failure can not only prevent a client's database request from being fulfilled, but can also leave the databases in inconsistent states, i.e., “unsynchronized.” For a simplified example, consider a computing environment with two resource servers and a directory server that directs client resource requests to the appropriate resource server. Moving a resource from one resource server to another (in order to, for instance, balance the load of requests between the resource servers) involves the updating of both of the resource servers and of the directory server. A failure in the multi-step resource movement procedure could leave the directory server directing client requests to a resource server that no longer has, or does not yet have, the appropriate resource.
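The resource-movement example above can be sketched in a few lines. This is only an illustrative in-memory model, not an actual implementation: the server names "A" and "B", the resource name "report", the three-step breakdown (copy, delete, redirect), and the `fail_at` switch are all assumptions made for the demonstration.

```python
class StepFailed(Exception):
    """Raised when one step of the move does not complete."""


def new_environment():
    # Hypothetical state: two resource servers and one directory server.
    servers = {"A": {"report"}, "B": set()}
    directory = {"report": "A"}
    return servers, directory


def move_resource(servers, directory, name, src, dst, fail_at=None):
    """Move `name` from `src` to `dst` as three separate steps.

    `fail_at` is a demo-only switch naming the step at which to fail.
    """
    steps = [
        ("copy", lambda: servers[dst].add(name)),
        ("delete", lambda: servers[src].discard(name)),
        ("redirect", lambda: directory.update({name: dst})),
    ]
    for label, action in steps:
        if label == fail_at:
            raise StepFailed(label)
        action()


def directory_is_valid(servers, directory, name):
    # True if the directory points at a server that actually holds the resource.
    return name in servers[directory[name]]


servers, directory = new_environment()
try:
    move_resource(servers, directory, "report", "A", "B", fail_at="redirect")
except StepFailed:
    pass
print(directory_is_valid(servers, directory, "report"))  # False
```

Here the copy and delete steps succeed but the directory update does not, so the directory still sends clients to server "A", which no longer holds the resource.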
Techniques have been developed to mitigate failures in multi-step procedures that, like the situation given above, involve multiple databases. If all of the computing devices involved in the procedure use the same type of database, then a well-known “two-phase commit” process can be invoked. The two-phase commit is designed to keep the databases synchronized at their pre-procedure state if an error occurs at any time during the procedure. In the first phase, each of the databases involved receives an update command. A transaction monitoring system then issues a “pre-commit” message to each database. If a database can successfully perform the update, then it temporarily stores the update and acknowledges the pre-commit command. In the second phase, if the transaction monitor has received acknowledgements from all of the databases involved, then it issues to them a “commit” message. Upon receiving the commit, each database makes the temporary change permanent. The procedure has been successfully performed, and the databases are now synchronized in their post-procedure state. If, on the other hand, the transaction monitor does not receive all of the expected acknowledgements, then the multi-step database update procedure has failed, and the temporary changes at each database are discarded. Although the procedure has failed, the databases remain synchronized in their pre-procedure state. Because of this synchronization, it is possible either to retry the multi-step database update procedure or to abandon the attempt, safely in both cases.
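The two-phase commit described above can be illustrated with a minimal in-memory sketch. The `Participant` class, the `two_phase_commit` function, and the `can_apply` switch are hypothetical names introduced for this illustration only; for brevity, the update command and the pre-commit message are folded into a single call, and network failures and timeouts are not modeled.

```python
class Participant:
    """Hypothetical database taking part in a two-phase commit."""

    def __init__(self, name, can_apply=True):
        self.name = name
        self.can_apply = can_apply  # demo switch: False simulates a failed update
        self.data = {}              # permanent state
        self.pending = None         # temporarily stored update

    def pre_commit(self, update):
        """Phase 1: stage the update and acknowledge, or refuse."""
        if not self.can_apply:
            return False
        self.pending = dict(update)
        return True

    def commit(self):
        """Phase 2: make the staged update permanent."""
        self.data.update(self.pending)
        self.pending = None

    def discard(self):
        """Drop any staged update, leaving permanent state untouched."""
        self.pending = None


def two_phase_commit(participants, update):
    """Play the role of the transaction monitor; return True on success."""
    acks = [p.pre_commit(update) for p in participants]
    if all(acks):
        for p in participants:
            p.commit()
        return True
    # At least one acknowledgement is missing: discard every temporary change.
    for p in participants:
        p.discard()
    return False
```

If every participant acknowledges, all of them end in the post-procedure state; if any one refuses, all of them keep their pre-procedure state, so the caller can retry or give up.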
Useful as it is, the two-phase commit applies only to a limited range of procedures. It works well only if all of the servers involved use the same type of database. (It can be implemented across different database types, but at a significant increase in cost and complexity.) It also does not work where the multi-step procedure calls for changes to data structures other than databases.
Another useful technique for managing errors during a multi-step procedure is the “rollback.” Here, for each step of the multi-step procedure, a method is developed for “rolling back,” or undoing, the results of that step. When a step fails, its partial results are rolled back. The results of previous, successfully performed steps can also be rolled back. This continues with all of the involved devices until they are all in their pre-procedure state. Then, just as in the case of a failed two-phase commit, it is safe either to retry the multi-step procedure or to abandon the attempt.
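A minimal sketch of the rollback technique follows, assuming each step is supplied as a pair of callables: one that performs the step and one that undoes it. The function name `run_with_rollback` and the set-based `state` are hypothetical and chosen only for illustration; the sketch also assumes that the undo callables themselves never fail.

```python
def run_with_rollback(steps):
    """Run (do, undo) pairs in order; on failure, undo everything done so far.

    Returns True if every step succeeded, False if the procedure was rolled back.
    """
    completed_undos = []
    for do, undo in steps:
        try:
            do()
        except Exception:
            undo()  # roll back the partial results of the failed step
            for earlier_undo in reversed(completed_undos):
                earlier_undo()  # then undo earlier steps, most recent first
            return False
        completed_undos.append(undo)
    return True


state = set()


def failing_step():
    state.add("partial")  # leaves a partial result behind before failing
    raise RuntimeError("step failed")


steps = [
    (lambda: state.add("a"), lambda: state.discard("a")),
    (lambda: state.add("b"), lambda: state.discard("b")),
    (failing_step, lambda: state.discard("partial")),
]
succeeded = run_with_rollback(steps)
print(succeeded, state)  # False set()
```

After the third step fails, its partial result and the results of the first two steps are undone, returning `state` to its pre-procedure (empty) condition.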
The rollback procedure, though in theory more widely applicable than the two-phase commit, has its own serious drawbacks. First, just like any other step in the multi-step procedure, each rollback step can itself fail. To handle this, a method of rolling back each rollback step is developed. This illustrates the second drawback: Adding rollback steps to a multi-step procedure complicates an already complicated scenario. This additional complication increases both the development and the processing costs of the multi-step procedure and may actually decrease the overall probability of the procedure's success.