1. Field of the Invention
The present invention relates generally to distributed operating systems, and more particularly to a method and an apparatus for ensuring proper semantics for operations that are restarted after a node failure in a highly available networked computer system.
2. Related Art
As computer networks are increasingly used to link stand-alone computer systems together, distributed operating systems have been developed to control interactions between multiple linked computer systems on a computer network. Distributed operating systems generally allow client computer systems to access resources on server computer systems. For example, a client computer system can usually access information contained in a database on a server computer system. However, when the server fails, it is desirable for the distributed operating system to automatically recover from this failure. Distributed computer systems possessing the ability to recover from such server failures are referred to as "highly available systems," and data objects stored on such highly available systems are referred to as "highly available data objects."
To function properly, the highly available system must be able to detect a server failure and to reconfigure itself so that accesses to objects on the failed primary server are redirected to backup copies on a secondary server. This process of switching over to a backup copy on the secondary server is referred to as a "failover."
After a failover, operations that were in progress on the primary server at the time of the failure must be restarted on the secondary server. One problem with restarting failed operations is that the primary server may have generated some external effects while performing an operation, and these effects may interfere with restarting the operation. For instance, consider a file system with a remove operation that removes a file from stable storage (e.g., disk) if the file exists, and otherwise returns an error. If the primary server fails after removing the file and the operation is restarted on the secondary server, the secondary server will find the file missing and will return an error, even though the removal actually succeeded. Thus, some operations cannot simply be repeated, i.e., the operations are not idempotent.
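By way of illustration only, the non-idempotent remove operation described above can be sketched as follows. The sketch assumes that stable storage is visible to both servers and models it as an in-memory set; the names `remove` and `naive_failover_remove` and the file name are hypothetical and are not taken from any particular system.

```python
def remove(storage, name):
    """Remove `name` from stable storage, or return an error if it is absent."""
    if name not in storage:
        return "ERROR"
    storage.remove(name)
    return "OK"


def naive_failover_remove():
    """Repeat a remove blindly after the primary server fails."""
    # Assumption: stable storage is shared by the primary and secondary servers.
    storage = {"report.txt"}

    # The primary server removes the file, then fails before replying.
    primary_result = remove(storage, "report.txt")

    # The secondary server simply repeats the same operation.
    retry_result = remove(storage, "report.txt")
    return primary_result, retry_result
```

In this sketch the client would receive the error produced by the retry, although the file was in fact removed by the primary server.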
One solution to this problem is to send a checkpoint message from the primary server to the secondary server that contains enough data for a repeated operation to be performed correctly. For instance, in the previous example, the primary server can send a message to the secondary server stating whether or not the file exists. If the primary server fails and the secondary server receives the operation, the secondary server can check if the file exists. If the file does not currently exist, but the file existed on the primary server, the secondary server can conclude that the primary server completed the operation and can return success. Thus, the checkpointed data makes it possible for the secondary server to test if the operation completed on the primary server.
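The checkpoint-and-test approach described above might look roughly like the following sketch. It assumes shared stable storage modeled as a set, a dictionary standing in for checkpoint messages already delivered to the secondary server, and hypothetical operation identifiers; none of these names come from an actual implementation.

```python
def primary_remove(storage, checkpoints, op_id, name):
    """Primary server: checkpoint whether the file exists, then remove it."""
    # The checkpoint records the state observed before the operation ran.
    checkpoints[op_id] = name in storage
    if name not in storage:
        return "ERROR"
    storage.remove(name)
    return "OK"


def secondary_restart_remove(storage, checkpoints, op_id, name):
    """Secondary server: restart a remove using the checkpointed state."""
    existed_on_primary = checkpoints.get(op_id)  # None if no checkpoint arrived
    if name in storage:
        # The primary server did not finish; perform the removal now.
        storage.remove(name)
        return "OK"
    if existed_on_primary:
        # File existed at checkpoint time but is gone: the primary completed.
        return "OK"
    # The file never existed, so the original operation would have failed too.
    return "ERROR"
```

With a single outstanding operation, the restarted remove in this sketch returns the same result the primary server would have returned.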
This approach will succeed if there is only one outstanding operation from the one or more clients. However, to improve system performance, it is often desirable to keep multiple operations in progress at one time for the one or more clients. In this case, operations may not be correctly restartable, even with the above-mentioned testing approach. For instance, suppose a first client sends a first operation to create a file while a second client sends a second operation to remove the same file. The primary server, when performing the first operation, will send a checkpoint to the secondary server saying the file does not currently exist. Suppose the primary server fails at this point and the operations are redirected to the secondary server. If the file create operation is restarted first, the secondary server will detect correctly that the primary server created the file and proper semantics will be preserved. However, if the remove operation is restarted first, the newly-created file will be successfully removed. Then, when the create operation is restarted, the secondary server will detect the absence of the file, will incorrectly conclude that the primary did not create the file, and will perform the create operation. In this case, the file will exist even though the remove operation apparently succeeded. This situation is a case of improper semantics.
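The ordering problem described above can be reproduced with a small simulation. This is a sketch under stated assumptions: stable storage is shared and modeled as a set, the primary server fails after creating the file but before starting the remove operation, and a create operation returns an error when the file already exists. All function names are hypothetical.

```python
def restart_create(storage, existed_at_checkpoint, name):
    """Secondary server: restart a create using the checkpointed state."""
    if name in storage:
        # Absent at checkpoint time but present now: assume the primary created it.
        return "ERROR" if existed_at_checkpoint else "OK"
    # File is absent, so conclude the primary did not create it and create it now.
    storage.add(name)
    return "OK"


def restart_remove(storage, name):
    """Secondary server: perform a remove that the primary never started."""
    if name not in storage:
        return "ERROR"
    storage.remove(name)
    return "OK"


def simulate(order):
    """Replay both operations on the secondary server in the given order."""
    storage = {"f"}                # the primary created "f" before failing
    existed_at_checkpoint = False  # checkpoint: "f" did not exist before the create
    results = {}
    for op in order:
        if op == "create":
            results[op] = restart_create(storage, existed_at_checkpoint, "f")
        else:
            results[op] = restart_remove(storage, "f")
    return results, "f" in storage
```

In this sketch both orderings report success for both operations, yet `simulate(["create", "remove"])` leaves the file absent while `simulate(["remove", "create"])` leaves it present, because the create is effectively performed twice in the second ordering.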
Even if the operations are restarted in their original order, multiple operations can still cause problems. For instance, consider the three operations "rename file A to C," "rename file B to A" and "rename file C to B." If these three operations take place, files A and B will have traded places. If these three operations are restarted, the secondary server cannot simply test for the existence of files A and B to determine whether the operations completed, since A and B will exist in either case. If the secondary server makes the wrong decision, the files A and B may be swapped twice or not at all. Thus, in the case where multiple operations occur simultaneously, making operations testable is not sufficient to ensure proper semantics. The above-mentioned problem does not arise for a single server, because locking can be performed on that server to ensure proper semantics. Providing such locking across multiple servers in a highly available system is possible, but it can greatly impede system performance.
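The rename example above can be illustrated with a short sketch that models a directory as a dictionary from file names to contents. The helper names and the placeholder contents are hypothetical and serve only to show why an existence test cannot distinguish the two states.

```python
RENAMES = [("A", "C"), ("B", "A"), ("C", "B")]


def rename(files, src, dst):
    """Rename `src` to `dst` in a directory modeled as a dict."""
    files[dst] = files.pop(src)


def apply_renames(files):
    """Return a copy of `files` after the three renames are applied in order."""
    result = dict(files)
    for src, dst in RENAMES:
        rename(result, src, dst)
    return result


before = {"A": "contents-1", "B": "contents-2"}
after_once = apply_renames(before)       # A and B have traded places
after_twice = apply_renames(after_once)  # restarted by mistake: swapped back

# The set of existing names is the same in every state, so a test based
# only on the existence of A and B cannot tell these states apart.
same_names = set(before) == set(after_once) == set(after_twice)
```

Here `after_once` holds the swapped contents, while `after_twice` is identical to `before`, which corresponds to the case where the files are swapped twice.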
What is needed is a method and an apparatus that ensures proper semantics when operations are restarted after a node failure in a highly available system.