1. Field of the Invention
The present invention generally relates to operating systems for fault-tolerant distributed computing systems. More particularly, the present invention relates to a system and method that supports asynchronous I/O requests that can switch to a secondary server if a primary server for the I/O request fails.
2. Related Art
As computer networks are increasingly used to link stand-alone computer systems together, distributed operating systems have been developed to control interactions between multiple computer systems on a computer network. Distributed operating systems generally allow client computer systems to access resources or services on server computer systems. For example, a client computer system may access information contained in a database on a server computer system. However, when the server fails, it is desirable for the distributed operating system to automatically recover from this failure without the client process being aware of the failure. Distributed computer systems possessing the ability to recover from such server failures are referred to as "highly available systems," and data objects stored on such highly available systems are referred to as "highly available data objects."
To function properly, a highly available system must be able to detect a failure of a primary server and reconfigure itself so that accesses to objects on the failed primary server are redirected to backup copies on a secondary server. This process of switching over to a backup copy on the secondary server is referred to as a "failover."
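The detect-and-redirect behavior described above can be illustrated with a minimal sketch. It is not the mechanism of the present invention; the names `Replica`, `ServerDown`, and `HighlyAvailableObjectStore` are hypothetical, and a raised exception is assumed to stand in for whatever failure detection a real system would use.

```python
class ServerDown(Exception):
    """Hypothetical signal that a server has failed."""


class Replica:
    """A server holding a copy of the data objects (illustrative only)."""

    def __init__(self, name, objects):
        self.name = name
        self.objects = dict(objects)
        self.alive = True

    def read(self, key):
        if not self.alive:
            raise ServerDown(self.name)
        return self.objects[key]


class HighlyAvailableObjectStore:
    """Routes reads to the primary and fails over to the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.failovers = 0

    def read(self, key):
        try:
            return self.primary.read(key)
        except ServerDown:
            if self.secondary is None:
                raise  # no backup copy left to switch to
            # Failover: redirect accesses to the backup copy.
            self.primary, self.secondary = self.secondary, None
            self.failovers += 1
            return self.primary.read(key)
```

In this sketch the caller of `read` sees the same result before and after the primary fails, which is the property that makes the failure invisible to the client.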
Asynchronous I/O requests are particularly hard to implement in highly available systems. An asynchronous I/O request allows a process to initiate the request and continue processing while the request is in progress. In this way, the process continues doing useful work instead of blocking, thereby increasing system performance. Unfortunately, a process typically has little control over when an asynchronous I/O request completes. This lack of control over timing can create problems in highly available systems, which must be able to recover from a primary server failure that can occur at any time while an asynchronous I/O request is in progress.
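As a rough illustration of the asynchronous pattern just described, the following sketch initiates a request, performs other work while the request is outstanding, and only then collects the result. It assumes a thread pool as a stand-in for an operating system's asynchronous I/O facility; `slow_read` and `run_with_async_io` are hypothetical names, and the delay is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_read(block_id):
    """Stand-in for an I/O request serviced elsewhere (hypothetical)."""
    time.sleep(0.05)  # arbitrary delay; the caller does not control it
    return f"data-{block_id}"


def run_with_async_io(block_id):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Initiate the request; submit() returns without waiting for it.
        future = pool.submit(slow_read, block_id)
        # The process keeps doing useful work instead of blocking.
        work_done = sum(range(1000))
        # Collect the result; this waits only if the request is unfinished.
        data = future.result()
    return data, work_done
```

The caller decides when to ask for the result but not when the request actually completes, so a server failure could fall anywhere between `submit` and `result`.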
What is needed is a highly available system that supports asynchronous I/O requests that can switch to a secondary server if a primary server for the I/O request fails.