In a network environment, it is critical for highly available applications, such as electronic mail applications, to transfer files among computers within the network quickly and reliably. For this reason, such highly available applications often require failures in transferring files to be detected very quickly. If a computer that is a source of a file being transferred fails during the transfer, the highly available application will usually attempt to access the file from another source computer in the network or begin a recovery procedure if no other source computer is available in the network. Typically, a network file sharing implementation is utilized by the application to transfer files from a source computer to a destination computer.
Known network file sharing implementations use a connection-based transport protocol between the client and the server. Over such a connection, a file sharing protocol carries requests between the client and the server as operations are performed. For example, a client may request that a file be copied from the server to the client via a file share. The file sharing implementation includes timeouts and behaviors to detect when the server is no longer available. Typically, these timeouts are between 30 and 45 seconds in order to allow the server time to respond when under load and to minimize network traffic.
When a highly available application is built on top of the file sharing protocol, the application must accept the failure detection time of the file sharing implementation. Additionally, highly available applications often implement multiple file shares to avoid a single point of failure. Redundant file shares, however, undesirably extend the failure detection time because the application must try each file share in turn before it can conclude that a failure has occurred.
For example, suppose the highly available application can copy a file from two different source computers on the network. The application requests the file from the first source computer and, about 30 seconds later, receives a timeout indicating that the first source is unavailable. Next, the application requests the file from the second source computer. Unfortunately, the application must wait another 30 seconds (60 seconds total) to determine whether the second computer is also unavailable. Therefore, a highly available application utilizing the file sharing protocol cannot detect a failure at the desired speed (e.g., within 2-5 seconds).
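The arithmetic in the preceding example can be sketched in a few lines of Python. This is only an illustrative model, not part of any file sharing implementation: the function name `failover_detection_time`, the list of availability flags, and the default 30-second timeout are assumptions chosen to mirror the example.

```python
def failover_detection_time(source_available, timeout_s=30.0):
    """Model sequential failover across redundant file shares.

    source_available: list of booleans, one per source computer, in the
        order the application tries them (hypothetical input).
    timeout_s: assumed failure-detection timeout of the file sharing
        implementation, in seconds.

    Returns (index of the first available source or None,
             seconds spent waiting on timeouts).
    """
    elapsed = 0.0
    for index, available in enumerate(source_available):
        if available:
            # The request succeeds; no further timeout is incurred.
            return index, elapsed
        # An unavailable source is only detected after a full timeout.
        elapsed += timeout_s
    # Every source timed out, so the failure is known only now.
    return None, elapsed


# Two sources, both unavailable: the failure is detected after 60 seconds.
print(failover_detection_time([False, False]))  # (None, 60.0)
```

Under this model the worst-case detection time grows linearly with the number of redundant shares, which is why adding shares for availability works against fast failure detection.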
One possible solution to the problem requires the highly available application to monitor all of the source computers in the network. To achieve a desired 2-5 second timeout, the application would be required to send a message or other communication to all source computers approximately every 0.5 seconds. However, this type of complex monitoring requires a great deal of overhead and must be very efficient to avoid degrading the network and the server executing the highly available application. The network degradation will be amplified if more than one server is executing the highly available application since each instance of the application will need to monitor all source computers.
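To make the overhead of such monitoring concrete, the following Python sketch estimates the aggregate message rate. It is a rough model under stated assumptions: the function name `probe_rate` and the counts of source computers and application instances are hypothetical, and only the 0.5-second interval comes from the text above.

```python
def probe_rate(num_sources, num_instances, interval_s=0.5):
    """Estimate monitoring messages sent per second across the network.

    num_sources: number of source computers being monitored (hypothetical).
    num_instances: number of servers running the application, each of
        which monitors every source computer (hypothetical).
    interval_s: time between messages to a given source, in seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Each instance sends one message per source every interval_s seconds.
    return num_instances * num_sources / interval_s


# One instance monitoring 10 sources sends 20 messages per second.
print(probe_rate(10, 1))  # 20.0
# Three instances monitoring the same 10 sources send 60 per second.
print(probe_rate(10, 3))  # 60.0
```

The rate scales with the product of instances and sources, which illustrates why the degradation is amplified when more than one server executes the application.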
Another possible solution to the problem requires the highly available application to implement a new file sharing protocol with a shorter timeout. Unfortunately, this would also greatly increase the complexity and overhead involved in executing the highly available application. In this instance, the highly available application would not be extensible because a new file sharing protocol would have to be implemented every time a new file system or network protocol is introduced on a source computer. Additionally, implementing a new protocol would be a waste of resources since most networks already implement some form of file sharing.