As is generally known, a remote procedure call (RPC) allows a subroutine or procedure to execute in another address space (typically on another computer, such as a server, on a shared network). A client initiates an RPC request (e.g., for data) by sending a message to a remote server, which then executes the specified subroutine or procedure. Data from the remote server is then typically returned to the client.
A problem with remote procedure calls is that an RPC request can fail because of network problems that are unknown to the client (e.g., a transmission problem). Consequently, the client may not know whether the RPC request was actually received by the server, or whether the requested procedure was invoked.
In an attempt to address this problem, a timeout period is associated with each RPC request. If the timeout period associated with a particular RPC request expires before a response to that request is received, then the client may reissue the RPC request and restart the timeout period. Timeout periods are usually tens of seconds (e.g., 30 seconds or more) in length. In fact, an RPC request may have two types of timeout periods: one for retrying the request, and one for failing it. As such, an RPC request may not fail until after it has been retried several times, so the total delay before failure may be on the order of minutes.
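The two timeout periods described above can be illustrated with a minimal sketch. It is not taken from any particular RPC implementation: the names `call_with_retry`, `retry_timeout`, `fail_timeout`, and `RpcFailed`, and the `send` transport function, are all hypothetical and introduced only for illustration.

```python
import time


class RpcFailed(Exception):
    """Raised when the failure timeout expires without any response."""


def call_with_retry(send, retry_timeout, fail_timeout, clock=time.monotonic):
    """Reissue an RPC request until a response arrives or fail_timeout elapses.

    `send(timeout)` is an assumed transport function: it returns the
    response, or raises TimeoutError if none arrives within `timeout` seconds.
    """
    deadline = clock() + fail_timeout  # second timeout: when to give up
    attempts = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise RpcFailed(f"no response after {attempts} attempts")
        attempts += 1
        try:
            # First timeout: how long to wait before reissuing the request.
            return send(min(retry_timeout, remaining))
        except TimeoutError:
            continue  # reissue the request and restart the timeout period
```

Under these assumptions, a 30-second retry timeout with a 120-second failure timeout means that a request sent to an unresponsive server is issued four times, and the caller learns of the failure only after two minutes.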
The server to which the client is sending the RPC request may be inoperative but, as mentioned above, the client may not be aware of that. Hence, the client will continue to send RPC requests to that server until either a network administrator (human or machine) or the client itself realizes that the server is inoperative. Because of the length of the timeout period, it may take a relatively long time for this realization to occur, which can significantly degrade the client's performance. If the client has multiple outstanding RPC requests, the effect on performance can be even more severe, because the client typically must wait for all of the RPC requests to time out. Moreover, execution on the client can be further delayed while the client waits to be assigned to an alternate server that can handle its RPC requests.
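The cost of waiting on several outstanding requests can be made concrete with a small worked calculation. The model below is an assumption made for illustration only: each request is taken to fail after a fixed number of attempts (`max_attempts`), each lasting `retry_timeout` seconds, and the function name `time_until_all_failed` is hypothetical.

```python
def time_until_all_failed(issue_times, retry_timeout, max_attempts):
    """Return the time (in seconds) at which the last outstanding request fails.

    issue_times: times at which each request was first sent to the
    inoperative server. Each request is assumed to be attempted
    max_attempts times, waiting retry_timeout seconds per attempt.
    """
    per_request = retry_timeout * max_attempts
    return max(issue + per_request for issue in issue_times)


# Three requests sent at t = 0 s, 5 s, and 12 s, each with a 30-second
# timeout and four attempts: the client is not free of the failed server
# until the last request gives up.
print(time_until_all_failed([0, 5, 12], retry_timeout=30, max_attempts=4))  # 132
```

In this simplified model the delay is set by the most recently issued request, so a client that keeps issuing requests to the inoperative server keeps extending the time it spends waiting.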
In cluster file systems in particular, server node failures are traditionally handled by having network file system clients wait for their submitted RPC requests to time out and then retry the requests until the cluster server network address is taken over by a healthy node. This typically results in poor response times when a server node fails because, as noted above, the client may have to wait for several outstanding requests to the failed server node to time out. The client also has to wait for the cluster network address to be taken over by a healthy server node before a retried request can succeed. In addition, the client typically has to use a larger timeout value for its RPC requests, both to give the server sufficient time to reply and to avoid spurious timeouts that can further worsen the client's performance.
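The interaction between the timeout value and the address takeover can be sketched with a simplified model. The assumptions here are illustrative and not drawn from any specific cluster file system: the node fails at the moment the request is first sent, the client retries at fixed intervals of `retry_timeout` seconds, and the first retry sent at or after `takeover_time` succeeds immediately. The function name `failover_latency` is hypothetical.

```python
import math


def failover_latency(takeover_time, retry_timeout):
    """Return the time (in seconds) at which a retried request first succeeds.

    The request is sent at t = 0 and reissued every retry_timeout seconds.
    A healthy node takes over the cluster address at takeover_time, so the
    first attempt sent at or after that moment is the one that succeeds.
    """
    attempts_lost = math.ceil(takeover_time / retry_timeout)
    return attempts_lost * retry_timeout


# With a takeover that completes after 45 seconds:
print(failover_latency(45, retry_timeout=30))  # 60
print(failover_latency(45, retry_timeout=10))  # 50
```

Under this model the client cannot benefit from the takeover until its next timeout expires, so a larger timeout value, chosen to avoid spurious timeouts, also lengthens the delay the client observes after a failure.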