A storage system is a computer that provides storage services relating to the organization of information on writeable persistent storage devices, such as non-volatile memories and/or disks. The storage system typically includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of data containers, such as files and directories on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be realized as a specially formatted file in which information about other files and directories is stored.
The storage system may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files and directories stored on the system. In this model, the client may comprise an application executing on a computer that “connects” (i.e., via a client connection) to the storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. Each client may request the services of the storage system by issuing file system protocol messages or requests, such as the conventional Network File System (NFS) protocol requests, to the system over the client connection identifying one or more files to be accessed. In response, the file system executing on the storage system services the request and returns a reply to the client.
Broadly stated, the client connection is provided by a process of a transport layer, such as the Transmission Control Protocol (TCP) layer, of a protocol stack residing in the client and storage system. The TCP layer processes establish the client (TCP) connection in accordance with a conventional “3-way handshake” arrangement involving the exchange of TCP message or segment data structures. The resulting TCP connection is a reliable, securable logical circuit that is generally identified by port numbers and Internet Protocol (IP) addresses of the client and storage system. The TCP protocol and establishment of a TCP connection are well-known and described in Computer Networks, 3rd Edition, particularly at pgs. 521-542.
Many versions of the NFS protocol utilize reply caches for their operation. A reply cache may serve many purposes, one of which is to prevent re-execution (replay) of non-idempotent operations by identifying duplicate requests. By caching reply information for such operations, replies to duplicate requests may be rendered from cached information, as opposed to re-executing the operation with the file system. For example, assume a client issues an NFS request to the storage system, wherein the request contains a non-idempotent operation, such as a rename operation that renames, e.g., file A to file B. Assume further that the file system receives and processes the request, but the reply to the request is lost or the connection to the client is broken. A reply is thus not returned to the client and, as a result, the client resends the request. The file system then attempts to process the rename request again but, since file A has already been renamed to file B, the system returns a failure, e.g., an error reply, to the client (even though the operation renaming file A to file B had been successfully completed). A reply cache attempts to prevent such failures by recording the fact that the particular request was successfully executed, so that if the request is reissued for any reason, the same reply is resent to the client (instead of re-executing the previously executed request, which could result in an inappropriate error reply).
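By way of illustration only, the rename scenario described above may be sketched as follows. The class names, the ("client-1", 42) request identifier and the reply strings are hypothetical and are not taken from any NFS implementation; the sketch merely shows a duplicate request being answered from cached reply information rather than being re-executed.

```python
class ToyFileSystem:
    """Hypothetical in-memory stand-in for a file system namespace."""

    def __init__(self, names):
        self.names = set(names)

    def rename(self, old, new):
        # Non-idempotent: succeeds the first time, fails if repeated.
        if old not in self.names:
            return "ERR_NOENT"
        self.names.remove(old)
        self.names.add(new)
        return "OK"


class ReplyCache:
    """Maps a request identifier to the reply already sent for it."""

    def __init__(self):
        self.replies = {}

    def execute(self, request_id, operation):
        if request_id in self.replies:
            # Duplicate request: resend the recorded reply, do not re-execute.
            return self.replies[request_id]
        reply = operation()
        self.replies[request_id] = reply
        return reply


fs = ToyFileSystem({"A"})
cache = ReplyCache()
request_id = ("client-1", 42)  # assumed (client, transaction id) key

first = cache.execute(request_id, lambda: fs.rename("A", "B"))
retry = cache.execute(request_id, lambda: fs.rename("A", "B"))

# Without the cache, the same retransmission reaches the file system
# and yields the inappropriate error reply.
uncached_retry = fs.rename("A", "B")
```

In this sketch the retransmitted request receives the same successful reply as the original, whereas the uncached retry fails because file A no longer exists.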
Another purpose of the reply cache is to provide a performance improvement through work-avoidance by tracking “in-progress” requests. When using an unreliable transport protocol, such as the User Datagram Protocol (UDP), the client typically retransmits an NFS request if a response is not received from the storage system within a threshold (e.g., one second) after transmission of the initial NFS request. For an NFS request containing an idempotent operation having a large reply, such as a read or readdir operation, the actual processing of the request by the file system could exceed this threshold for retransmission. Such in-progress requests are tracked so that any duplicate requests received by the system are discarded (“dropped”) instead of processing duplicate file operations contained in the requests. This work-avoidance technique provides a noticeable performance improvement for the NFS protocol over the UDP protocol.
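The in-progress tracking described above may be sketched, under stated assumptions, as follows. The InProgressTracker class and its begin/finish methods are hypothetical names chosen for illustration; an actual protocol server would typically fold this state into its reply cache entries.

```python
class InProgressTracker:
    """Tracks requests the file system is still processing (illustrative)."""

    def __init__(self):
        self.in_progress = set()

    def begin(self, request_id):
        """Return True to process the request, False to drop a duplicate."""
        if request_id in self.in_progress:
            return False  # retransmission of a request still being serviced
        self.in_progress.add(request_id)
        return True

    def finish(self, request_id):
        # Called once the reply has been sent.
        self.in_progress.discard(request_id)


tracker = InProgressTracker()
rid = ("client-1", 7)  # assumed (client, transaction id) key

first_arrival = tracker.begin(rid)   # processed
duplicate = tracker.begin(rid)       # dropped while the read is in progress
tracker.finish(rid)
after_finish = tracker.begin(rid)    # idempotent, so re-execution is harmless
```

Because the tracked operations are idempotent, the sketch simply forgets a request once its reply is sent; only duplicates that arrive during processing are dropped.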
A known implementation of an NFS reply cache is described in a paper titled Improving the Performance and Correctness of an NFS Server, by Chet Juszczak, Winter 1989 USENIX Conference Proceedings, USENIX Association, Berkeley, Calif., February 1989, pgs. 53-63. Broadly stated, this implementation places reply cache entries into a “global least recently used (LRU)” data structure, i.e., a list ordered by a last modified time for each entry. In response to processing of a new NFS request from a client, a protocol server, e.g., an NFS server, executing on the storage system removes the oldest (thus, least recently used) entry from the list, clears its reply data and assigns the entry to the new request (thus invalidating the old cache entry). The reply cache implementation accords equal weight to all cached NFS replies, and cache management is predicated on maintaining a complete record of the most recent replies in the reply cache using an LRU algorithm.
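A minimal sketch of such a global LRU arrangement is given below. It is not the implementation described in the cited paper; the GlobalLRUReplyCache class, its capacity parameter and the request keys are assumptions made for illustration, and the ordered list is approximated with a standard ordered dictionary.

```python
from collections import OrderedDict


class GlobalLRUReplyCache:
    """One shared, fixed-size list of replies, oldest entry first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, request_id):
        # Returns the cached reply, or None on a cache miss.
        return self.entries.get(request_id)

    def record(self, request_id, reply):
        if request_id in self.entries:
            # Entry modified again: it becomes the most recent one.
            self.entries.move_to_end(request_id)
        elif len(self.entries) >= self.capacity:
            # Recycle the oldest entry, invalidating its cached reply.
            self.entries.popitem(last=False)
        self.entries[request_id] = reply


cache = GlobalLRUReplyCache(capacity=2)
cache.record(("client-1", 1), "OK")
cache.record(("client-2", 1), "OK")
cache.record(("client-3", 1), "OK")  # evicts the entry for client-1

evicted = cache.lookup(("client-1", 1))
kept = cache.lookup(("client-3", 1))
```

Note that every client shares the same list, so each entry carries equal weight regardless of which client issued the request.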
In general, clients utilizing the NFS protocol over the TCP protocol can retransmit NFS requests (if responses are not received from the storage system) a substantial period of time after transmission of their initial requests. Such long retransmit times often result in active clients “starving” slower/retransmitting clients of entries in the reply cache, such that it is unlikely that a retransmitted duplicate non-idempotent request (in a deployment using NFS over TCP) will be found in a global LRU reply cache. The ensuing cache miss results in a replay of the non-idempotent operation and, potentially, data corruption.
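The starvation effect may be illustrated with a small simulation. The function name, the client labels and the capacity and request counts are hypothetical; the sketch assumes a single shared LRU list in which every new request recycles the oldest entry once the list is full.

```python
from collections import OrderedDict


def retransmit_hits_cache(capacity, intervening_requests):
    """Return True if a slow client's reply survives until it retransmits."""
    cache = OrderedDict()  # shared list, oldest entry first
    slow_request = ("slow-client", 1)
    cache[slow_request] = "OK"  # reply to a non-idempotent request

    # A busy client issues new requests before the slow client retransmits.
    for xid in range(intervening_requests):
        if len(cache) >= capacity:
            cache.popitem(last=False)  # recycle the oldest entry
        cache[("busy-client", xid)] = "OK"

    return slow_request in cache


survives = retransmit_hits_cache(capacity=4, intervening_requests=3)
starved = retransmit_hits_cache(capacity=4, intervening_requests=4)
```

In this sketch the slow client's entry survives only while fewer new requests than the cache capacity arrive before the retransmission; a long retransmit delay combined with active clients therefore makes a cache miss likely.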