File system access often needs to be optimized for constrained bandwidth and/or high-latency network paths. Transaction accelerators are known, such as those described in McCanne I.
Originally called a transaction accelerator in McCanne I and McCanne III, devices that perform transaction acceleration have subsequently been called by various other terms including, but not limited to, WAN accelerators, WAN optimizers, WAN optimization controllers (WOCs), wide-area data services (WDS) appliances, WAN traffic optimizers (WTOs) and so forth. In recent times, transaction acceleration has also been referred to by the alternate names: transaction pipelining, protocol pipelining, request prediction, application flow acceleration, protocol acceleration, and so forth. Herein, the term “WAN accelerator” is used to refer to such devices.
Recently, the industry has adopted the term “data deduplication” to refer to some processes of eliminating redundant data for the purposes of storage or communication. For example, storage products have been introduced where a storage target for backup software performs data deduplication to reduce the amount of storage required for backups that exhibit high degrees of redundancy, as described in McCanne II. Likewise, network systems, e.g., as described in McCanne I, McCanne II and McCanne III, have been introduced that perform data deduplication (among other functions) to reduce the amount of bandwidth required to transmit data along a network path.
WAN accelerators perform a set of optimizations with the goal of making the performance of a networked application running over a wide-area network (WAN) as close as possible to the performance it would obtain running over a local-area network (LAN). LAN communication is characterized by generous bandwidths, low latencies and considerable enterprise control over the network. By contrast, WANs often have lower bandwidths and higher latencies than LANs, and often provide limited controls to the enterprise IT operator because WAN links often traverse networks that are tightly controlled by a service provider thereby preventing the enterprise IT operator from modifying or introducing components within the closed, service-provider network.
Wide-area client-server applications are a critical part of almost any large enterprise. A WAN might be used to provide access to widely used and critical infrastructure, such as file servers, mail servers and networked storage. This access most often has very poor throughput when compared to the performance across a LAN.
There is considerable literature on the general problem and proposed solutions, such as the literature listed here:
[Amer02] Amer et al., “File Access Prediction with Adjustable Accuracy”, In Proc. of the International Performance Conference on Computers and Communication, April 2002.
[Brewer00] E. Brewer, “Toward Robust Distributed Systems”, Invited talk at Principles of Distributed Computing, Portland, Oreg. July 2000.
[Feiertag71] Feiertag and Organick, “The Multics Input/Output System”, Third ACM Symposium on Operating System Principles, October 1971.
[Gilbert02] S. Gilbert and N. Lynch, “Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services”, SigAct News, June 2002.
[Griffioen94] Griffioen and Appleton, “Reducing File System Latency Using a Predictive Approach”, In Proc. of the 1994 Summer USENIX Conference, June 1994.
[Howard88] Howard, “An Overview of the Andrew File System”, In Proc. of the USENIX Winter Technical Conference, February 1988.
[Kroeger99] Kroeger and Long, “The Case for Efficient File Access Pattern Modeling”, In Proc. of the Seventh Workshop on Hot Topics in Operating Systems, March 1999.
[LBFS] A. Muthitacharoen, B. Chen, and D. Mazieres, “A Low-bandwidth Network File System”, in Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), pp. 174-187 (Chateau Lake Louise, Banff, Canada, October 2001) (in vol. 35, 5 of ACM SIGOPS Operating Systems Review, ACM Press).
[Lei97] Lei and Duchamp, “An Analytical Approach to File Prefetching”, In Proc. of the 1997 Annual USENIX Conference, January 1997.
[McKusick84] McKusick, et al., “A Fast File System for BSD”, ACM Transactions on Computer Systems, Vol. 2(3), 1984.
[Nelson88] Nelson, Welch, and Ousterhout, “Caching in the Sprite Network File System”, ACM Transactions on Computer Systems, 6(1), February, 1988.
[NFS] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, “Design and Implementation of the Sun Network File System”, In Proceedings of the Summer 1985 USENIX Conference, 1985.
[NFSv4] “NFSv4 Minor Version 1”, Internet Draft, draft-ietf-nfsv4-minorversion1-02.txt.
[Padmanabhan96] Padmanabhan and Mogul, “Using Predictive Prefetching to Improve World Wide Web Latency”, ACM SIGCOMM, Computer Communication Review 26(3), July 1996.
[Rand96] D. Rand, “The PPP Compression Control Protocol (CCP)”, Request-for-Comments 1962, June 1996.
[Rosenblum92] Rosenblum and Ousterhout, “The Design and Implementation of a Log-Structured File System”, ACM Transactions on Computer Systems, Vol. 10(1), 1992.
[Sandberg85] Sandberg et al., “Design and Implementation of the Sun Network Filesystem”, In Proc. of the Summer 1985 USENIX Conference, 1985.
[Spring00] Spring, N., et al., “A Protocol-Independent Technique for Eliminating Redundant Network Traffic”, in Proceedings of ACM SIGCOMM (August 2000).
[Tolia03] N. Tolia, M. Kozuch, M. Satyanarayanan, B. Karp, T. Bressoud, and A. Perrig, “Opportunistic Use of Content Addressable Storage for Distributed File Systems”, Proceedings of USENIX 2003.
Many applications and systems that operate well over high-speed connections need to be adapted to run on slower speed connections. For example, operating a file system over a LAN works well, but often files need to be accessed where a high-speed path, such as over a LAN, is not available along the entire path from the client needing access to the file to the file server serving the file. Similar design problems exist for other network services, such as e-mail services, Web services, computational services, multimedia, video conferencing, database querying, office collaboration, etc.
In a networked file system, for example, files used by applications in one place might be stored in another place. In a typical scenario, a number of users operating at computers networked throughout an organization and/or a geographic region share a file or sets of files that are stored in a file system. The file system might be near one of the users, but it is often remote from most of the users. Even when files are remote, users expect or would prefer that the files appear to be near their sites. Otherwise, slow performance for file access can negatively impact a user's work productivity as time is wasted waiting for files to save and load, or waiting for application pauses while an application performs file I/O, and so forth.
As used herein, “client” generally refers to a computer, computing device, peripheral, electronics, or the like, that makes a request for data or an action, while “server” generally refers to a computer, computing device, peripheral, electronics, or the like, that operates in response to requests for data or action made by one or more clients.
A request can be for operation of the computer, computing device, peripheral, electronics, or the like, and/or for an application being executed or controlled by the client. One example is a computer running a word processing program that needs a document stored externally to the computer and uses a network file system client to make one or more requests over a network to a file server. Another example is a request for an action directed at a server that itself performs the action, such as a print server, a processing server, a control server, an equipment interface server, an I/O (input/output) server, etc.
A request is often satisfied by a response message supplying the data requested or performing the action requested, or a response message indicating an inability to service the request, such as an error message or an alert to a monitoring system of a failed or improper request. A server might also block a request, forward a request, transform a request, or the like, and then respond to the request or not respond to the request.
In some instances, an entity normally thought of as a server can act as a client and make requests and an entity normally thought of as a client can act as a server and respond to requests. Furthermore, a single entity might be both a server and a client, for other servers/clients or for itself. For example, a desktop computer might be running a database client and a user interface for the database client. If the desktop computer user manipulated the database client to cause it to make a request for data, the database client would issue a request, presumably to a database server. If the database server were running on the same desktop computer, the desktop computer would be, in effect, making a request to itself. It should be understood that, as used herein, clients and servers are often distinct and separated by a network, physical distance, security measures and other barriers, but those are not required characteristics of clients and servers.
In some cases, clients and servers are not necessarily exclusive. For example, in a peer-to-peer network, one peer might make a request of another peer but might also serve responses to that peer. Therefore, it should be understood that while the terms “client” and “server” are typically used herein as the actors making “requests” and providing “responses”, respectively, those elements might take on other roles not clearly delineated by the client-server paradigm.
In general, communication over a network involves bidirectional exchange of data between two computing entities, where one entity is the client and initiates a transaction by opening a transport channel to another entity (the server). Typically, the client sends a request or set of requests via a set of network and transport protocols, and the request or requests are processed by the server, returning responses. Many protocols are connection-based, whereby the two cooperating entities (sometimes known as “hosts”) negotiate a communication session to begin the information exchange. In setting up a communication session, the client and the server might each maintain state information for the session, which may include information about the capabilities of each other. At some level, the session forms what is logically (or physically, in some cases) considered a “connection” between the client and server. Once the connection is established, communication between the client and server can proceed using state from the session establishment and other information and send messages between the client and the server, wherein a message is a data set comprising a plurality of bits in a sequence, possibly packaged as one or more packets according to an underlying network protocol. Typically, once the client and the server agree that the session is over, each side disposes of the state information for the connection or connections underlying the session, other than possibly saving log information or other similar historical data concerning the session or its prior existence.
To effect such communication, computing hosts make use of a set of networking protocols for exchanging information between the two computing hosts. Many networking protocols have been designed and deployed, with varying characteristics and capabilities. At the network layer, the Internet Protocol (IP) is ubiquitous and is responsible for routing packets from one end host to another. At the transport layer, the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) are two examples of protocols that are in common use today. TCP provides a reliable, connection-oriented service on top of the unreliable packet delivery service provided by IP. Various other networking protocols might also be used.
A message from a client to a server or vice-versa traverses one or more network “paths” connecting the client and server. A basic path would be a physical cable connecting the two hosts. More typically, a path involves a number of physical communication links and a number of intermediate devices (e.g., routers) that are able to transmit a packet along a correct path to the server, and transmit the response packets from the server back to the client. These intermediate devices typically do not modify the contents of a data packet; they simply pass the packet on in a correct direction. However, it is possible that a device that is in the network path between a client and a server could modify a data packet along the way.
A “transport proxy” or “proxy” is a device situated in the network that terminates a transport-level connection to perform various transformations or provide enhanced services to end hosts. The proxy is said to be “transparent”, as the term is used herein, with respect to a client-server connection if the packets received by the client contain the server's network address and the packets received by the server contain the client's network address, so that, to the end hosts, the packets received by the client appear to originate from the server even though they originate from the proxy, and vice versa.
As used herein, the terms “near”, “far”, “local” and “remote” might refer to physical distance, but more typically they refer to effective distance. The effective distance between two computers, computing devices, servers, clients, peripherals, etc. is, at least approximately, a measure of the difficulty of getting data between the two computers. For example, where file data is stored on a hard drive connected directly to a computer processor using that file data, and the connection is through a dedicated high-speed bus, the hard drive and the computer processor are effectively “near” each other, but where the traffic between the hard drive and the computer processor is over a slow bus, with more intervening events possible to waylay the data, the hard drive and the computer processor are said to be farther apart.
Greater and lesser physical distances need not correspond with greater and lesser effective distances. For example, a file server and a desktop computer separated by miles of high-quality and high-bandwidth fiber optics might have a smaller effective distance compared with a file server and a desktop computer separated by a few feet and coupled via a wireless connection in a noisy environment.
The two primary impediments to application protocol performance over a WAN are typically high delay (or latency) and limited bandwidth. The “bandwidth” of a network or channel refers to a measure of the number of bits that can be transmitted over a link or path per unit of time. “Latency” refers to a measure of the amount of time that transpires while the bits traverse the network, e.g., the time it takes a given bit transmitted from the sender to reach the destination. “Round-trip time” refers to the sum of the “source-to-destination” latency and the “destination-to-source” latency. If the underlying paths are asymmetric, the round-trip latency might be different from twice a one-way latency. The term “throughput” is sometimes confused with bandwidth but refers to a measure of an attained transfer rate that a client-server application, protocol, etc. achieves over a network path. Throughput is typically less than the available network bandwidth.
WAN network bandwidth limits almost always impact client-server application throughput across the WAN, but more bandwidth often can be obtained from a network service provider, usually at a cost. Lower latency, however, is not available from a network service provider at any price if, under the known laws of physics, it would require communications to occur at speeds faster than the speed of light. In some cases, network latency is the bottleneck on performance or throughput. This is often the case with window-based transport protocols such as TCP or a request-response protocol such as the Common Internet File System (CIFS) protocol or the Network File System (NFS) protocol. High network latency particularly slows down “chatty” applications, even if the actual amounts of data transmitted in each transaction are not large. “Chatty” applications are those in which client-server interactions involve many back-and-forth steps that might not even depend on each other. Adding bandwidth (or compressing data) does not improve the throughput of these protocols/applications once the round-trip time exceeds some critical point; beyond that point, throughput decays quickly as latency grows.
This phenomenon can be understood intuitively: the rate of work that can be performed by a client-server application that executes serialized steps to accomplish its tasks is inversely proportional to the round-trip time between the client and the server. If the client-server application is bottlenecked in a serialized computation (i.e., it is “chatty”), then increasing the round-trip time by a factor of two causes the throughput to decrease by a factor of two because it takes twice as long to perform each step (while the client waits for the server and vice versa).
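The relationship between round-trip time and throughput described above can be illustrated with a small calculation. The following is a minimal sketch, not a model of any particular protocol: the function names, the workload of 1,000 dependent steps, and the link speeds and latencies are all assumptions chosen only for illustration.

```python
def serialized_task_time(steps, rtt_s, total_bytes, bandwidth_bps):
    """Time for a "chatty" task: every step waits one round trip,
    plus the time needed to push the payload bits through the link."""
    return steps * rtt_s + (total_bytes * 8) / bandwidth_bps


def window_limited_throughput_bps(window_bytes, rtt_s, bandwidth_bps):
    """A window-based sender can have at most one window of data
    in flight per round trip, regardless of the link bandwidth."""
    return min(bandwidth_bps, (window_bytes * 8) / rtt_s)


# Illustrative numbers only: 1,000 dependent steps moving 1 MB in total.
STEPS, TOTAL_BYTES = 1000, 1_000_000
lan = serialized_task_time(STEPS, 0.001, TOTAL_BYTES, 100e6)
wan = serialized_task_time(STEPS, 0.100, TOTAL_BYTES, 1.5e6)
wan_double_bandwidth = serialized_task_time(STEPS, 0.100, TOTAL_BYTES, 3.0e6)
wan_double_rtt = serialized_task_time(STEPS, 0.200, TOTAL_BYTES, 1.5e6)

for label, seconds in [("LAN", lan), ("WAN", wan),
                       ("WAN, 2x bandwidth", wan_double_bandwidth),
                       ("WAN, 2x round-trip time", wan_double_rtt)]:
    print(f"{label}: {seconds:.2f} s")
```

With these assumed numbers, doubling the WAN bandwidth shortens the task by only a few percent, while doubling the round-trip time nearly doubles it, which is the behavior the text attributes to latency-bound applications.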
Given the high costs and performance challenges of WAN-based enterprise computing and communication, many approaches have been proposed for dealing with these problems.
Perhaps the simplest approach to dealing with performance is to simply upgrade the available bandwidth in the network. Of course this is the most direct solution, but it is not always the most effective approach. First of all, contrary to popular belief, bandwidth is not free and the costs add up quickly for large enterprises that may have hundreds of offices. Second, as described above, adding bandwidth does not necessarily improve throughput. Third, in some places adding more bandwidth is not possible, especially across international sites, in remote areas, or where it is simply too expensive to justify.
Another approach involves data compression. Compression is a process of representing one set of data with another set of data wherein the second set of data is, on average, a smaller number of bits than the first set of data, such that the first set of data, or at least a sufficient approximation of the first set of data, can be recovered from an inverse of the compression process in most cases. Compression allows for more efficient use of a limited bandwidth and might result in less latency, but in some cases, no latency improvement occurs. In some cases, compression might add to the latency, if time is needed to compress data after the request is made and time is needed to decompress the data after it is received. This can be improved if the data is compressed ahead of time, before the request is made, but that may not be feasible if the data is not necessarily available ahead of time for compression, or if the volume of data from which the request will be served is too large relative to the amount of data likely to be used.
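As a concrete illustration of lossless compression and its inverse, the following minimal sketch uses Python's standard `zlib` module. The sample payload is a made-up, highly repetitive byte string, so the ratio it produces says nothing about real traffic; the random payload is included only to show that data without redundancy does not shrink.

```python
import os
import zlib

# Hypothetical, highly redundant payload (an assumption for illustration).
redundant = b"GET /reports/q1.html HTTP/1.1\r\nHost: example.test\r\n\r\n" * 50

compressed = zlib.compress(redundant, 6)
restored = zlib.decompress(compressed)  # inverse of the compression step

# Random bytes have essentially no redundancy for the compressor to exploit.
random_data = os.urandom(4096)
compressed_random = zlib.compress(random_data, 6)

print("redundant:", len(redundant), "->", len(compressed), "bytes")
print("random:   ", len(random_data), "->", len(compressed_random), "bytes")
```

Both the compress and decompress calls take time on the request path unless the data was compressed in advance, which is the latency cost the paragraph above describes.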
One way to deploy compression is to embed it in applications. For example, a Web server can compress the HTML pages it returns before delivering them across the network to end clients. This function has also been implemented within application delivery controllers that sit in front of Web servers to provide connection optimization, compression, and so forth.
Another approach is to deploy compression in the network without having to modify the applications. For many years, network devices have included compression options as features (e.g., in routers, modems, dedicated compression devices, etc.). See, for example, [Rand96]. Prior to the advent of WAN accelerators, such as those shown in McCanne I and McCanne III, some vendors had developed dedicated network compression devices that compressed IP packets at the network layer in an attempt to enhance network performance. This is a reasonable thing to do, but the effectiveness is limited. When performance is limited by network latency, compressing the underlying data will have little or no impact on application performance. Moreover, most methods of lossless data compression typically reduce the amount of data (i.e., bandwidth) by a factor of 1.5 to 4, depending on the inherent redundancy present. While helpful, this reduction is not enough to dramatically change performance if the amount of data being sent is large or similar data is sent repeatedly, perhaps over longer time scales.
Another approach to network compression is suggested by [Spring00]. As described in that reference, network packets that are similar to recently transmitted packets can be reduced in size by identifying repeated strings and replacing the repeated strings with tokens to be resolved from a shared packet cache at either end of a network link. This approach, while beneficial, has a number of shortcomings. Because it operates solely on individual packets, the performance gains that accrue are limited by the ratio of the packet payload size to the packet header (since the packet header is generally not compressible using the described technique). Also, because the mechanism is implemented at the packet level, it only applies to regions of the network where two ends of a communicating path have been configured with the device. This configuration can be difficult to achieve, and may be impractical in certain environments. Also, by caching network packets using a relatively small memory-based cache with a first-in first-out replacement policy (without the aid of, for instance, a large disk-based backing store), the efficacy of the approach is limited to detecting and exploiting communication redundancies that are fairly localized in time. Finally, because this approach has no ties into the applications or servers that generate the (redundant) network traffic, there is no ability to anticipate where data might be used and pre-stage that data in the far-end cache providing potential further acceleration and optimization of network traffic.
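The token-substitution idea and the effect of a small first-in first-out cache can be sketched as follows. This is a deliberately simplified illustration and not the mechanism of [Spring00]: it uses fixed-size chunks and SHA-1 digests as tokens, and the names `PacketCache`, `encode`, `decode` and the constant `CHUNK` are hypothetical.

```python
import hashlib
from collections import OrderedDict

CHUNK = 64  # bytes; fixed-size chunking is a simplifying assumption


class PacketCache:
    """Small FIFO cache of previously seen chunks, keyed by digest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # insertion order doubles as FIFO order

    def add(self, chunk):
        digest = hashlib.sha1(chunk).digest()
        if digest not in self.store:
            self.store[digest] = chunk
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # evict the oldest entry
        return digest


def encode(payload, cache):
    """Sender side: replace chunks already in the cache with tokens."""
    out = []
    for i in range(0, len(payload), CHUNK):
        chunk = payload[i:i + CHUNK]
        digest = hashlib.sha1(chunk).digest()
        if digest in cache.store:
            out.append(("token", digest))
        else:
            out.append(("literal", chunk))
            cache.add(chunk)
    return out


def decode(encoded, cache):
    """Receiver side: resolve tokens from a cache kept in step with the sender."""
    parts = []
    for kind, value in encoded:
        if kind == "token":
            parts.append(cache.store[value])
        else:
            cache.add(value)
            parts.append(value)
    return b"".join(parts)
```

Because both ends add and evict entries in the same order, their caches stay synchronized. Once newer traffic pushes a chunk out of the small cache, a later repeat of that chunk must be sent as a literal again, which is the time-locality limitation noted above.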
Rather than compress the data, another approach to working around WAN bottlenecks is to replicate servers and serve data from local servers for quick access. This approach in particular addresses the network latency problem because a client in a remote site can now interact with a local server rather than a remote server. There are several methods available to enterprises to store redundant copies of data in replicated file systems, redundant or local storage servers, or by using any number of distributed file systems. The challenge with this kind of approach is the basic problem of managing the ever-exploding amount of data, which requires scaling up storage, application and file servers in many places, and trying to make sure that the files people need are indeed available where and when they are needed. Moreover, these approaches are generally non-transparent, meaning the clients and servers must be modified to implement and interact with the agents and/or devices that perform the replication function. For example, if a file server is replicated to a remote branch, the server must be configured to send updates to the replica and certain clients must be configured to interact with the replica while others need to be configured to interact with the original server.
Rather than replicate servers, another approach is to deploy transport-level “proxies”, which function as performance-enhancing intermediaries between the client and the server. In this case, a proxy is the terminus for the client connection and initiates another connection to the server on behalf of the client. In some cases, the proxy may operate solely at the transport layer, while in other cases, the proxy may possess application-layer knowledge and use this knowledge to transform the application-level payloads to enhance end-to-end performance. In addition, the proxy may connect to one or more other proxies that in turn connect to the server and these cascaded proxies may cooperate with one another to perform optimization. Each proxy may forward, modify, or otherwise transform the transactions as they flow from the client to the server and vice versa. Examples of proxies include (1) Web proxies that enhance performance through caching or enhance security by controlling access to servers, (2) mail relays that forward mail from a client to another mail server, (3) DNS relays that cache DNS name resolutions, (4) WAN accelerators, and so forth.
One problem that must be overcome when deploying proxies is that of directing client requests to the proxy instead of to the destination server. One mechanism for accomplishing this is to configure each client host or process with the network address information of the proxy. This requires that the client application have an explicit proxy capability, whereby the client can be configured to direct requests to the proxy instead of to the server. In addition, this type of deployment requires that all clients must be explicitly configured and that can be an administrative burden on a network administrator.
One way around the problems of explicit proxy configuration is to deploy a proxy in a transparent configuration. The presence of the transparent proxy is not made explicitly known to the client process, so all client requests proceed along the network path towards the server as they would have if there were no transparent proxy. This might be done by placing the transparent proxy host in the network path between the client and the server. A layer-4 switch can then be employed so the proxy host can intercept client connections and handle the requests via the proxy. For example, the layer-4 switch could be configured so that all Web connections (i.e., TCP connections on port 80) are routed to a local proxy process. The local proxy process can then perform operations on behalf of the server. For example, the local proxy process could respond to the request using information from its local cache. When intercepting the connection, the layer-4 switch performs network address translation (“NAT”) so the connection appears to the client as having been terminated at the origin server, even though the client communicates directly with the proxy. In this manner, the benefits of a proxy can be realized without the need for explicit client configuration.
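The interception and address translation performed by the layer-4 switch can be modeled in a few lines. This is only a sketch of the bookkeeping involved, under stated assumptions: the `Segment` and `Layer4Switch` names, the proxy endpoint, and the addresses used are hypothetical, and no real packets are handled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    src: tuple  # (ip, port) of the sender
    dst: tuple  # (ip, port) of the intended receiver


PROXY = ("10.0.0.5", 3128)  # hypothetical local proxy endpoint
INTERCEPT_PORTS = {80}      # Web connections, as in the example above


class Layer4Switch:
    def __init__(self):
        self.nat = {}  # client endpoint -> server endpoint it asked for

    def from_client(self, seg):
        """Redirect matching connections to the proxy, remembering the server."""
        if seg.dst[1] in INTERCEPT_PORTS:
            self.nat[seg.src] = seg.dst
            return replace(seg, dst=PROXY)
        return seg

    def to_client(self, seg):
        """Rewrite the proxy's replies so they appear to come from the server."""
        if seg.src == PROXY and seg.dst in self.nat:
            return replace(seg, src=self.nat[seg.dst])
        return seg
```

In this model the client only ever sees the origin server's address, so no client-side configuration is needed, while every intercepted request is in fact answered by the proxy.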
Some benefits of a transparent proxy require that a proxy pair exists in the network path, as is common with WAN accelerators. For example, if a proxy is used to transform data from a particular client-server connection in some way, a second proxy preferably untransforms the data. For example, where a connection between a client and a server is to be processed, a proxy near the client would transform the received transport payloads before sending said payloads to a proxy situated near the server, which would untransform the payload and send the original data on to the server, thereby providing for transparent transformation of data flowing between the client and the server.
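A proxy pair of this kind can be sketched with any reversible transformation. The example below uses `zlib` compression as a stand-in for the transformation; the class names and the `peer_present` flag are hypothetical, and the choice to pass data through untouched when no peer is known is a conservative assumption of this sketch.

```python
import zlib


class ClientSideProxy:
    """Transforms payloads headed for the WAN, but only when a peer can undo it."""

    def __init__(self, peer_present):
        self.peer_present = peer_present

    def to_wan(self, payload: bytes) -> bytes:
        if not self.peer_present:
            return payload  # no peer: forward the original bytes unchanged
        return zlib.compress(payload)


class ServerSideProxy:
    """Untransforms payloads so the server receives the original data."""

    def to_server(self, wan_payload: bytes) -> bytes:
        return zlib.decompress(wan_payload)


request = b"READ /shared/budget.xls offset=0 length=65536\n" * 20
near_client = ClientSideProxy(peer_present=True)
near_server = ServerSideProxy()
delivered = near_server.to_server(near_client.to_wan(request))
```

Neither the client nor the server takes part in the transformation; each sees only the original bytes.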
For actions that require a proxy pair, preferably both proxies in the proxy pair do not perform a transformation unless they can be assured of the existence and operation of the other proxy in the proxy pair. Where each proxy must be explicitly configured with indications of the pairs to which it belongs and with the identity of the other members of those pairs, the administrative burden on a network administrator might well make some operations infeasible if they require proxy pairs. Even where a proxy is interposed in a network and gets all of the traffic from a client or server, it still must discover the other member for each proxy pair the proxy needs, if the proxy is to perform actions that require proxy pairs. As such, a proxy might be inserted in the network path between client and server via multiple network ports as described in Demmer I. As described there, the interception of network packets can be carried out transparently without the need for a layer-4 switch or some other method of network-based packet interception.
With a proxy situated between the client and server, the performance impairments of network latency can be addressed by having the proxy cache data. A proxy that performs this function has been called a “network cache”. Generally speaking, caching is a common technique in computer system design and involves a process of storing previously transmitted or obtained results in the hopes that an entity will request the results again and receive a response more quickly from the cache than if the results had to come from the original provider. Caching also provides some help in mitigating both latency and bandwidth bottlenecks, but in some situations it does not help much. For example, where a single processor is retrieving data from memory it controls and does so in a repetitive fashion, as might be the case when reading processor instructions from memory, caching can greatly speed a processor's tasks. Similarly, file systems have employed caching mechanisms to store recently accessed disk blocks in host memory so that subsequent accesses to cached blocks are completed much faster than reading them in from disk again as in [McKusick84], [Rosenblum92], etc. In a typical cache arrangement, a requester requests data from some memory, device or the like and the results are provided to the requester and stored in a cache having a faster response time than the original device supplying the data. Then, when the requester requests that data again, if it is still in the cache, the cache can return the data in response to the request before the original device could have returned it and the request is satisfied that much sooner.
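The typical cache arrangement described above can be sketched as a read-through cache placed in front of a slower origin. The `Origin` and `ReadCache` names are hypothetical, and a counter of origin reads stands in for the time a real remote fetch would take.

```python
class Origin:
    """Stands in for the slow original provider; counts how often it is asked."""

    def __init__(self, data):
        self.data = dict(data)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]


class ReadCache:
    """Serves repeated requests locally and goes to the origin only on a miss."""

    def __init__(self, origin):
        self.origin = origin
        self.entries = {}

    def get(self, key):
        if key not in self.entries:  # miss: fetch once and remember the result
            self.entries[key] = self.origin.read(key)
        return self.entries[key]     # hit: no trip to the origin


origin = Origin({"block-7": b"file data"})
cache = ReadCache(origin)
first = cache.get("block-7")   # fetched from the origin
second = cache.get("block-7")  # answered from the cache
print("origin reads:", origin.reads)
```

Only the first request pays the cost of reaching the origin. The sketch never evicts or revalidates entries, which is acceptable only while the origin data does not change.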
Caching has its difficulties, one of which is that the data might change at the source and the cache would then be supplying “stale” data to the requester. This is the “cache consistency” problem. Because of this, caches are often “read only”, requiring that changes to data be transmitted through the cache back to the source in a “write-through” fashion and that other caches that consequently end up with “stale” data be invalidated. Another problem with caching is that the original source of the data might want to track usage of data and would not be aware of uses that were served from the cache as opposed to from the original source. For example, where a Web server is remote from a number of computers running Web browsers that are “pointed to” that Web server, the Web browsers might cache Web pages from that site as they are viewed, to avoid delays that might occur in downloading the Web page again. While this would improve performance in many cases, and reduce the load on the Web server, the Web server operator might try to track the total number of “page views” but would be ignorant of those served by the cache. In some cases, an Internet service provider might operate the cache remote from the browsers and provide cached content for a large number of browsers, so a Web server operator might even miss unique users entirely.
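The write-through and invalidation behavior described above can be sketched as follows. The `Source` and `WriteThroughCache` classes are hypothetical, and the sketch assumes the source can reach every cache synchronously, which a real network would not guarantee.

```python
class Source:
    """Authoritative copy of the data; knows which caches to invalidate."""

    def __init__(self):
        self.data = {}
        self.caches = []

    def write(self, key, value, writer=None):
        self.data[key] = value
        for cache in self.caches:
            if cache is not writer:
                cache.invalidate(key)  # other caches would now be stale


class WriteThroughCache:
    def __init__(self, source):
        self.source = source
        self.entries = {}
        source.caches.append(self)

    def read(self, key):
        if key not in self.entries:
            self.entries[key] = self.source.data[key]
        return self.entries[key]

    def write(self, key, value):
        # Write-through: the change reaches the source before the write completes.
        self.source.write(key, value, writer=self)
        self.entries[key] = value

    def invalidate(self, key):
        self.entries.pop(key, None)
```

Without the invalidation step in `Source.write`, a second cache that had already read the key would keep returning the old value, which is the stale-data problem named above.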
Additionally, the mechanism underlying Web caching provides only a loose model for consistency between the origin data and the cached data. Generally, Web data is cached for a period of time based on heuristics or hints in the transactions independent of changes to the origin data. This means that cached Web data can occasionally become inconsistent with the origin server and such inconsistencies are simply tolerated by Web site operators, service providers, and users as a reasonable performance trade-off. Unfortunately, this model of loose consistency is entirely inappropriate for general client-server communication like networked file systems. When a client interacts with a file server, the consistency model must be wholly correct and accurate to ensure proper operation of the application using the file system.
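The loose consistency of time-based caching can be illustrated with a small sketch. The `TtlCache` class is hypothetical, the fixed lifetime stands in for the heuristics or hints mentioned above, and the current time is passed in explicitly so the behavior is easy to follow.

```python
class TtlCache:
    """Caches each response for a fixed lifetime, regardless of origin changes."""

    def __init__(self, fetch, ttl_s):
        self.fetch = fetch   # callable that retrieves a URL from the origin
        self.ttl_s = ttl_s
        self.entries = {}    # url -> (body, expiry time)

    def get(self, url, now):
        hit = self.entries.get(url)
        if hit is not None and now < hit[1]:
            return hit[0]    # still "fresh", even if the origin has changed
        body = self.fetch(url)
        self.entries[url] = (body, now + self.ttl_s)
        return body


origin = {"/index.html": "version 1"}
cache = TtlCache(origin.__getitem__, ttl_s=60)

at_0 = cache.get("/index.html", now=0)
origin["/index.html"] = "version 2"      # the origin changes
at_30 = cache.get("/index.html", now=30)  # cache is unaware of the change
at_61 = cache.get("/index.html", now=61)  # lifetime expired, refetched
```

Between the change and the expiry the cache returns outdated content. That may be tolerable for a Web page, but a file system client that read the old bytes in that interval could corrupt an application's data.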
Where loose consistency can be tolerated, caching can work remarkably well. For example, the Domain Name System (DNS), dating back to the early 1980's, employs caching extensively to provide performance and scalability across the wide area. In this context, providing only loose consistency semantics has proven adequate. In DNS, each “name server” manages a stored dataset that represents so-called “resource records” (RR). While DNS is most commonly used to store and manage the mappings from host names to host addresses in the Internet (and vice versa), the original DNS design and its specification allow resource records to contain arbitrary data. In this model, clients send queries to servers to retrieve data from the stored data set managed by a particular server. Clients can also send queries to relays, which act as proxies and cache portions of master name servers' stored datasets. A query can be “recursive”, which causes the relay to recursively perform the query on behalf of the client. In turn, the relay can communicate with another relay and so forth until the master server is ultimately contacted. If any relay on the path from the client to the server has data in its cache that would satisfy the request, then it can return that data back to the requester.
Some solutions to network responsiveness deal with the problem at the file system or at network layers. One proposed solution is the use of a low-bandwidth network file system, such as that described in [LBFS]. In that system, called LBFS, clients employ “whole file” client-side caching whereby upon a file open operation, the client fetches all the data in the file from the server, then operates on the locally cached copy of the file data. If the client makes changes to the file, those changes are propagated back to the server when the client closes the file. To optimize these transfers, LBFS replaces pieces of the file with hashes when transmitting file data over the network and the recipient uses the hashes in conjunction with a local database to resolve the hashes to the original portions of the file. Such systems have limitations in that they are tied to file systems and generally require modification of the clients and servers between which responsiveness is to be improved. Furthermore, the hashing scheme operates over blocks of relatively large (average) size, which works poorly when files are subject to fine-grained changes over time. Finally, the scope of the LBFS design is limited to a network file system protocol whereby the LBFS hashing scheme is not shared with other data or network services.
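The hash-substitution idea described above can be illustrated with a minimal sketch. It is not the LBFS implementation: the fixed piece size, the `encode` and `decode` function names, and the assumption that the sender tracks which hashes the recipient already holds are all hypothetical simplifications made for illustration.

```python
import hashlib

PIECE_SIZE = 8  # hypothetical piece size, kept tiny so the example is easy to follow


def digest(piece: bytes) -> bytes:
    """Hash that stands in for a piece of file data on the wire."""
    return hashlib.sha256(piece).digest()


def encode(data: bytes, receiver_known: set[bytes]) -> list[tuple[str, bytes]]:
    """Sender side: send a hash for pieces the recipient is assumed to hold,
    and the literal bytes for pieces it has not seen before."""
    messages = []
    for start in range(0, len(data), PIECE_SIZE):
        piece = data[start:start + PIECE_SIZE]
        h = digest(piece)
        if h in receiver_known:
            messages.append(("hash", h))
        else:
            messages.append(("data", piece))
            receiver_known.add(h)  # assumption: sender remembers what it has sent
    return messages


def decode(messages: list[tuple[str, bytes]], database: dict[bytes, bytes]) -> bytes:
    """Recipient side: resolve hashes through the local database and
    record any literal pieces so later transfers can refer to them."""
    out = bytearray()
    for kind, payload in messages:
        if kind == "hash":
            out += database[payload]
        else:
            database[digest(payload)] = payload
            out += payload
    return bytes(out)
```

In this sketch, sending a second version of a file in which only one piece changed transmits one literal piece and a hash for each unchanged piece. A change that shifts the position of every later byte would defeat the fixed-size pieces used here, which is one reason the sketch should not be read as a description of how an actual system chooses piece boundaries.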
In a scheme that builds on some of the concepts of LBFS, [Tolia03] proposed the CASPER distributed file system. CASPER uses “file recipes” to represent the data of each file, where a file recipe is a list of hashes of the sequences of bytes that represent the file. When a client opens a file, a proxy at the client fetches the file recipe from a “recipe server” that is co-located with the file server. The client then attempts to reconstruct the file data by issuing requests to a local server that can resolve a recipe hash to its corresponding data. Any recipe hashes that cannot be resolved locally are sent to the recipe server for resolution. The approach, however, does not provide details for how client writes would be handled and defers such mechanism to future work. Also, like LBFS, the scope of the CASPER design is limited to a network file system protocol whereby the hashes used in file recipes are not shared with other data or network services. Also, while file recipes reduce the amount of network traffic required to serve files, there is no deduplication of the data that comprises the files on the server and, in fact, storage requirements are increased because each file's recipe must be stored in addition to each file's data. The CASPER design is also limited to a configuration wherein whole-file caching is carried out on the client system and the client-side file system is configured explicitly to intercommunicate with the local CASPER implementation. Moreover, CASPER has no mechanisms to optimize other operations like directory lookups, directory modifications, file creation and deletion, and so forth.
Cache consistency in the context of network file systems has been studied. The primary challenge is to provide a consistent view of a file to multiple clients when these clients read and write the file concurrently. When multiple clients access a file for reading and at least one client accesses the same file for writing, a condition called “concurrent write sharing” occurs and measures must be taken to guarantee that reading clients do not access stale data after a writing client updates the file.
In the original Network File System (NFS) [Sandberg85], caching is used to store disk blocks that were accessed across the network sometime in the past. An agent at the client maintains a cache of file system blocks and, to provide consistency, their last modification time. Whenever the client reads a block, the agent at the client performs a check to determine if the requested block is in its local cache. If it is and the last modification time is less than some configurable parameter (to provide a medium level of time-based consistency), then the block is returned to the client by the agent. If the modification time is greater than the parameter, then the last-modification time for the file is fetched from the server. If that time is the same as the last modification time of the data in the cache, then the request is returned from the cache. Otherwise, the file has been modified so all blocks of that file present in the local cache are flushed and the read request is sent to the server. To provide tighter consistency semantics, NFS can employ locking via the NFS Lock Manager (NLM). Under this configuration, when the agent at the client detects the locking condition, it disables caching and thus forces all requests to be serviced at the server, thereby ensuring strong consistency.
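The time-based validation described above can be sketched as follows. This is an illustration only, not NFS client code: the `FakeServer` and `ClientAgent` classes are hypothetical, and the sketch assumes that the comparison against the configurable parameter refers to the time elapsed since the agent last validated the file with the server (`max_age` below).

```python
class FakeServer:
    """Hypothetical stand-in for a file server; counts the requests it receives."""

    def __init__(self):
        self.blocks = {}   # (path, block_no) -> bytes
        self.mtime = {}    # path -> last modification time
        self.requests = 0

    def write(self, path, block_no, data, now):
        self.blocks[(path, block_no)] = data
        self.mtime[path] = now

    def get_mtime(self, path):
        self.requests += 1
        return self.mtime[path]

    def read(self, path, block_no):
        self.requests += 1
        return self.blocks[(path, block_no)], self.mtime[path]


class ClientAgent:
    """Client-side block cache with time-based validation (assumed semantics)."""

    def __init__(self, server, max_age):
        self.server = server
        self.max_age = max_age   # the configurable parameter
        self.blocks = {}         # (path, block_no) -> bytes
        self.mtime = {}          # path -> modification time of the cached data
        self.validated = {}      # path -> time the file was last checked

    def _flush(self, path):
        self.blocks = {k: v for k, v in self.blocks.items() if k[0] != path}

    def read(self, path, block_no, now):
        key = (path, block_no)
        if key in self.blocks:
            if now - self.validated[path] < self.max_age:
                return self.blocks[key]          # recently validated: no network traffic
            server_mtime = self.server.get_mtime(path)
            self.validated[path] = now
            if server_mtime == self.mtime[path]:
                return self.blocks[key]          # unchanged at the server
            self._flush(path)                    # file was modified: drop all its blocks
        data, mtime = self.server.read(path, block_no)
        if self.mtime.get(path) != mtime:
            self._flush(path)                    # other cached blocks of the file are stale
        self.blocks[key] = data
        self.mtime[path] = mtime
        self.validated[path] = now
        return data
```

Under these assumptions, a read that arrives within `max_age` of the last validation can return stale data if the file changed at the server in the meantime, which is the “medium level” of consistency referred to above.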
When blocks are not present in the local cache, NFS attempts to combat latency with the well-known “read-ahead” algorithm, which dates back to at least the early 1970's as it was employed in the Multics I/O System [Feiertag71]. The read-ahead algorithm exploits the observation that file-system clients often sequentially read large sets of contiguous blocks from a file. That is, when a client accesses block k, it is likely in the future to access block k+1. In read-ahead, a process or agent fetches blocks ahead of the client's request and stores those blocks in the cache in anticipation of the client's forthcoming request. In this fashion, NFS can mask the latency of fetching blocks from a server when the read-ahead turns out to successfully predict the client read patterns. Read-ahead is widely deployed in modern file systems.
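A minimal sketch of read-ahead follows. The `ReadAheadCache` class, its `depth` parameter, and the `fetch_block` callable are hypothetical names chosen for illustration. For simplicity, the prefetch here runs synchronously after each read, whereas a practical agent would typically issue it in the background so that the client does not wait for it.

```python
class ReadAheadCache:
    """Block cache that fetches the next `depth` blocks after every read."""

    def __init__(self, fetch_block, num_blocks, depth=2):
        self.fetch_block = fetch_block  # callable: block number -> bytes (a remote fetch)
        self.num_blocks = num_blocks    # number of blocks in the file
        self.depth = depth              # how many blocks to fetch ahead of the client
        self.cache = {}
        self.misses = 0                 # reads for which the client had to wait

    def read(self, k):
        if k not in self.cache:
            self.misses += 1
            self.cache[k] = self.fetch_block(k)
        # Anticipate requests for blocks k+1 .. k+depth, staying within the file.
        last = min(k + self.depth, self.num_blocks - 1)
        for n in range(k + 1, last + 1):
            if n not in self.cache:
                self.cache[n] = self.fetch_block(n)
        return self.cache[k]
```

When a client reads blocks 0 through 4 in order with `depth=2`, only the read of block 0 counts as a miss in this sketch; every later block is already in the cache when it is requested. A client with a random access pattern would see no such benefit and would cause blocks to be fetched that are never used.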
In the Andrew File System (AFS) [Howard88], “whole-file” caching is used instead of block-based caching to enhance file system performance over a network. Here, when a client opens a file, an agent at the client checks to see if the file is resident in its local disk cache. If it is, it checks with the server to see if the cached file is valid (i.e., that there have not been any modifications since the file was cached). If not (or if the file was not in the cache to begin with), a new version of the file is fetched from the server and stored in the cache. All client file activity is then intercepted by the agent at the client and operations are performed on the cached copy of the file. When the client closes the file, any modifications are written back to the server. This approach provides only “close-to-open” consistency because changes by multiple clients to the same file are only serialized and written back to the server on each file close operation.
Another mechanism called “opportunistic locking” is employed by the Server Message Block (SMB) Protocol, now called CIFS, to provide consistency. In this approach, when a file is opened, the client (or client agent) can request an opportunistic lock or oplock associated with the file. If the server grants the oplock, then the client can assume no modifications will be made to the file during the time the lock is held. If another client attempts to open the file for writing (i.e., concurrent write sharing arises), then the server breaks the oplock previously granted to the first client, then grants the second client write access to the file. Given this condition, the first client is forced to send all reads to the server for the files for which it does not hold an oplock. A similar mechanism was employed in the Sprite distributed file system, where the server would notify all relevant clients when it detected concurrent write sharing [Nelson88].
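The grant-and-break behavior described above can be sketched as follows. The `OplockServer` and `Client` classes are hypothetical and do not reproduce the SMB/CIFS wire protocol. The sketch also assumes, as a simplification, that the second client is not itself granted an oplock and that a read-only open by another client leaves the existing oplock in place.

```python
class Client:
    """Hypothetical client that may cache a file only while it holds an oplock."""

    def __init__(self, name):
        self.name = name
        self.oplocks = set()   # paths for which this client holds an oplock
        self.cache = {}        # path -> locally cached file data

    def break_oplock(self, path):
        # Called by the server: discard the lock and anything cached under it.
        self.oplocks.discard(path)
        self.cache.pop(path, None)

    def can_cache(self, path):
        return path in self.oplocks


class OplockServer:
    """Hypothetical server that tracks at most one oplock holder per file."""

    def __init__(self):
        self.holder = {}       # path -> Client currently holding the oplock

    def open(self, client, path, for_write=False):
        """Open `path`; return True if an oplock was granted to `client`."""
        current = self.holder.get(path)
        if current is None:
            self.holder[path] = client
            client.oplocks.add(path)
            return True
        if current is not client and for_write:
            # Concurrent write sharing: revoke the first client's oplock.
            current.break_oplock(path)
            del self.holder[path]
        return current is client
```

After the break, `can_cache` returns False for the first client, which corresponds to that client having to send its reads for the file to the server.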
When consistency mechanisms are combined with network caching, a great deal of complexity arises. For example, if a data caching architecture such as that used by DNS or the Web were applied to file systems, it would have to include a consistency protocol that could manage concurrent write sharing conditions when they arise. In this model, each node, or network cache, in the system contains a cache of file data that can be accessed by different clients. The file data in the cache is indexed by file identification information, relating the data in the cache to the server and file it came from. Just like NFS, a cache could enhance performance in certain cases by using read-ahead to retrieve file data ahead of a client's request and storing said retrieved data in the cache. Upon detecting when concurrent write sharing occurs, such a system could force all reads and writes to be synchronized at a single caching node, thereby assuring consistency. This approach is burdened by a great deal of complexity in managing consistency across all the caches in the system. Moreover, the system's concurrency model assumes that all file activity is managed by its caches; if a client modifies data directly on the server, consistency errors could arise. Also, its ability to overcome network latency for client accesses to data that is not resident in the cache is limited to performing file-based read-ahead. For example, in NFS, a client that opens a file must look up each component of the path (once per round-trip) to ultimately locate the desired file handle and file-based read-ahead does nothing to eliminate these round-trips.
A different approach to dealing with network latency when clients access data that is not in the cache is to predict file access patterns. A number of research publications describe approaches that attempt to predict the next file (or files) a client might access based on the files it is currently accessing and has accessed in the past, see [Amer02], [Lei97], [Griffioen94], [Kroeger99], for example. Based on these prediction models, these systems pre-fetch the predicted files by reading them into a cache. Unfortunately, this approach presumes the existence of a cache and thus entails the complexities and difficulties of cache coherency.
In the context of the World-wide Web, other research has applied this prediction concept to Web objects [Padmanabhan96]. In this approach, the server keeps track of client access patterns and passes this information as a hint to the client. The client in turn can choose to pre-fetch into its cache the URLs that correspond to the hinted objects. Again, this approach presumes the existence of a cache, and can be deployed without disrupting the semantics of the Web protocols only because the Web is generally read-only and does not require strong consistency.
As enterprises have come to depend more and more on distributed access to data, the performance problems described herein have become more pervasive and noticed by protocol and application designers. In the context of file server protocols, attempts have been made in recent versions of NFS to address some of the challenges of file server access over the wide area. In particular, NFS version 4 (NFSv4) includes the notion of “compound operations” where multiple requests (and responses) can be bundled into a single network message payload and, at the server, the result from one request is fed to the input of the next operation in that same message, and so forth. The goal here is to reduce the number of wide-area round trips that transpire between the client and server, thereby enhancing performance. Also, NFSv4 includes a mechanism called a “delegation” whereby a file can be cached at the client and operations performed on the file locally. While compound operations and delegations can be exploited to enhance wide-area file server performance, they have limited impact. In particular, while compound operations are potentially helpful, in practice it is difficult for a client implementation to utilize them extensively enough to impact WAN performance without changing the way an application interfaces to and interacts with the file system, and making such changes is, in general, not practical. Likewise, while delegations can also be helpful, they still rely upon transporting the data from the file server to the client over the network, which may be a costly and performance-impairing operation in many environments. In short, while recent improvements to file system design are in general helpful to wide-area performance problems, they fall short of solving such problems comprehensively.
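The effect of compound operations on round trips can be illustrated with a toy model. The `ToyServer` class, its nested-dictionary namespace, and the use of dictionary objects as file handles are hypothetical and bear no relation to the NFSv4 wire format. The sketch only counts how many messages the client must wait for when resolving a three-component path.

```python
# Hypothetical namespace: directories are dicts, files are bytes.
TREE = {"home": {"alice": {"report.txt": b"quarterly numbers"}}}


class ToyServer:
    def __init__(self, tree):
        self.tree = tree
        self.round_trips = 0

    def lookup(self, handle, name):
        """One request per message: resolve a single path component."""
        self.round_trips += 1
        return handle[name]

    def compound(self, names):
        """One message carrying several lookups; the result of each lookup
        is used as the input of the next one, all at the server."""
        self.round_trips += 1
        handle = self.tree
        for name in names:
            handle = handle[name]
        return handle


def resolve_one_by_one(server, names):
    handle = server.tree
    for name in names:
        handle = server.lookup(handle, name)
    return handle
```

Resolving `["home", "alice", "report.txt"]` costs three round trips with `resolve_one_by_one` and one with `compound`. The saving is available only if the client knows the whole sequence of operations in advance, which is the practical difficulty noted above.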
While the advent of WAN accelerators (such as those described in McCanne I, McCanne II, and McCanne III) has proven that excellent performance for network file access can be achieved over a WAN, certain performance improvements might be possible in the way that existing file servers and WAN accelerators interoperate.
For example, there still may remain duplication of effort, such as when a file is read multiple times by one or more clients. Another problem might occur with redundant transformations. Yet another problem with existing file server protocol designs is that they presume a tight coupling between a file and its underlying data, and this coupling imposes certain constraints on file server network protocols. When a file is to be accessed from a remote location, the data underlying each accessed portion of the file must be transported over the network. This is typically accomplished by the client issuing read requests to the server for the data desired with respect to the file in question. Likewise, if the client modifies the file, new data representing the modifications must be transported over the network using write requests sent to the server. If there are pending writes when the client closes the file, the client must typically block and wait for all of the pending writes to complete and be acknowledged by the server before the close operation can finish. This is typically perceived by the user as the application hanging or waiting for a save operation to complete. When the network has high latency or insufficient bandwidth, these sorts of data operations can be highly inefficient and burdensome to the user. This overall issue is sometimes called the “cold write performance” problem because write performance suffers for “cold” data, where so-called cold data is data that is not currently in the WAN accelerators' segment stores.
The cold write performance problem, illustrated in FIG. 1, occurs when the client writes new data to a file and immediately closes that file. When new data is encountered by a WAN accelerator and that data is not present in its segment store, the WAN accelerator must transmit the new data across the network. For example, in McCanne I, new data is sent in the form of segment bindings for newly established segments. FIG. 1 shows the exchange of messages in this case. Here the file is first opened, then write commands 100, 101, . . . , 104 are transmitted from the client to the server, along with the data payloads that comprise the write requests. To optimize the protocol, the WAN accelerator pipelines the write requests by pre-acknowledging the write requests with write acknowledgments 110, 111, . . . , 114. Because the write messages are pipelined in this fashion, the protocol exchange from the client's perspective performs efficiently up until the close request is processed. At this point, the client-side WAN accelerator is unable to pre-acknowledge the close request until all of the pipelined writes have been processed at the server, ensuring that all of the data has been safely written. Otherwise, if the client-side WAN accelerator acknowledged the close request before this occurred, the file server protocol semantics could be violated and conditions could arise under which the client had received a successful acknowledgment of the writes and the close, but the data never made it safely onto stable storage (e.g., because the file server crashes after the close is acknowledged but before the data is written to disk). Because of this, the client-side WAN accelerator must defer acknowledging the close until all the writes have been acknowledged at the server. When there are many new segments to be transmitted over a possibly bandwidth-constrained link, this operation can take a non-trivial amount of time.
Thus, the client can perceive a noticeable delay while the close is waiting on the writes to complete. This delay is depicted as the “hang time” interval indicated on the diagram.
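A rough, back-of-the-envelope estimate of the hang time can be sketched as follows. The model is an assumption made for illustration, not a measurement: it treats the hang time as the time to drain the data still queued at the client-side WAN accelerator when the close arrives, plus one round trip for the final acknowledgment, and it ignores server processing and disk write time. The function name and parameters are hypothetical.

```python
def close_hang_time(pending_bytes, bandwidth_bytes_per_s, rtt_s):
    """Estimate how long a close blocks under the simplified model above.

    pending_bytes          -- new ("cold") data not yet sent across the WAN
    bandwidth_bytes_per_s  -- usable WAN bandwidth, in bytes per second
    rtt_s                  -- WAN round-trip time, in seconds
    """
    drain_time = pending_bytes / bandwidth_bytes_per_s
    return drain_time + rtt_s
```

With purely illustrative inputs of 1,000,000 pending bytes, a link carrying 125,000 bytes per second, and a 0.1 second round trip, the estimate is about 8.1 seconds, nearly all of which is spent moving the data rather than waiting on latency.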
Another challenge for WAN accelerators with regard to file server protocols is the so-called CAP theorem. Conjectured by Brewer in 2000 [Brewer00], and later proved by Gilbert and Lynch [Gilbert02], the CAP theorem states that a distributed data services architecture cannot achieve both data consistency and data availability in the face of network partitions. In other words, consistency (C), availability (A), and partition-tolerance (P), or CAP, cannot all simultaneously coexist, but rather, at most two of the three said properties can co-exist. In general, with respect to file server protocols, WAN accelerators favor consistency over availability and do not allow a client to read or modify a file in the face of a network partition. However, this mechanism works only for files that are presently open when the network partition arises. If a client at a remote site wishes to access a file while a network partition exists, a WAN accelerator has no means by which to allow such access to occur without being able to contact the origin file server. It would be desirable, however, to allow such accesses to occur based on IT policies that control where the available access points lie for a given file server (or file share on a file server) in the face of network partitions.
Unfortunately, while many of the above techniques solve some aspects of WAN performance and file storage and serving problems, they still have some shortcomings and several challenges remain in the way that existing WAN accelerators and file servers intercommunicate. In view of the above problems and the limitations with existing solutions, improvements can be made in how file servers and WAN accelerators operate and interact over a network.