The present invention relates generally to systems for moving data through limited channels efficiently where the channels might be limited by bandwidth and/or latency, and more particularly to having data available in response to a request for data over a limited channel faster than if the data were sent unprocessed in response to the request, possibly taking into account varying applications, systems and protocols of and for the requested data.
Local Area Network (LAN) communication is characterized by generous bandwidths, low latencies and considerable enterprise control over the network. By contrast, Wide-Area Networks (WANs) often have lower bandwidths and higher latencies than LANs and often have a measure of network control that is outside the enterprise for which the WAN is being used.
Wide-area client-server applications are a critical part of almost any large enterprise. A WAN might be used to provide access to widely used and critical infrastructure, such as file servers, mail servers and networked storage. This access most often has very poor throughput when compared to the performance across a LAN. Whether an enterprise is taking a centralized approach or a distributed approach, high performance communication across the WAN is essential in order to minimize costs and maximize productivity.
Many applications and systems that operate well over high-speed connections need to be adapted to run on slower speed connections. For example, operating a file system over a local area network (LAN) works well, but often files need to be accessed where a high-speed link, such as a LAN, is not available along the entire path between the client needing access to the file and the file server serving the file. Similar design problems exist for other network services, such as e-mail services, computational services, multimedia, video conferencing, database querying, office collaboration, etc.
In a networked file system, for example, files used by applications in one place might be stored in another place. In a typical scenario, a number of users operating at computers networked throughout an organization and/or a geographic region share a file or sets of files that are stored in a file system. The file system might be near one of the users, but typically it is remote from most of them, even though the users often expect the files to appear to be near their sites.
As used herein, “client” generally refers to a computer, computing device, peripheral, electronics, or the like, that makes a request for data or an action, while “server” generally refers to a computer, computing device, peripheral, electronics, or the like, that operates in response to requests for data or action made by one or more clients.
A request can be for operation of the computer, computing device, peripheral, electronics, or the like, and/or for an application being executed or controlled by the client. One example is a computer running a word processing program that needs a document stored externally to the computer and uses a network file system client to make a request over a network to a file server. Another example is a request for an action directed at a server that itself performs the action, such as a print server, a processing server, a control server, an equipment interface server, an I/O (input/output) server, etc.
A request is often satisfied by a response message supplying the data requested or performing the action requested, or a response message indicating an inability to service the request, such as an error message or an alert to a monitoring system of a failed or improper request. A server might also block a request, forward a request, transform a request, or the like, and then respond to the request or not respond to the request.
In some instances, an object normally thought of as a server can act as a client and make requests and an object normally thought of as a client can act as a server and respond to requests. Furthermore, a single object might be both a server and a client, for other servers/clients or for itself. For example, a desktop computer might be running a database client and a user interface for the database client. If the desktop computer user manipulated the database client to cause it to make a request for data, the database client would issue a request, presumably to a database server. If the database server were running on the same desktop computer, the desktop computer would be, in effect, making a request to itself. It should be understood that, as used herein, clients and servers are often distinct and separated by a network, physical distance, security measures and other barriers, but those are not required characteristics of clients and servers.
In some cases, clients and servers are not necessarily exclusive. For example, in a peer-to-peer network, one peer might make a request of another peer but might also serve responses to that peer. Therefore, it should be understood that while the terms “client” and “server” are typically used herein as the actors making “requests” and providing “responses”, respectively, those elements might take on other roles not clearly delineated by the client-server paradigm.
Generally, a request-response cycle can be referred to as a “transaction” and for a given transaction, some object (physical, logical and/or virtual) can be said to be the “client” for that transaction and some other object (physical, logical and/or virtual) can be said to be the “server” for that transaction.
As explained above, a transaction over a network involves bidirectional communication between two computing entities, where one entity is the client and initiates a transaction by opening a network channel to another entity (the server). Typically, the client sends a request or set of requests via a set of networking protocols over that network channel, and the request or requests are processed by the server, returning responses. Many protocols are connection-based, whereby the two cooperating entities (sometimes known as “hosts”) negotiate a communication session to begin the information exchange. In setting up a communication session, the client and the server might each maintain state information for the session, which may include information about the capabilities of each other. At some level, the session forms what is logically (or physically, in some cases) considered a “connection” between the client and server. Once the connection is established, communication between the client and server can proceed using state from the session establishment and other information and send messages between the client and the server, wherein a message is a data set comprising a plurality of bits in a sequence, possibly packaged as one or more packets according to an underlying network protocol. Typically, once the client and the server agree that the session is over, each side disposes of the state information for that transaction, other than possibly saving log information.
To realize a networking transaction, computing hosts make use of a set of networking protocols for exchanging information between the two computing hosts. Many networking protocols have been designed and deployed, with varying characteristics and capabilities. The Internet Protocol (IP), Transmission Control Protocol (TCP), and User Datagram Protocol (UDP) are three examples of protocols that are in common use today. Various other networking protocols might also be used.
Since protocols evolve over time, a common design goal is to allow for future modifications and enhancements of the protocol to be deployed in some entities, while still allowing those entities to interoperate with hosts that are not enabled to handle the new modifications. One simple approach to accomplishing interoperability is a protocol version negotiation. In an example of a protocol version negotiation, one entity informs the other entity of the capabilities that the first entity embodies. The other entity can respond with the capabilities that the other entity embodies. Through this negotiation, each side can be made aware of the capabilities of the other, and the channel communication can proceed with this shared knowledge. To be effective, this method must ensure that if one entity advertises a capability that the other entity does not understand, the second entity is still able to handle the connection. This method is used in both the IP and TCP protocols: each provides a mechanism by which a variable length set of options can be conveyed in a message. The specification for each protocol dictates that if one entity does not have support for a given option, it should ignore that option when processing the message. Other protocols may have similar features that allow for messages to contain data that is understood by some receivers of the data but possibly not understood by other receivers of the data, wherein a receiver that does not understand the data will not fail in its task and will typically forward on the not understood data such that another entity in the path will receive that data.
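By way of illustration only, the rule that a receiver ignores options it does not support might be sketched as follows. The record layout (one byte of option kind, one byte of value length, then the value) and the option codes are hypothetical simplifications chosen for this sketch; they are not the actual IP or TCP option encodings.

```python
# Hypothetical option codes for illustration; not real IP or TCP option numbers.
KNOWN_OPTIONS = {1: "max_segment", 2: "window_scale"}


def encode_options(options):
    """Encode (kind, value_bytes) pairs as kind, length, value records."""
    out = bytearray()
    for kind, value in options:
        out += bytes([kind, len(value)]) + value
    return bytes(out)


def parse_options(data):
    """Return the options this receiver understands and skip all others."""
    understood = {}
    i = 0
    while i + 2 <= len(data):
        kind, length = data[i], data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if kind in KNOWN_OPTIONS:
            understood[KNOWN_OPTIONS[kind]] = value
        # An unknown kind is stepped over using its length, not treated as an error.
        i += 2 + length
    return understood


# The sender advertises option 99, which this receiver has never heard of.
message = encode_options([(1, b"\x05\xb4"), (99, b"new-feature"), (2, b"\x07")])
parsed = parse_options(message)
```

Because every record carries its own length, the receiver can step over option 99 and still process the options that follow it.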
A message from a client to a server or vice-versa traverses one or more network “paths” connecting the client and server. A basic path would be a physical cable connecting the two hosts. More typically, a path involves a number of physical communication links and a number of intermediate devices (e.g., routers) that are able to transmit a packet along a correct path to the server, and transmit the response packets from the server back to the client. These intermediate devices typically do not modify the contents of a data packet; they simply pass the packet on in a correct direction. However, it is possible that a device that is in the network path between a client and a server could modify a data packet along the way. To avoid violating the semantics of the networking protocols, any such modifications should not alter how the packet is eventually processed by the destination host.
As used herein, the terms “near”, “far”, “local” and “remote” might refer to physical distance, but more typically they refer to effective distance. The effective distance between two computers, computing devices, servers, clients, peripherals, etc. is, at least approximately, a measure of the difficulty of getting data between the two computers. For example, where file data is stored on a hard drive connected directly to a computer processor using that file data, and the connection is through a dedicated high-speed bus, the hard drive and the computer processor are effectively “near” each other, but where the traffic between the hard drive and the computer processor is over a slow bus, with more intervening events possible to waylay the data, the hard drive and the computer processor are said to be farther apart.
Greater and lesser physical distances need not correspond with greater and lesser effective distances. For example, a file server and a desktop computer separated by miles of high-quality and high-bandwidth fiber optics might have a smaller effective distance compared with a file server and a desktop computer separated by a few feet and coupled via a wireless connection in a noisy environment.
Causes of Poor WAN Throughput
The two primary causes of the slow throughput on WANs are well known: high delay (or latency) and limited bandwidth. The “bandwidth” of a network or channel refers to a measure of the number of bits that can be transmitted over a link or path per unit of time (usually measured in number of bits per unit time). “Latency” refers to a measure of the amount of time that transpires while the bits traverse the network, e.g., the time it takes a given bit transmitted from the sender to reach the destination (usually measured in time units). “Round-trip time” refers to the sum of the “source-to-destination” latency and the “destination-to-source” latency. If the underlying paths are asymmetric, the round-trip latency might be different from twice a one-way latency. The term “throughput” is sometimes confused with bandwidth but refers to a measure of an attained transfer rate that a client-server application, protocol, etc. achieves over a network path. Throughput is typically less than the available network bandwidth.
The speed of light, a fundamental and fixed constant, implies that information transmitted across a network always incurs some nonzero latency as it travels from the source to the destination. In practical terms, this means that sending a packet from the Silicon Valley in California to New York and back takes at least 30 milliseconds (ms), the time information in an electromagnetic signal would take to travel that distance in a direct path cross-country. In reality, this cross-country round trip time is more in the range of 100 ms or so, as signals in fiber or copper do not always travel at the speed of light in a vacuum and packets incur processing delays through each switch and router. This amount of latency is quite significant as it is at least two orders of magnitude higher than typical sub-millisecond LAN latencies.
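The propagation floor described above can be checked with a short calculation. In this sketch the 4,500 km path length and the 0.67 velocity factor for fiber are assumptions chosen for illustration, not figures from the text; only the speed of light in a vacuum is a fixed constant.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum


def min_round_trip_ms(distance_km, velocity_factor=1.0):
    """Lower bound on round-trip time from propagation delay alone."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor)
    return 2 * one_way_s * 1000


PATH_KM = 4_500  # assumed one-way path length between the two coasts

vacuum_rtt_ms = min_round_trip_ms(PATH_KM)       # about 30 ms
fiber_rtt_ms = min_round_trip_ms(PATH_KM, 0.67)  # about 45 ms, before any queuing
```

The gap between these floors and an observed round trip of 100 ms or so is what remains for indirect routing, switching and router processing.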
Other round-trips might have more latency. Round trips from the West Coast of the U.S. to Europe can be in the range of 100-200 ms, and some links using geo-stationary satellites into remote sites can have latencies in the 500-800 ms range. With latencies higher than about 50 ms, many client-server protocols and applications will function poorly relative to a LAN, as those protocols and applications expect very low latency.
While many employees routinely depend upon Fast Ethernet (100 Mbps) or Gigabit Ethernet (1 Gbps) within most corporate sites and headquarters facilities, the bandwidth interconnecting many corporate and industrial sites in the world is much lower. Even with DSL, Frame Relay or other broadband technologies, WAN connections are slow relative to a LAN. For example, 1 Mbps DSL service offers only 1/100th the bandwidth of Fast Ethernet and 1/1,000th of what is available using Gigabit Ethernet.
While some places might have high bandwidth backbone networks, such as the Metro Ethernet available in South Korea and Japan, the latency and bandwidth issues persist whenever data needs to travel outside areas with such networks. For example, a Japanese manufacturer with plants in Japan and the U.S. might need to send CAD/CAM files back and forth between plants. The latency from Japan to the East Coast of the U.S. might be as high as 200 ms and trans-Pacific bandwidth can be expensive and limited.
WAN network bandwidth limits almost always impact client-server application throughput across the WAN, but more bandwidth can be bought. Lower latency, by contrast, cannot be bought if it would require faster-than-light communication. In some cases, network latency is the bottleneck on performance or throughput. This is often the case with window-based transport protocols such as TCP or a request-response protocol such as the Common Internet File System (CIFS) protocol or the Network File System (NFS) protocol. High network latency particularly slows down “chatty” applications, even if the actual amounts of data transmitted in each transaction are not large. “Chatty” applications are those in which client-server interactions involve many back-and-forth steps that might not even depend on each other. Adding bandwidth (or compressing data) does not improve the throughput of these protocols/applications once the round-trip time exceeds some critical point, and beyond that critical point throughput decays quickly.
This phenomenon can be understood intuitively: the rate of work that can be performed by a client-server application that executes serialized steps to accomplish its tasks is inversely proportional to the round-trip time between the client and the server. If the client-server application is bottlenecked in a serialized computation (i.e., it is “chatty”), then increasing the round-trip by a factor of two causes the throughput to decrease by a factor of two because it takes twice as long to perform each step (while the client waits for the server and vice versa).
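As a rough illustration of this inverse relationship, the time to complete a fixed number of serialized request-response steps can be computed directly. The step count and the two round-trip times below are assumed example values.

```python
def serialized_time_s(steps, rtt_s, per_step_work_s=0.0):
    """Total time when each step must wait for the previous response."""
    return steps * (rtt_s + per_step_work_s)


STEPS = 1_000  # assumed number of back-and-forth exchanges for one task

lan_time_s = serialized_time_s(STEPS, 0.0005)  # 0.5 ms round trip: about 0.5 s
wan_time_s = serialized_time_s(STEPS, 0.100)   # 100 ms round trip: about 100 s
doubled_s = serialized_time_s(STEPS, 0.200)    # doubling the RTT doubles the time
```

No payload size appears in the calculation, which is why adding bandwidth does not help a workload of this shape.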
More generally, the throughput of client-server applications that are not necessarily chatty but run over a window-based protocol (such as TCP) can also suffer from a similar fate. This can be modeled with a simple equation that accounts for the round-trip time (RTT) and the protocol window (W). The window defines how much data the sender can transmit before requiring receipt of an acknowledgement from the receiver. Once a window's worth of data is sent, the sender must wait until it hears from the receiver. Since it takes a round-trip time to receive the acknowledgement from the receiver, the rate at which data can be sent is simply the window size divided by the round trip time:
T = W / RTT
The optimal choice of window size depends on a number of factors. To perform well across a range of network conditions, a TCP device attempts to adapt its window to the underlying capacity of the network. So, if the underlying bottleneck bandwidth (or the TCP sender's share of the bandwidth) is roughly B bits per second, then a TCP device attempts to set its window to B×RTT, and the throughput, T, would be:
T = (B×RTT) / RTT = B
In other words, the throughput would be equal to the available rate. Unfortunately, there are often other constraints. Many protocols, such as TCP and CIFS, have an upper bound on the window size that is built into the protocol. For example, the maximum request size in CIFS is 64 KB and in the original TCP protocol, the maximum window size was limited by the fact that the advertised window field in the protocol header is 16 bits, limiting the window also to 64 KB. While modern TCP stacks implement the window scaling method in RFC 1323 to overcome this problem, there are still many legacy TCP implementations that do not negotiate scaled windows, and there are more protocols such as CIFS that have application-level limits on top of the TCP window limit. So, in practice, the throughput is actually limited by the maximum window size (MWS):
T = min(B×RTT, MWS) / RTT ≤ B
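The effect of the window cap can be illustrated numerically. In the sketch below the 45 Mb/s link rate and the two round-trip times are assumed example values, and the 64 KB cap is converted to bits so that it is comparable with B×RTT.

```python
def window_limited_throughput(bandwidth_bps, rtt_s, max_window_bits):
    """T = min(B x RTT, MWS) / RTT, with every quantity expressed in bits."""
    window_bits = min(bandwidth_bps * rtt_s, max_window_bits)
    return window_bits / rtt_s


MWS_BITS = 64 * 1024 * 8  # 64 KB window cap, converted to bits
LINK_BPS = 45_000_000     # assumed example link rate

# At 1 ms the window needed to fill the link is far below the cap.
short_rtt_bps = window_limited_throughput(LINK_BPS, 0.001, MWS_BITS)
# At 100 ms the 64 KB cap binds and throughput falls to roughly 5.2 Mb/s.
long_rtt_bps = window_limited_throughput(LINK_BPS, 0.100, MWS_BITS)
```

The second result depends only on the cap and the round-trip time, not on the link rate.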
Even worse, there is an additional constraint on throughput that is fundamental to the congestion control algorithm designed into TCP. This constraint turns out to be non-negligible in wide-area networks where bandwidth is above a few megabits per second and is probably the key reason why enterprises often fail to see marked performance improvements of individual applications after substantial bandwidth upgrades.
Essentially, this problem stems from conflicting goals of the TCP congestion control algorithm that are exacerbated in a high-delay environment. Namely, upon detecting packet loss, a TCP device reacts quickly and significantly to err on the side of safety (i.e., to prevent a set of TCP connections from overloading and congesting the network). Yet, to probe for available bandwidth, a TCP device will dynamically adjust its sending rate and continually push the network into momentary periods of congestion that cause packet loss to detect bandwidth limits. In short, a TCP device continually sends the network into congestion then aggressively backs off. In a high-latency environment, the slow reaction time results in throughput limitations.
An equation was derived in the late 1990s that models the behavior of a network as a function of the packet loss rate that TCP induces and that equation is:
CWS = 1.2 × S / sqrt(p)
As indicated by that equation, the average congestion window size (CWS) is roughly determined by the packet size (S) and the loss rate (p). Taking this into account, the actual throughput of a client-server application running over TCP is:
T = W / RTT = min(MWS, CWS, B×RTT) / RTT
With a T3 line, the TCP throughput starts out at the available line rate (45 Mb/s) at low latencies, but at higher latencies the throughput begins to decay rapidly (in fact, hyperbolically). This effect is so dramatic that at a 100 ms delay (i.e., a typical cross-country link), TCP throughput is only 4.5 Mb/s of the 45 Mb/s link.
Under such conditions, application performance does not always increase when additional bandwidth is added. If the round trip time (RTT) is greater than a critical point (just 15 ms or so in this example) then increasing the bandwidth of the link will only marginally improve throughput at higher latency and at even higher latencies, throughput is not increased at all with increases in bandwidth. In environments with relatively low loss rates and normal WAN latencies, throughput can be dramatically limited.
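The combined model can be evaluated to show why a bandwidth upgrade may leave throughput unchanged. The packet size of 1,460 bytes and the loss rate of 0.1% below are assumptions chosen for illustration; the text does not state the values behind its T3 figures, so the results are indicative only.

```python
import math


def tcp_throughput_bps(bandwidth_bps, rtt_s, mws_bits, packet_bits, loss_rate):
    """T = min(MWS, CWS, B x RTT) / RTT with CWS = 1.2 x S / sqrt(p)."""
    cws_bits = 1.2 * packet_bits / math.sqrt(loss_rate)
    window_bits = min(mws_bits, cws_bits, bandwidth_bps * rtt_s)
    return window_bits / rtt_s


MWS_BITS = 64 * 1024 * 8  # 64 KB maximum window, in bits
PACKET_BITS = 1460 * 8    # assumed packet size S
LOSS_RATE = 0.001         # assumed loss rate p (0.1%)

t3_at_100ms = tcp_throughput_bps(45e6, 0.100, MWS_BITS, PACKET_BITS, LOSS_RATE)
doubled_at_100ms = tcp_throughput_bps(90e6, 0.100, MWS_BITS, PACKET_BITS, LOSS_RATE)
t3_at_5ms = tcp_throughput_bps(45e6, 0.005, MWS_BITS, PACKET_BITS, LOSS_RATE)
```

Under these assumptions the congestion window is the binding term at 100 ms, giving roughly 4.4 Mb/s on both the 45 Mb/s and the 90 Mb/s link, while at 5 ms the full 45 Mb/s is attained.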
Existing Approaches to Overcoming WAN Throughput Problems
Given the high costs and performance challenges of WAN-based enterprise computing and communication, many approaches have been proposed for dealing with these problems.
Perhaps the simplest approach to dealing with performance is to simply upgrade the available bandwidth in the network. Of course this is the most direct solution, but it is not always the most effective approach. First of all, contrary to popular belief, bandwidth is not free and the costs add up quickly for large enterprises that may have hundreds of offices. Second, as discussed earlier, adding bandwidth does not necessarily improve throughput. Third, in some places adding more bandwidth is not possible, especially across international sites, in remote areas, or where it is simply too expensive to justify.
Another approach is to embed intelligence in the applications themselves, e.g., to exploit the fact that data often changes in incremental ways so that the application can be designed to send just incremental updates between clients and servers. Usually, this type of approach employs some sort of versioning system to keep track of version numbers of files (or data objects) so that differences between versioned data can be sent between application components across the network. For example, some content management systems have this capability and storage backup software generally employs this basic approach. However, these systems do not deal with scenarios where data is manipulated outside of their domain. For example, when a file is renamed and re-entered into the system the changes between the old and new versions are not captured. Likewise, when data flows between distinct applications (e.g., a file is copied out of a content management system and into a file system), versioning cannot be carried out between the different components.
This approach of managing versions and communicating updates can be viewed as one specific (and application-specific) approach to compression. More generally, data compression systems can be utilized to ameliorate network bandwidth bottlenecks. Compression is a process of representing one set of data with another set of data wherein the second set of data is, on average, a smaller number of bits than the first set of data, such that the first set of data, or at least a sufficient approximation of the first set of data, can be recovered from an inverse of the compression process in most cases. Compression allows for more efficient use of a limited bandwidth and might result in less latency, but in some cases, no latency improvement occurs. In some cases, compression might add to the latency, if time is needed to compress data after the request is made and time is needed to decompress the data after it is received. This may be able to be improved if the data can be compressed ahead of time, before the request is made, but that may not be feasible if the data is not necessarily available ahead of time for compression, or if the volume of data from which the request will be served is too large relative to the amount of data likely to be used.
One way to deploy compression is to embed it in applications. For example, a Web server can compress the HTML pages it returns before delivering them across the network to end clients. Another approach is to deploy compression in the network without having to modify the applications. For many years, network devices have included compression options as features (e.g., in routers, modems, dedicated compression devices, etc.) [D. Rand, “The PPP Compression Control Protocol (CCP)”, Request-for-Comments 1962, June 1996]. This is a reasonable thing to do, but the effectiveness is limited. Most methods of lossless data compression typically reduce the amount of data (i.e., bandwidth) by a factor of 1.5 to 4, depending on the inherent redundancy present. While helpful, this is not enough to dramatically change performance if the amount of data being sent is large or similar data is sent repeatedly, perhaps over longer time scales. Also, when performance is limited by network latency, compressing the underlying data will have little or no impact.
Rather than compress the data, another approach to working around WAN bottlenecks is to replicate servers and server data in local servers for quick access. This approach in particular addresses the network latency problem because a client in a remote site can now interact with a local server rather than a remote server. There are several methods available to enterprises to store redundant copies of data in replicated file systems, redundant or local storage servers, or by using any number of distributed file systems. The challenge with this kind of approach is the basic problem of managing the ever-exploding amount of data, which requires scaling up storage, application and file servers in many places, and trying to make sure that the files people need are indeed available where and when they are needed. Moreover, these approaches are generally non-transparent, meaning the clients and servers must be modified to implement and interact with the agents and/or devices that perform the replication function. For example, if a file server is replicated to a remote branch, the server must be configured to send updates to the replica and certain clients must be configured to interact with the replica while others need to be configured to interact with the original server.
Rather than replicate servers, another approach is to deploy transport-level or application-level devices called “proxies”, which function as performance-enhancing intermediaries between the client and the server. In this case, a proxy is the terminus for the client connection and initiates another connection to the server on behalf of the client. Alternatively, the proxy connects to one or more other proxies that in turn connect to the server. Each proxy may forward, modify, or otherwise transform the transactions as they flow from the client to the server and vice versa. Examples of proxies include (1) Web proxies that enhance performance through caching or enhance security by controlling access to servers, (2) mail relays that forward mail from a client to another mail server, (3) DNS relays that cache DNS name resolutions, and so forth.
One problem that must be overcome when deploying proxies is that of directing client requests to the proxy instead of to the destination server. One mechanism for accomplishing this is to configure each client host or process with the network address information of the proxy. This requires that the client application have an explicit proxy capability, whereby the client can be configured to direct requests to the proxy instead of to the server. In addition, this type of deployment requires that all clients must be explicitly configured and that can be an administrative burden on a network administrator.
One way around the problems of explicit proxy configuration is to deploy a transparent proxy. The presence of the transparent proxy is not made explicitly known to the client process, so all client requests proceed along the network path towards the server as they would have if there were no transparent proxy. This might be done by placing the transparent proxy host in the network path between the client and the server. An L4 switch is then employed so the proxy host can intercept client connections and handle the requests via the proxy. For example, the L4 switch could be configured so that all Web connections (i.e., TCP connections on port 80) are routed to a local proxy process. The local proxy process can then perform operations on behalf of the server. For example, the local proxy process could respond to the request using information from its local cache. When intercepting the connection, the L4 switch performs NAT (network address translation) so the connection appears to the client as having been terminated at the origin server, even though the client communicates directly with the proxy. In this manner, the benefits of a proxy can be realized without the need for explicit client configuration.
Some benefits of a transparent proxy require that a proxy pair exist in the network path. For example, if a proxy is used to transform data in some way, a second proxy preferably untransforms the data. For example, where traffic between a client and a server is to be compressed or encrypted for transport over a portion of the network path between the client and the server, a proxy on one side of that portion would compress or encrypt data before it flows over that portion and a proxy on the other side of that portion would uncompress or decrypt the data and send it along the network path, thereby providing for transparent transformation of data flowing between the client and the server.
For actions that require a proxy pair, preferably both proxies in the proxy pair do not perform a transformation unless they can be assured of the existence and operation of the other proxy in the proxy pair. Where each proxy must be explicitly configured with indications of the pairs to which it belongs and to the identity of the other members of those pairs, the administrative burden on a network administrator might well make some operations infeasible if they require proxy pairs. Even where a proxy is interposed in a network and gets all of the traffic from a client or server, it still must discover the other member for each proxy pair the proxy needs, if the proxy is to perform actions that require proxy pairs.
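A minimal sketch of a paired transformation follows, using compression from Python's standard zlib module as the transform. The class name, the peer_confirmed flag and the per-message transformed marker are hypothetical; how one proxy actually discovers the other is not addressed by this sketch.

```python
import zlib


class PairedProxy:
    """One end of a hypothetical proxy pair that compresses traffic in transit."""

    def __init__(self, name):
        self.name = name
        self.peer_confirmed = False  # set only once the other proxy is known to exist

    def outbound(self, payload):
        """Return (transformed, data); transform only if a peer can undo it."""
        if self.peer_confirmed:
            return True, zlib.compress(payload)
        return False, payload

    def inbound(self, transformed, data):
        """Undo the peer's transformation so the endpoint sees the original bytes."""
        return zlib.decompress(data) if transformed else data


client_side = PairedProxy("client-side")
server_side = PairedProxy("server-side")
request = b"GET /report.doc HTTP/1.0\r\n" * 50

# Without a confirmed peer the data passes through untouched.
flag_plain, wire_plain = client_side.outbound(request)

# With a confirmed peer the data is compressed on the wire and restored after it.
client_side.peer_confirmed = True
flag_packed, wire_packed = client_side.outbound(request)
delivered = server_side.inbound(flag_packed, wire_packed)
```

In either mode the server receives exactly the bytes the client sent, which is what makes the transformation transparent to both endpoints.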
With a proxy situated between the client and server, the performance impairments of network latency can be addressed by having the proxy cache data. Caching is a process of storing previously transmitted results in the hopes that the user will request the results again and receive a response more quickly from the cache than if the results had to come from the original provider. Caching also provides some help in mitigating both latency and bandwidth bottlenecks, but in some situations it does not help much. For example, where a single processor is retrieving data from memory it controls and does so in a repetitive fashion, as might be the case when reading processor instructions from memory, caching can greatly speed a processor's tasks. Similarly, file systems have employed caching mechanisms to store recently accessed disk blocks in host memory so that subsequent accesses to cached blocks are completed much faster than reading them in from disk again, as in the BSD Fast File System [McKusick, et al., “A Fast File System for UNIX”, ACM Transactions on Computer Systems, Vol. 2(3), 1984], the Log-structured File System [Rosenblum and Ousterhout, “The Design and Implementation of a Log-structured File System”, ACM Transactions on Computer Systems, Vol. 10(1), 1992], etc.
In a typical cache arrangement, a requestor requests data from some memory, device or the like and the results are provided to the requestor and stored in a cache having a faster response time than the original device supplying the data. Then, when the requestor requests that data again, if it is still in the cache, the cache can return the data in response to the request before the original device could have returned it and the request is satisfied that much sooner.
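The arrangement just described might be sketched as follows. The class name, the dictionary standing in for the slower original device, and the fetch counter are illustrative assumptions.

```python
class ReadCache:
    """Keeps results from a slower origin so repeat requests skip the origin."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._store = {}
        self.origin_fetches = 0  # counts how often the slow path was taken

    def get(self, key):
        if key not in self._store:
            self.origin_fetches += 1
            self._store[key] = self._fetch(key)
        return self._store[key]


# A dictionary stands in for the original device supplying the data.
origin = {"report.doc": b"quarterly numbers"}
cache = ReadCache(lambda key: origin[key])

first = cache.get("report.doc")   # fetched from the origin and stored
second = cache.get("report.doc")  # served from the cache
```

After both requests the origin has been consulted only once.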
Caching has its difficulties, one of which is that the data might change at the source and the cache would then be supplying “stale” data to the requestor. This is the “cache consistency” problem. Because of this, caches are often “read only” requiring that changes to data be transmitted through the cache back to the source in a “write-through” fashion. Another problem with caching is that the original source of the data might want to track usage of data and would not be aware of uses that were served from the cache as opposed to from the original source. For example, where a Web server is remote from a number of computers running Web browsers that are “pointed to” that Web server, the Web browsers might cache Web pages from that site as they are viewed, to avoid delays that might occur in downloading the Web page again. While this would improve performance in many cases, and reduce the load on the Web server, the Web server operator might try to track the total number of “page views” but would be ignorant of those served by the cache. In some cases, an Internet service provider might operate the cache remote from the browsers and provide cached content for a large number of browsers, so a Web server operator might even miss unique users entirely.
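The stale-data problem and the write-through remedy can be shown in a few lines. The class and method names here are hypothetical, and a dictionary again stands in for the source of the data.

```python
class WriteThroughCache:
    """Cache whose writes always pass through to the source."""

    def __init__(self, origin):
        self._origin = origin  # dictionary standing in for the source
        self._store = {}

    def read(self, key):
        if key not in self._store:
            self._store[key] = self._origin[key]
        return self._store[key]

    def write(self, key, value):
        self._origin[key] = value  # the source is updated as part of the write
        self._store[key] = value


origin = {"page": "v1"}
cache = WriteThroughCache(origin)
cache.read("page")

origin["page"] = "v2"       # changed at the source, bypassing the cache
stale = cache.read("page")  # the cache still answers with the old value

cache.write("page", "v3")   # a write made through the cache
fresh = cache.read("page")
```

Write-through keeps the source current for writes made via the cache, but it does nothing for changes made at the source by other parties, which is why the consistency problem remains.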
Additionally, the mechanism underlying Web caching provides only a loose model for consistency between the origin data and the cached data. Generally, Web data is cached for a period of time based on heuristics or hints in the transactions independent of changes to the origin data. This means that cached Web data can occasionally become inconsistent with the origin server and such inconsistencies are simply tolerated by Web site operators, service providers, and users as a reasonable performance trade-off. Unfortunately, this model of loose consistency is entirely inappropriate for general client-server communication like networked file systems. When a client interacts with a file server, the consistency model must be wholly correct and accurate to ensure proper operation of the application using the file system.
Where loose consistency can be tolerated, caching can work remarkably well. For example, the Domain Name System (DNS), dating back to the early 1980's, employs caching extensively to provide performance and scalability across the wide area. In this context, providing only loose consistency semantics has proven adequate. In DNS, each “name server” manages a stored dataset that represents so-called “resource records” (RR). While DNS is most commonly used to store and manage the mappings from host names to host addresses in the Internet (and vice versa), the original DNS design and its specification allow resource records to contain arbitrary data. In this model, clients send queries to servers to retrieve data from the stored data set managed by a particular server. Clients can also send queries to relays, which act as proxies and cache portions of master name servers' stored datasets. A query can be “recursive”, which causes the relay to recursively perform the query on behalf of the client. In turn, the relay can communicate with another relay and so forth until the master server is ultimately contacted. If any relay on the path from the client to the server has data in its cache that would satisfy the request, then it can return that data back to the requestor.
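A chain of relays answering from their caches might be sketched as follows. This is a simplified model and not the DNS wire protocol: the `NameNode` class and the `queries_seen` counter are hypothetical, and record expiry is deliberately omitted, so a cached answer is never refreshed.

```python
from typing import Dict, Optional


class NameNode:
    """A master server (no upstream) or a caching relay (has an upstream)."""

    def __init__(self, records: Optional[Dict[str, str]] = None,
                 upstream: Optional["NameNode"] = None) -> None:
        self.records = dict(records or {})   # authoritative data or cached copies
        self.upstream = upstream
        self.queries_seen = 0

    def resolve(self, name: str) -> Optional[str]:
        self.queries_seen += 1
        if name in self.records:
            # Answer locally; nodes further up the chain are never asked.
            return self.records[name]
        if self.upstream is None:
            return None                      # the master has no such record
        # Perform the query on behalf of the requestor.
        answer = self.upstream.resolve(name)
        if answer is not None:
            # Cached with no expiry (a simplification): loose consistency.
            self.records[name] = answer
        return answer


master = NameNode(records={"host.example": "192.0.2.10"})
relay_b = NameNode(upstream=master)
relay_a = NameNode(upstream=relay_b)

first = relay_a.resolve("host.example")    # travels relay_a -> relay_b -> master
second = relay_a.resolve("host.example")   # answered from relay_a's cache
```

If the master's record changed after the first query, both relays would keep returning the old address, which is the loose consistency that this kind of use tolerates.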
Some solutions to network responsiveness deal with the problem at the file system or at network layers. One proposed solution is the use of a low-bandwidth network file system, such as that described in Muthitacharoen, A., et al., “A Low-Bandwidth Network File System”, in Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), pp. 174-187 (Chateau Lake Louise, Banff, Canada, October 2001) (in vol. 35, 5 of ACM SIGOPS Operating Systems Review, ACM Press). In that system, called LBFS, clients employ “whole file” caching whereby upon a file open operation, the client fetches all the data in the file from the server, then operates on the locally cached copy of the file data. If the client makes changes to the file, those changes are propagated back to the server when the client closes the file. To optimize these transfers, LBFS replaces pieces of the file with hashes, and the recipient uses the hashes in conjunction with a local file store to resolve the hashes to the original portions of the file.
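The idea of replacing pieces of a file with hashes can be sketched as below. This is an illustration only and not the scheme of the cited paper: fixed-size chunks, the SHA-256 digest, the tiny `CHUNK_SIZE`, and the `encode` and `decode` helpers are all simplifying assumptions.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 8  # hypothetical; far too small to save bytes, chosen for readability

Message = List[Tuple[str, bytes]]  # ("hash", digest) or ("data", raw bytes)


def split(data: bytes) -> List[bytes]:
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def encode(data: bytes, known: Dict[bytes, bytes]) -> Message:
    """Sender: send only a digest for chunks the recipient is believed to hold."""
    message: Message = []
    for chunk in split(data):
        digest = hashlib.sha256(chunk).digest()
        if digest in known:
            message.append(("hash", digest))
        else:
            message.append(("data", chunk))
            known[digest] = chunk
    return message


def decode(message: Message, store: Dict[bytes, bytes]) -> bytes:
    """Recipient: resolve digests against the local store, learn new chunks."""
    out = []
    for kind, payload in message:
        if kind == "hash":
            out.append(store[payload])
        else:
            store[hashlib.sha256(payload).digest()] = payload
            out.append(payload)
    return b"".join(out)


sender_view: Dict[bytes, bytes] = {}
recipient_store: Dict[bytes, bytes] = {}

v1 = b"AAAAAAAABBBBBBBBCCCCCCCC"
v2 = b"AAAAAAAAXXXXXXXXCCCCCCCC"   # only the middle chunk differs

restored_v1 = decode(encode(v1, sender_view), recipient_store)
second_message = encode(v2, sender_view)
kinds = [kind for kind, _ in second_message]
restored_v2 = decode(second_message, recipient_store)
```

In the second transfer only the changed chunk travels as raw data. The sketch also shows why granularity matters: a one-byte change still forces an entire chunk to be resent.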
Such systems have limitations in that they are tied to file systems and generally require modification of the clients and servers between which responsiveness is to be improved. Furthermore, the hashing scheme operates over blocks of relatively large (average) size, which works poorly when files are subject to fine-grained changes over time. The hashes could also collide, as they are not provably unique. Finally, LBFS is by design intimately tied to a network file system protocol. It is not able to optimize or accelerate other types of client-server transactions, e.g., e-mail, Web, streaming media, and so forth.
Another proposed solution is suggested by Spring, N., et al., “A Protocol-Independent Technique for Eliminating Redundant Network Traffic”, in Proceedings of ACM SIGCOMM (August 2000). As described in that reference, network packets that are similar to recently transmitted packets can be reduced in size by identifying repeated strings and replacing the repeated strings with tokens to be resolved from a shared packet cache at either end of a network link. This approach, while beneficial, has a number of shortcomings. Because it operates solely on individual packets, the performance gains that accrue are limited by the ratio of the packet payload size to the packet header (since the packet header is generally not compressible using the described technique). Also, because the mechanism is implemented at the packet level, it only applies to regions of the network where two ends of a communicating path have been configured with the device. This configuration can be difficult to achieve, and may be impractical in certain environments. Also, by caching network packets using a relatively small memory-based cache with a first-in first-out replacement policy (without the aid of, for instance, a large disk-based backing store), the efficacy of the approach is limited to detecting and exploiting communication redundancies that are fairly localized in time.
Cache consistency in the context of network file systems has been studied. The primary challenge is to provide a consistent view of a file to multiple clients when these clients read and write the file concurrently. When multiple clients access a file for reading and at least one client accesses the same file for writing, a condition called “concurrent write sharing” occurs and measures must be taken to guarantee that reading clients do not access stale data after a writing client updates the file.
In the original Network File System (NFS) [Sandberg et al., “Design and Implementation of the Sun Network Filesystem”, In Proc. of the Summer 1985 USENIX Conference, 1985], caching is used to store disk blocks that were accessed across the network sometime in the past. An agent at the client maintains a cache of file system blocks and, to provide consistency, their last modification time. Whenever the client reads a block, the agent at the client checks to determine if the requested block is in its local cache. If it is and the last modification time is less than some configurable parameter (to provide a medium level of time-based consistency), then the block is returned by the agent. If the modification time is greater than the parameter, then the last-modification time for the file is fetched from the server. If that time is the same as the last modification time of the data in the cache, then the request is returned from the cache. Otherwise, the file has been modified so all blocks of that file present in the local cache are flushed and the read request is sent to the server. To provide tighter consistency semantics, NFS can employ locking via the NFS Lock Manager (NLM). Under this configuration, when the agent at the client detects the locking condition, it disables caching and thus forces all requests to be serviced at the server, thereby ensuring strong consistency.
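The validation steps above might be sketched as follows. The sketch interprets the configurable parameter as a freshness window measured from the last time the client checked with the server, which is an assumption rather than something the description states. The `FileServer` and `ClientAgent` classes, the integer clock, and all method names are hypothetical.

```python
from typing import Dict, Tuple

BlockKey = Tuple[str, int]  # (path, block number)


class FileServer:
    """Hypothetical stand-in for the remote server."""

    def __init__(self) -> None:
        self.blocks: Dict[BlockKey, bytes] = {}
        self.mtime: Dict[str, int] = {}
        self.block_reads = 0

    def write(self, path: str, block: int, data: bytes, now: int) -> None:
        self.blocks[(path, block)] = data
        self.mtime[path] = now

    def get_mtime(self, path: str) -> int:
        return self.mtime[path]

    def read(self, path: str, block: int) -> bytes:
        self.block_reads += 1
        return self.blocks[(path, block)]


class ClientAgent:
    def __init__(self, server: FileServer, freshness_window: int) -> None:
        self.server = server
        self.window = freshness_window
        self.cache: Dict[BlockKey, bytes] = {}
        self.seen_mtime: Dict[str, int] = {}     # server mtime at last check
        self.validated_at: Dict[str, int] = {}   # local time of last check

    def read(self, path: str, block: int, now: int) -> bytes:
        key = (path, block)
        fresh = (path in self.validated_at
                 and now - self.validated_at[path] < self.window)
        if key in self.cache and fresh:
            return self.cache[key]               # trusted without asking the server
        # Otherwise fetch the file's last-modification time from the server.
        current = self.server.get_mtime(path)
        if self.seen_mtime.get(path) != current:
            # File was modified: flush every cached block of this file.
            self.cache = {k: v for k, v in self.cache.items() if k[0] != path}
            self.seen_mtime[path] = current
        self.validated_at[path] = now
        if key not in self.cache:
            self.cache[key] = self.server.read(path, block)
        return self.cache[key]


server = FileServer()
server.write("/f", 0, b"v1", now=0)
agent = ClientAgent(server, freshness_window=3)

a = agent.read("/f", 0, now=1)        # miss: fetched from the server
server.write("/f", 0, b"v2", now=2)   # another party changes the file
b = agent.read("/f", 0, now=3)        # inside the window: stale data returned
c = agent.read("/f", 0, now=5)        # window expired: change detected, refetched
d = agent.read("/f", 0, now=9)        # revalidated, unchanged: served from cache
```

The read at time 3 returns the old contents, which illustrates why time-based validation offers only a medium level of consistency.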
When blocks are not present in the local cache, NFS attempts to combat latency with the well-known “read-ahead” algorithm, which dates back to at least the early 1970's as it was employed in the Multics I/O System [Feiertag and Organick, “The Multics Input/Output System”, Third ACM Symposium on Operating System Principles, October 1971]. The read-ahead algorithm exploits the observation that clients often open files and sequentially read each block. That is, when a client accesses block k, it is likely in the future to access block k+1. In read-ahead, a process or agent fetches blocks ahead of the client's request and stores those blocks in the cache in anticipation of the client's forthcoming request. In this fashion, NFS can mask the latency of fetching blocks from a server when the read-ahead turns out to successfully predict the client read patterns. Read-ahead is widely deployed in modern file systems.
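A minimal sketch of read-ahead for a sequential reader is shown below. The `ReadAheadCache` class and its `depth` parameter are hypothetical, and the prefetch runs synchronously here for simplicity, whereas a real agent would typically issue it in the background.

```python
from typing import Dict, List


class ReadAheadCache:
    def __init__(self, blocks: List[bytes], depth: int = 2) -> None:
        self.blocks = blocks          # stands in for the file on the remote server
        self.depth = depth            # hypothetical number of blocks to prefetch
        self.cache: Dict[int, bytes] = {}
        self.client_misses = 0        # reads that had to wait on the server

    def _fetch(self, k: int) -> None:
        # Fetch block k from the "server" unless it is out of range or cached.
        if 0 <= k < len(self.blocks) and k not in self.cache:
            self.cache[k] = self.blocks[k]

    def read(self, k: int) -> bytes:
        if k not in self.cache:
            self.client_misses += 1
            self._fetch(k)
        # A client reading block k is likely to want k+1 next, so fetch ahead.
        for ahead in range(k + 1, k + 1 + self.depth):
            self._fetch(ahead)
        return self.cache[k]


remote_file = [b"b0", b"b1", b"b2", b"b3", b"b4"]
reader = ReadAheadCache(remote_file, depth=2)
data = [reader.read(k) for k in range(len(remote_file))]
```

With a strictly sequential access pattern only the first read waits on the server. A random access pattern would defeat the prediction and waste the prefetched blocks.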
In the Andrew File System (AFS) [Howard, “An Overview of the Andrew File System”, In Proc. of the USENIX Winter Technical Conference, February 1988], “whole-file” caching is used instead of block-based caching. Here, when a client opens a file, an agent at the client checks to see if the file is resident in its local disk cache. If it is, it checks with the server to see if the cached file is valid (i.e., that there have not been any modifications since the file was cached). If not (or if the file was not in the cache to begin with), a new version of the file is fetched from the server and stored in the cache. All client file activity is then intercepted by the agent at the client and operations are performed on the cached copy of the file. When the client closes the file, any modifications are written back to the server. This approach provides only “close-to-open” consistency because changes by multiple clients to the same file are only serialized and written back to the server on each file close operation.
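Whole-file caching with close-to-open consistency might be sketched as follows. The `Server` and `WholeFileClient` classes and the integer version counter used for validation are illustrative assumptions and are not taken from the AFS implementation.

```python
from typing import Dict


class Server:
    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.version: Dict[str, int] = {}

    def store(self, path: str, data: bytes) -> None:
        self.files[path] = data
        self.version[path] = self.version.get(path, 0) + 1


class WholeFileClient:
    def __init__(self, server: Server) -> None:
        self.server = server
        self.cache: Dict[str, bytes] = {}
        self.cached_version: Dict[str, int] = {}
        self.dirty: Dict[str, bool] = {}

    def open(self, path: str) -> None:
        # Validate the cached copy; refetch the whole file if missing or stale.
        if self.cached_version.get(path) != self.server.version[path]:
            self.cache[path] = self.server.files[path]
            self.cached_version[path] = self.server.version[path]
        self.dirty[path] = False

    def read(self, path: str) -> bytes:
        return self.cache[path]          # all activity uses the cached copy

    def write(self, path: str, data: bytes) -> None:
        self.cache[path] = data
        self.dirty[path] = True

    def close(self, path: str) -> None:
        if self.dirty.get(path):
            # Modifications reach the server only when the file is closed.
            self.server.store(path, self.cache[path])
            self.cached_version[path] = self.server.version[path]
            self.dirty[path] = False


server = Server()
server.store("/doc", b"v1")
client_a = WholeFileClient(server)
client_b = WholeFileClient(server)

client_a.open("/doc")
client_b.open("/doc")
client_a.write("/doc", b"from A")
client_a.close("/doc")                    # A's change is written back here
while_open = client_b.read("/doc")        # B still sees its cached copy
client_b.close("/doc")
client_b.open("/doc")                     # validation on open detects the change
after_reopen = client_b.read("/doc")
```

Client B sees the change only after reopening the file, which is what close-to-open consistency means in practice.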
Another mechanism called “opportunistic locking” was employed by the Server Message Block (SMB) Protocol, now called CIFS, to provide consistency. In this approach, when a file is opened the client (or client agent) can request an opportunistic lock or oplock associated with the file. If the server grants the oplock, then the client can assume no modifications will be made to the file during the time the lock is held. If another client attempts to open the file for writing (i.e., concurrent write sharing arises), then the server breaks the oplock previously granted to the first client, then grants the second client write access to the file. Given this condition, the first client is forced to send all reads to the server for the files for which it does not hold an oplock. A similar mechanism was employed in the Sprite distributed file system, where the server would notify all relevant clients when it detected concurrent write sharing [Nelson, Welch, and Ousterhout, “Caching in the Sprite Network File System”, ACM Transactions on Computer Systems, 6(1), February, 1988].
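The oplock grant-and-break sequence can be sketched as below. This is a simplified model and not the SMB/CIFS protocol: the `OplockServer` and `OplockClient` classes are hypothetical, file close is not modeled, and once an oplock has been broken the sketch grants no further oplocks on that file.

```python
from typing import Dict, Set


class OplockServer:
    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.holder: Dict[str, "OplockClient"] = {}
        self.shared: Set[str] = set()    # files on which an oplock was broken
        self.reads_served = 0

    def open(self, path: str, client: "OplockClient",
             for_write: bool = False) -> bool:
        """Return True if the client holds an oplock after this open."""
        if path in self.shared:
            return False
        holder = self.holder.get(path)
        if holder is None:
            self.holder[path] = client
            return True
        if holder is client:
            return True
        if for_write:
            # Concurrent write sharing: break the first client's oplock.
            del self.holder[path]
            self.shared.add(path)
            holder.oplock_broken(path)
        return False

    def read(self, path: str) -> bytes:
        self.reads_served += 1
        return self.data[path]

    def write(self, path: str, data: bytes) -> None:
        self.data[path] = data


class OplockClient:
    def __init__(self, server: OplockServer) -> None:
        self.server = server
        self.oplocks: Set[str] = set()
        self.cache: Dict[str, bytes] = {}

    def open(self, path: str, for_write: bool = False) -> None:
        if self.server.open(path, self, for_write):
            self.oplocks.add(path)

    def oplock_broken(self, path: str) -> None:
        self.oplocks.discard(path)
        self.cache.pop(path, None)       # cached data can no longer be trusted

    def read(self, path: str) -> bytes:
        if path in self.oplocks:
            if path not in self.cache:
                self.cache[path] = self.server.read(path)
            return self.cache[path]
        return self.server.read(path)    # no oplock: every read goes to the server

    def write(self, path: str, data: bytes) -> None:
        self.server.write(path, data)
        if path in self.oplocks:
            self.cache[path] = data


server = OplockServer()
server.data["/f"] = b"v1"
first = OplockClient(server)
second = OplockClient(server)

first.open("/f")                      # oplock granted
r1 = first.read("/f")                 # fetched once, then cached
r2 = first.read("/f")                 # served locally
second.open("/f", for_write=True)     # server breaks the first client's oplock
second.write("/f", b"v2")
r3 = first.read("/f")                 # sent to the server
r4 = first.read("/f")                 # sent to the server again
```

After the break, the first client pays a round trip for every read, which is the cost of preserving consistency under concurrent write sharing.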
When consistency mechanisms are combined with network caching, a great deal of complexity arises. For example, if a data caching architecture such as that used by DNS or the Web were applied to file systems, it would have to include a consistency protocol that could manage concurrent write sharing conditions when they arise. In this model, each node, or network cache, in the system contains a cache of file data that can be accessed by different clients. The file data in the cache is indexed by file identification information, relating the image of data in the cache to the server and file it came from. Just like NFS, a cache could enhance performance in certain cases by using read-ahead to retrieve file data ahead of a client's request and storing said retrieved data in the cache. Upon detecting concurrent write sharing, such a system could force all reads and writes to be synchronized at a single caching node, thereby assuring consistency. This approach is burdened by a great deal of complexity in managing consistency across all the caches in the system. Moreover, the system's concurrency model assumes that all file activity is managed by its caches; if a client modifies data directly on the server, consistency errors could arise. Also, its ability to overcome network latency for client accesses to data that is not resident in the cache is limited to performing file-based read-ahead. For example, in NFS, a client that opens a file must look up each component of the path (once per round-trip) to ultimately locate the desired file handle, and file-based read-ahead does nothing to eliminate these round-trips. Finally, the system must perform complex protocol conversions between the native protocols that the clients and servers speak and the system's internal caching protocols, effectively requiring that the system replicate the functionality of a server (to interoperate with a client) and a client (to interoperate with a server).
A different approach to dealing with network latency when clients access data that is not in the cache is to predict file access patterns. A number of research publications describe approaches that attempt to predict the next file (or files) a client might access based on the files it is currently accessing and has accessed in the past, see [Amer et al., “File Access Prediction with Adjustable Accuracy”, In Proc. of the International Performance Conference on Computers and Communication, April 2002], [Lei and Duchamp, “An Analytical Approach to File Prefetching”, In Proc. of the 1997 Annual USENIX Conference, January 1997], [Griffioen and Appleton, “Reducing File System Latency using a Predictive Approach”, In Proc. of the 1994 Summer USENIX Conference, June 1994], [Kroeger and Long, “The Case for Efficient File Access Pattern Modeling”, In Proc. of the Seventh Workshop on Hot Topics in Operating Systems, March 1999]. Based on these prediction models, these systems pre-fetch the predicted files by reading them into a cache. Unfortunately, this approach presumes the existence of a cache and thus entails the complexities and difficulties of cache coherency.
In the context of the World-Wide Web, other research has applied this prediction concept to Web objects [Padmanabhan and Mogul, “Using Predictive Prefetching to Improve World Wide Web Latency”, ACM SIGCOMM, Computer Communication Review 26(3), July 1996]. In this approach, the server keeps track of client access patterns and passes this information as a hint to the client. The client in turn can choose to pre-fetch into its cache the URLs that correspond to the hinted objects. Again, this approach presumes the existence of a cache, and can be deployed without disrupting the semantics of the Web protocols only because the Web is generally read-only and does not require strong consistency.
Unfortunately, while many of the above techniques solve some aspects of WAN performance problems, they still have some shortcomings. In view of the above problems and the limitations with existing solutions, improvements can be made in how and when data is transported for transactions over a network, along with mechanisms for implementing such transport.