Local Area Network (LAN) communication is characterized by generous bandwidths, low latencies and considerable enterprise control over the network. By contrast, Wide Area Networks (WANs) often have lower bandwidths and higher latencies than LANs and often have a measure of network control that is outside the enterprise for which the WAN is being used.
Wide-area client-server applications are a critical part of almost any large enterprise. A WAN might be used to provide access to widely used and critical infrastructure, such as file servers, mail servers and networked storage. This access most often has very poor throughput when compared to the performance across a LAN. Whether an enterprise is taking a centralized approach or a distributed approach, high performance communication across the WAN is essential in order to minimize costs and maximize productivity. Enterprise IT managers today typically take one of two approaches to compensate for the performance challenges inherent in WANs:
1. If the IT infrastructure is decentralized and they intend to keep it that way, corporate network and server managers typically have to deploy local file servers, data storage, mail servers, and backup systems with some redundancy in each remote office to ensure fast and reliable access to critical data and applications at each office. They also may maintain over-provisioned WAN links in order to enable reasonable levels of performance for file transfers and similar data-intensive tasks.
2. If the IT infrastructure is already centralized or semi-centralized, the enterprise will be faced with a constant demand for “more bandwidth” to remote sites in an effort to improve the performance for distributed users.
Causes of Poor WAN Throughput
The two primary causes of the slow throughput on WANs are well known: high delay (or latency) and limited bandwidth. The “bandwidth” of a network or channel refers to a measure of the number of bits that can be transmitted over a link or path per unit of time. “Latency” refers to a measure of the amount of time that transpires while the bits traverse the network, e.g., the time it takes a given bit transmitted from the sender to reach the destination. “Round-trip time” refers to the sum of the “source-to-destination” latency and the “destination-to-source” latency. If the underlying paths are asymmetric, the round-trip latency might be different than twice a one-way latency. The term “throughput” is sometimes confused with bandwidth but refers to a measure of an attained transfer rate that a client-server application, protocol, etc. achieves over a network path. Throughput is typically less than the available network bandwidth.
The speed of light, a fundamental and fixed constant, implies that information transmitted across a network always incurs some nonzero latency as it travels from the source to the destination. In practical terms, this means that sending a packet from Silicon Valley to New York and back could never occur faster than about 30 milliseconds (ms), the time information in an electromagnetic signal would take to travel that distance in a direct path cross-country. In reality, this cross-country round trip time is more in the range of 100 ms or so, as signals in fiber or copper do not always travel at the speed of light in a vacuum and packets incur processing delays through each switch and router. This amount of latency is quite significant as it is at least two orders of magnitude higher than typical sub-millisecond LAN latencies.
Other round-trips might have more latency. Round trips from the West Coast of the U.S. to Europe can be in the range of 100-200 ms, and some links using geo-stationary satellites into remote sites can have latencies in the 500-800 ms range. With latencies higher than about 50 ms, many client-server protocols and applications will function poorly relative to a LAN as those protocols and applications expect very low latency.
While many employees routinely depend upon Fast Ethernet (100 Mbps) or Gigabit Ethernet (1 Gbps) within most corporate sites and headquarters facilities, the bandwidth interconnecting many corporate and industrial sites in the world is much lower. Even with DSL, Frame Relay or other broadband technologies, WAN connections are slow relative to a LAN. For example, 1 Mbps DSL service offers only 1/100th the bandwidth of Fast Ethernet and 1/1,000th of what is available using Gigabit Ethernet.
While some places might have high bandwidth backbone networks, such as the Metro Ethernet available in South Korea and Japan, the latency and bandwidth issues persist whenever data needs to travel outside areas with such networks. For example, a Japanese manufacturer with plants in Japan and the U.S. might need to send CAD/CAM files back and forth between plants. The latency from Japan to the East Coast of the U.S. might be as high as 200 ms and trans-Pacific bandwidth can be expensive and limited.
WAN network bandwidth limits almost always impact client-server application throughput across the WAN, but more bandwidth can be bought. Lower latency, by contrast, cannot be bought if it would require faster-than-light communication. In some cases, network latency is the bottleneck on performance or throughput. This is often the case with window-based transport protocols such as TCP or request-response protocols such as the Common Internet File System (CIFS) protocol or the Network File System (NFS) protocol. High network latency particularly slows down “chatty” applications, even if the actual amounts of data transmitted in each transaction are not large. “Chatty” applications are those in which client-server interactions involve many back-and-forth steps that might not even depend on each other. Adding bandwidth (or compressing data) does not improve the throughput of these protocols and applications when the round-trip time exceeds some critical point; once the latency reaches that critical point, throughput decays quickly.
This phenomenon can be understood intuitively: the rate of work that can be performed by a client-server application that executes serialized steps to accomplish its tasks is inversely proportional to the round-trip time between the client and the server. If the client-server application is bottlenecked in a serialized computation (i.e., it is “chatty”), then increasing the round-trip by a factor of two causes the throughput to decrease by a factor of two because it takes twice as long to perform each step (while the client waits for the server and vice versa).
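As a rough illustration of this inverse relationship, the following Python sketch estimates how long a task built from serialized request-response steps would take. The function names, the step count of 200, and the two round-trip times are assumptions chosen for illustration and are not taken from any particular protocol.

```python
def serialized_task_time_s(steps, rtt_s, server_time_s=0.0):
    """Time to finish `steps` exchanges when each must wait for the previous one."""
    return steps * (rtt_s + server_time_s)


def steps_per_second(rtt_s, server_time_s=0.0):
    """Rate of work for a fully serialized ("chatty") exchange."""
    return 1.0 / (rtt_s + server_time_s)


# Assumed example: 200 serialized steps with negligible server time.
lan_s = serialized_task_time_s(200, rtt_s=0.0005)   # 0.5 ms LAN round trip
wan_s = serialized_task_time_s(200, rtt_s=0.100)    # 100 ms WAN round trip
print(f"LAN: {lan_s:.1f} s, WAN: {wan_s:.1f} s")
```

With these assumed numbers the same task takes 0.1 s on the LAN and 20 s across the WAN, even though no additional data is transferred.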
More generally, the throughput of client-server applications that are not necessarily chatty but run over a window-based protocol (such as TCP) can also suffer from a similar fate. This can be modeled with a simple equation that accounts for the round-trip time (RTT) and the protocol window (W). The window defines how much data the sender can transmit before requiring receipt of an acknowledgement from the receiver. Once a window's worth of data is sent, the sender must wait until it hears from the receiver. Since it takes a round-trip time to receive the acknowledgement from the receiver, the rate at which data can be sent is simply the window size divided by the round trip time:
T = W/RTT
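The relation T = W/RTT can be evaluated directly. The sketch below is only an illustration: the 64 KB window and the three round-trip times are assumed example values, and the result is the ceiling imposed by the window alone, ignoring the link bandwidth.

```python
def window_limited_throughput_bps(window_bytes, rtt_s):
    """T = W / RTT, with the window converted from bytes to bits."""
    return window_bytes * 8 / rtt_s


WINDOW_BYTES = 64 * 1024  # assumed example window, not a value fixed by the model

for rtt_ms in (1, 10, 100):
    t = window_limited_throughput_bps(WINDOW_BYTES, rtt_ms / 1000)
    print(f"RTT {rtt_ms:>3} ms -> at most {t / 1e6:.2f} Mb/s")
```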
The optimal choice of window size depends on a number of factors. To perform well across a range of network conditions, a TCP device attempts to adapt its window to the underlying capacity of the network. So, if the underlying bottleneck bandwidth (or the TCP sender's share of the bandwidth) is roughly B bits per second, then a TCP device attempts to set its window to B×RTT, and the throughput, T, would be:
T = (B×RTT)/RTT = B
In other words, the throughput would be equal to the available rate. Unfortunately, there are often other constraints. Many protocols, such as TCP and CIFS, have an upper bound on the window size that is built into the protocol. For example, the maximum request size in CIFS is 64 KB and in the original TCP protocol, the maximum window size was limited by the fact that the advertised window field in the protocol header is 16 bits, limiting the window also to 64 KB. While modern TCP stacks implement the window scaling method in RFC 1323 to overcome this problem, there are still many legacy TCP implementations that do not negotiate scaled windows, and there are more protocols such as CIFS that have application-level limits on top of the TCP window limit. So, in practice, the throughput is actually limited by the maximum window size (MWS):
T = min(B×RTT, MWS)/RTT ≤ B
Even worse, there is an additional constraint on throughput that is fundamental to the congestion control algorithm designed into TCP. This flaw turns out to be non-negligible in wide-area networks where bandwidth is above a few megabits and is probably the key reason why enterprises often fail to see marked performance improvements of individual applications after substantial bandwidth upgrades.
Essentially, this problem stems from conflicting goals of the TCP congestion control algorithm that are exacerbated in a high-delay environment. Namely, upon detecting packet loss, a TCP device reacts quickly and significantly to err on the side of safety (i.e., to prevent a set of TCP connections from overloading and congesting the network). Yet, to probe for available bandwidth, a TCP device will dynamically adjust its sending rate and continually push the network into momentary periods of congestion that cause packet loss to detect bandwidth limits. In short, a TCP device continually sends the network into congestion then aggressively backs off. In a high-latency environment, the slow reaction time results in throughput limitations.
An equation was derived in the late 1990's that models the behavior of a network as a function of the packet loss rate that TCP induces and that equation is:
CWS = 1.2×S/sqrt(p)
As indicated by that equation, the average congestion window size (CWS) is roughly determined by the packet size (S) and the loss rate (p). Taking this into account, the actual throughput of a client-server application running over TCP is:
T = W/RTT = min(MWS, CWS, B×RTT)/RTT
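The combined model can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, windows are expressed in bytes, and the 1500-byte packet size, the 65,535-byte maximum window, the 0.1 percent loss rate and the two link rates are assumed example inputs.

```python
import math


def congestion_window_bytes(packet_bytes, loss_rate):
    """CWS = 1.2 * S / sqrt(p), in the same units as S (bytes here)."""
    return 1.2 * packet_bytes / math.sqrt(loss_rate)


def tcp_throughput_bps(bandwidth_bps, rtt_s, loss_rate,
                       packet_bytes=1500, max_window_bytes=65535):
    """T = min(MWS, CWS, B x RTT) / RTT, returned in bits per second."""
    bdp_bytes = bandwidth_bps * rtt_s / 8            # B x RTT, converted to bytes
    cws_bytes = congestion_window_bytes(packet_bytes, loss_rate)
    window_bytes = min(max_window_bytes, cws_bytes, bdp_bytes)
    return window_bytes * 8 / rtt_s


# Assumed inputs: 100 ms round trip, 0.1 percent loss.
for name, bandwidth in (("1.544 Mb/s", 1.544e6), ("45 Mb/s", 45e6)):
    t = tcp_throughput_bps(bandwidth, rtt_s=0.100, loss_rate=0.001)
    print(f"{name} link -> {t / 1e6:.2f} Mb/s")
```

Under these assumed inputs the slower link remains limited by its own bandwidth, while the faster link is limited by the congestion window and reaches only about a tenth of its line rate.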
FIG. 1 is a graph that illustrates this problem from a very practical perspective. That graph shows the performance of a TCP data transfer when the network is experiencing a low degree of network loss (less than 1/10 of 1 percent) for increasing amounts of latency. The bottom curve represents the TCP throughput achievable from a T1 line, which is roughly equal to the available bandwidth (1.544 Mb/s) all the way up to 100 ms latencies. The top curve, on the other hand, illustrates the performance impact of the protocol window at higher bandwidths. With a T3 line, the TCP throughput starts out at the available line rate (45 Mb/s) at low latencies, but at higher latencies the throughput begins to decay rapidly (in fact, hyperbolically). This effect is so dramatic that at a 100 ms delay (i.e., a typical cross-country link), TCP throughput is only 4.5 Mb/s of the 45 Mb/s link.
Under such conditions, application performance does not always increase when additional bandwidth is added. As FIG. 1 shows, if the round trip time (RTT) is greater than a critical point (just 15 ms or so in this example) then increasing the bandwidth of the link will only marginally improve throughput at higher latency, and at even higher latencies throughput is not increased at all by increases in bandwidth.
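The critical point is the round-trip time at which the bandwidth-delay product equals the effective window. The sketch below computes it for an assumed 64 KB effective window on a 45 Mb/s link and shows that doubling the bandwidth beyond that point leaves the modeled throughput unchanged. The exact value depends on the effective window, which is an assumption here, so it need not match the figure.

```python
def critical_rtt_s(bandwidth_bps, window_bytes):
    """RTT at which B x RTT equals the window; above it the window is the limit."""
    return window_bytes * 8 / bandwidth_bps


def throughput_bps(bandwidth_bps, rtt_s, window_bytes):
    """min(B, W / RTT): link-limited below the critical RTT, window-limited above."""
    return min(bandwidth_bps, window_bytes * 8 / rtt_s)


WINDOW_BYTES = 64 * 1024   # assumed effective window
T3_BPS = 45_000_000

print(f"critical RTT: {critical_rtt_s(T3_BPS, WINDOW_BYTES) * 1000:.1f} ms")
for bandwidth in (T3_BPS, 2 * T3_BPS):
    t = throughput_bps(bandwidth, 0.100, WINDOW_BYTES)
    print(f"{bandwidth / 1e6:.0f} Mb/s link at 100 ms -> {t / 1e6:.2f} Mb/s")
```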
FIG. 2 graphs a surface of the throughput model derived above, presuming a TCP transfer over a 45 Mb/s T3 link. The surface plots throughput as a function of both round-trip times and loss rates. This graph shows that both increasing loss and increasing latency impair performance. While latency has the more dramatic impact, they combine to severely impact performance. Even in environments with relatively low loss rates and normal WAN latencies, throughput can be dramatically limited.
Existing Approaches to Overcoming WAN Throughput Problems
Given the high costs and performance challenges of WAN-based enterprise computing and communication, many approaches have been proposed for dealing with these problems.
Perhaps the simplest approach to dealing with performance is to upgrade the available bandwidth in the network. Of course this is the most direct solution, but it is not always the most effective approach. First of all, contrary to popular belief, bandwidth is not free and the costs add up quickly for large enterprises that may have hundreds of offices. Second, as discussed earlier, adding bandwidth does not necessarily improve throughput. Third, in some places adding more bandwidth is not possible, especially across international sites, in remote areas, or where it is simply too expensive to justify.
Another approach is to embed intelligence in the applications themselves, e.g., to exploit the fact that data often changes in incremental ways so that the application can be designed to send just incremental updates between clients and servers. Usually, this type of approach employs some sort of versioning system to keep track of version numbers of files (or data objects) so that differences between versioned data can be sent between application components across the network. For example, some content management systems have this capability and storage backup software generally employs this basic approach. However, these systems do not deal with scenarios where data is manipulated outside of their domain. For example, when a file is renamed and re-entered into the system the changes between the old and new versions are not captured. Likewise, when data flows between distinct applications (e.g., a file is copied out of a content management system and into a file system), versioning cannot be carried out between the different components.
This approach of managing versions and communicating updates can be viewed as one specific (and application-specific) approach to compression. More generally, data compression systems can be utilized to ameliorate network bandwidth bottlenecks. Compression is a process of representing one set of data with another set of data wherein the second set of data is, on average, a smaller number of bits than the first set of data, such that the first set of data, or at least a sufficient approximation of the first set of data, can be recovered from an inverse of the compression process in most cases. Compression allows for more efficient use of a limited bandwidth and might result in less latency, but in some cases, no latency improvement occurs. In some cases, compression might add to the latency, if time is needed to compress data after the request is made and time is needed to decompress the data after it is received. This can be improved if the data is compressed ahead of time, before the request is made, but that may not be feasible if the data is not necessarily available ahead of time for compression, or if the volume of data from which the request will be served is too large relative to the amount of data likely to be used.
One way to deploy compression is to embed it in applications. For example, a Web server can compress the HTML pages it returns before delivering them across the network to end clients. Another approach is to deploy compression in the network without having to modify the applications. For many years, network devices have included compression options as features (e.g., in routers, modems, dedicated compression devices, etc.) [D. Rand, “The PPP Compression Control Protocol (CCP)”, Request-for-Comments 1962, June 1996]. This is a reasonable thing to do, but the effectiveness is limited. Most methods of lossless data compression reduce the amount of data (i.e., bandwidth) by a factor of 1.5 to 4, depending on the inherent redundancy present. While helpful, it is not enough to dramatically change performance if the amount of data being sent is large or similar data is sent repeatedly, perhaps over longer time scales. Also, when performance is limited by network latency, compressing the underlying data will have little or no impact.
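The difference between a bandwidth-limited and a latency-limited transfer can be illustrated with Python's standard zlib module. The transfer-time formula, the 1 Mb/s link, the 100 ms round trip and the synthetic payload are assumptions made for this sketch. The payload is deliberately repetitive and compresses far better than the factor of 1.5 to 4 noted above for typical data.

```python
import zlib


def transfer_time_s(total_bytes, bandwidth_bps, rtt_s, round_trips):
    """Time on the wire plus one round trip per serialized exchange."""
    return total_bytes * 8 / bandwidth_bps + round_trips * rtt_s


# Synthetic, highly redundant payload (assumed); real traffic is less compressible.
payload = b"<tr><td>item</td><td>0.00</td></tr>\n" * 2000
packed = zlib.compress(payload)
assert zlib.decompress(packed) == payload   # lossless round trip

BANDWIDTH_BPS = 1_000_000   # assumed 1 Mb/s link
RTT_S = 0.100               # assumed 100 ms round trip

# One bulk transfer: dominated by bandwidth, so compression helps.
bulk_raw = transfer_time_s(len(payload), BANDWIDTH_BPS, RTT_S, round_trips=1)
bulk_packed = transfer_time_s(len(packed), BANDWIDTH_BPS, RTT_S, round_trips=1)

# Same bytes spread over 500 serialized exchanges: dominated by latency.
chatty_raw = transfer_time_s(len(payload), BANDWIDTH_BPS, RTT_S, round_trips=500)
chatty_packed = transfer_time_s(len(packed), BANDWIDTH_BPS, RTT_S, round_trips=500)

print(f"bulk:   {bulk_raw:.2f} s -> {bulk_packed:.2f} s")
print(f"chatty: {chatty_raw:.2f} s -> {chatty_packed:.2f} s")
```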
Rather than compress the data, another approach to working around WAN bottlenecks is to replicate servers and server data in local servers for quick access. This approach in particular addresses the network latency problem because a client in a remote site can now interact with a local server rather than a remote server. There are several methods available to enterprises to store redundant copies of data in replicated file systems, redundant or local storage servers, or by using any number of distributed file systems. The challenge with this kind of approach is the basic problem of managing the ever-exploding amount of data, which requires scaling up storage, application and file servers in many places, and trying to make sure that the files people need are indeed available where and when they are needed. Moreover, these approaches are generally non-transparent, meaning the clients and servers must be modified to implement and interact with the agents and/or devices that perform the replication function. For example, if a file server is replicated to a remote branch, the server must be configured to send updates to the replica and certain clients must be configured to interact with the replica while others need to be configured to interact with the original server.
Rather than replicate servers, another approach is to deploy transport-level or application-level devices called “proxies”, which function as performance-enhancing intermediaries between the client and the server. In this case, a proxy is the terminus for the client connection and initiates another connection to the server on behalf of the client. Alternatively, the proxy connects to one or more other proxies that in turn connect to the server. Each proxy may forward, modify, or otherwise transform the transactions as they flow from the client to the server and vice versa. Examples of proxies include (1) Web proxies that enhance performance through caching or enhance security by controlling access to servers, (2) mail relays that forward mail from a client to another mail server, (3) DNS relays that cache DNS name resolutions, and so forth.
With a proxy situated between the client and server, the performance impairments of network latency can be addressed by having the proxy cache data. Caching is a process of storing previously transmitted results in the hopes that the user will request the results again and receive a response more quickly from the cache than if the results had to come from the original provider. Caching also provides some help in mitigating both latency and bandwidth bottlenecks, but in some situations it does not help much. For example, where a single processor is retrieving data from memory it controls and does so in a repetitive fashion, as might be the case when reading processor instructions from memory, caching can greatly speed a processor's tasks. Similarly, file systems have employed caching mechanisms to store recently accessed disk blocks in host memory so that subsequent accesses to cached blocks are completed much faster than reading them in from disk again as in BSD Fast File System [McKusick, et al., “A Fast File System for BSD”, ACM Transactions on Computer Systems, Vol. 2(3), 1984], the Log-structured File System [Rosenblum and Ousterhout, “The Design and Implementation of a Log-structured File System”, ACM Transactions on Computer Systems, Vol. 10(1), 1992], etc. In a typical cache arrangement, a requestor requests data from some memory, device or the like and the results are provided to the requestor and stored in a cache having a faster response time than the original device supplying the data. Then, when the requestor requests that data again, if it is still in the cache, the cache can return the data in response to the request before the original device could have returned it and the request is satisfied that much sooner.
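The typical cache arrangement described above can be sketched as a small read cache placed in front of a slower origin. The class name, the origin function and the key are hypothetical and serve only to illustrate the hit and miss paths. The sketch does not address eviction or consistency.

```python
class ReadCache:
    """Stores previously fetched results and serves repeats locally."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._fetch(key)      # slow path: go to the original provider
        self._store[key] = value
        return value


origin_calls = []


def slow_origin(key):
    """Hypothetical stand-in for a fetch across the WAN."""
    origin_calls.append(key)
    return f"contents of {key}"


cache = ReadCache(slow_origin)
first = cache.get("/index.html")    # miss: fetched from the origin
second = cache.get("/index.html")   # hit: served from the cache
```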
Caching has its difficulties, one of which is that the data might change at the source and the cache would then be supplying “stale” data to the requestor. This is the “cache consistency” problem. Because of this, caches are often “read only” requiring that changes to data be transmitted through the cache back to the source in a “write-through” fashion. Another problem with caching is that the original source of the data might want to track usage of data and would not be aware of uses that were served from the cache as opposed to from the original source. For example, where a Web server is remote from a number of computers running Web browsers that are “pointed to” that Web server, the Web browsers might cache Web pages from that site as they are viewed, to avoid delays that might occur in downloading the Web page again. While this would improve performance in many cases, and reduce the load on the Web server, the Web server operator might try to track the total number of “page views” but would be ignorant of those served by the cache. In some cases, an Internet service provider might operate the cache remote from the browsers and provide cached content for a large number of browsers, so a Web server operator might even miss unique users entirely.
Where loose consistency can be tolerated, caching can work remarkably well. For example, the Domain Name System (DNS), dating back to the early 1980's, employs caching extensively to provide performance and scalability across the wide area. In this context, providing only loose consistency semantics has proven adequate. In DNS, each “name server” manages a stored dataset that represents so-called “resource records” (RR). While DNS is most commonly used to store and manage the mappings from host names to host addresses in the Internet (and vice versa), the original DNS design and its specification allow resource records to contain arbitrary data. In this model, clients send queries to servers to retrieve data from the stored data set managed by a particular server. Clients can also send queries to relays, which act as proxies and cache portions of master name servers' stored datasets. A query can be “recursive”, which causes the relay to recursively perform the query on behalf of the client. In turn, the relay can communicate with another relay and so forth until the master server is ultimately contacted. If any relay on the path from the client to the server has data in its cache that would satisfy the request, then it can return that data back to the requestor.
As with DNS, the mechanism underlying Web caching provides only a loose model for consistency between the origin data and the cached data. Generally, Web data is cached for a period of time based on heuristics or hints in the transactions independent of changes to the origin data. This means that cached Web data can occasionally become inconsistent with the origin server and such inconsistencies are simply tolerated by Web site operators, service providers, and users as a reasonable performance trade-off. Unfortunately, this model of loose consistency is entirely inappropriate for general client-server communication such as networked file systems. When a client interacts with a file server, the consistency model must be wholly correct and accurate to ensure proper operation of the application using the file system.
Cache consistency in the context of network file systems has been studied. The primary challenge is to provide a consistent view of a file to multiple clients when these clients read and write the file concurrently. When multiple clients access a file for reading and at least one client accesses the same file for writing, a condition called “concurrent write sharing” occurs and measures must be taken to guarantee that reading clients do not access stale data after a writing client updates the file.
In the original Network File System (NFS) [Sandberg et al., “Design and Implementation of the Sun Network Filesystem”, In Proc. of the Summer 1985 USENIX Conference, 1985], caching is used to store disk blocks that were accessed across the network sometime in the past. An agent at the client maintains a cache of file system blocks and, to provide consistency, their last modification time. Whenever the client reads a block, the agent at the client checks to determine if the requested block is in its local cache. If it is and the last modification time is less than some configurable parameter (to provide a medium level of time-based consistency), then the block is returned by the agent. If the modification time is greater than the parameter, then the last-modification time for the file is fetched from the server. If that time is the same as the last modification time of the data in the cache, then the request is returned from the cache. Otherwise, the file has been modified so all blocks of that file present in the local cache are flushed and the read request is sent to the server. To provide tighter consistency semantics, NFS can employ locking via the NFS Lock Manager (NLM). Under this configuration, when the agent at the client detects the locking condition, it disables caching and thus forces all requests to be serviced at the server, thereby ensuring strong consistency.
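The time-based validation described above can be sketched as follows. This is a simplified illustration and not the NFS client implementation: the in-memory server, the class names and the explicit `now` argument are hypothetical. The sketch also assumes that the "configurable parameter" is a freshness interval measured from the last time the cached file was validated.

```python
class Server:
    """Hypothetical in-memory file server: path -> (mtime, list of blocks)."""

    def __init__(self):
        self.files = {}
        self.reads = 0

    def write(self, path, blocks, mtime):
        self.files[path] = (mtime, list(blocks))

    def get_mtime(self, path):
        return self.files[path][0]

    def read_block(self, path, index):
        self.reads += 1
        return self.files[path][1][index]


class ClientCache:
    def __init__(self, server, freshness_s):
        self.server = server
        self.freshness_s = freshness_s   # assumed meaning of the parameter
        self.files = {}                  # path -> mtime, validated_at, blocks

    def read_block(self, path, index, now):
        entry = self.files.get(path)
        if entry is not None and now - entry["validated_at"] >= self.freshness_s:
            if self.server.get_mtime(path) == entry["mtime"]:
                entry["validated_at"] = now   # unchanged on the server
            else:
                entry = None                  # modified: flush all cached blocks
        if entry is None:
            entry = {"mtime": self.server.get_mtime(path),
                     "validated_at": now, "blocks": {}}
            self.files[path] = entry
        if index not in entry["blocks"]:
            entry["blocks"][index] = self.server.read_block(path, index)
        return entry["blocks"][index]


server = Server()
server.write("/f", [b"a0", b"a1"], mtime=1)
cache = ClientCache(server, freshness_s=3)

r1 = cache.read_block("/f", 0, now=0)   # miss: read from the server
server.write("/f", [b"b0", b"b1"], mtime=2)
r2 = cache.read_block("/f", 0, now=2)   # still fresh: stale data is returned
r3 = cache.read_block("/f", 0, now=5)   # revalidated: flushed and re-read
```

The second read returns the old block even though the server copy has changed, which is the window of inconsistency that time-based validation permits.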
When blocks are not present in the local cache, NFS attempts to combat latency with the well-known “read-ahead” algorithm, which dates back to at least the early 1970's as it was employed in the Multics I/O System [Feiertag and Organick, “The Multics Input/Output System”, Third ACM Symposium on Operating System Principles, October 1971]. The read-ahead algorithm exploits the observation that clients often open files and sequentially read each block. That is, when a client accesses block k, it is likely in the future to access block k+1. In read-ahead, a process or agent fetches blocks ahead of the client's request and stores those blocks in the cache in anticipation of the client's forthcoming request. In this fashion, NFS can mask the latency of fetching blocks from a server when the read-ahead turns out to successfully predict the client read patterns. Read-ahead is widely deployed in modern file systems.
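A minimal sketch of read-ahead follows. The class name, the prefetch depth of two blocks and the fetch function are assumptions for illustration. A real client would issue the prefetches asynchronously, whereas this sketch performs them inline and only counts how often the caller would have had to wait.

```python
class ReadAheadReader:
    def __init__(self, fetch_block, num_blocks, depth=2):
        self.fetch_block = fetch_block
        self.num_blocks = num_blocks
        self.depth = depth        # how many blocks to fetch ahead (assumed)
        self.cache = {}
        self.stalls = 0           # reads that had to wait for the server

    def read(self, k):
        if k not in self.cache:
            self.stalls += 1
            self.cache[k] = self.fetch_block(k)
        # Anticipate sequential access: fetch blocks k+1 .. k+depth.
        for nxt in range(k + 1, min(k + 1 + self.depth, self.num_blocks)):
            if nxt not in self.cache:
                self.cache[nxt] = self.fetch_block(nxt)
        return self.cache[k]


fetched = []


def fetch_block(k):
    """Hypothetical stand-in for a block read across the network."""
    fetched.append(k)
    return f"block-{k}"


reader = ReadAheadReader(fetch_block, num_blocks=10, depth=2)
data = [reader.read(k) for k in range(10)]   # sequential scan of the file
```

In this sequential scan only the first read stalls. A client that jumps between distant blocks would stall on each jump, which is the case where the prediction fails.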
In the Andrew File System (AFS) [Howard, “An Overview of the Andrew File System”, In Proc. of the USENIX Winter Technical Conference, February 1988], “whole-file” caching is used instead of block-based caching. Here, when a client opens a file, an agent at the client checks to see if the file is resident in its local disk cache. If it is, it checks with the server to see if the cached file is valid (i.e., that there have not been any modifications since the file was cached). If not (or if the file was not in the cache to begin with), a new version of the file is fetched from the server and stored in the cache. All client file activity is then intercepted by the agent at the client and operations are performed on the cached copy of the file. When the client closes the file, any modifications are written back to the server. This approach provides only “close-to-open” consistency because changes by multiple clients to the same file are only serialized and written back to the server on each file close operation.
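Whole-file caching with write-back on close can be sketched as below. This is an illustration of the close-to-open behavior only, not the AFS implementation: the server and client classes, the integer version counter used for validation, and the file name are hypothetical.

```python
class FileServer:
    """Hypothetical server holding name -> (version, contents)."""

    def __init__(self):
        self.files = {}

    def store(self, name, data):
        version = self.files.get(name, (0, b""))[0] + 1
        self.files[name] = (version, bytes(data))

    def version(self, name):
        return self.files[name][0]

    def fetch(self, name):
        return self.files[name]


class WholeFileClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}        # name -> (version, contents when fetched)
        self.open_files = {}   # name -> bytearray working copy

    def open(self, name):
        cached = self.cache.get(name)
        if cached is None or cached[0] != self.server.version(name):
            self.cache[name] = self.server.fetch(name)   # fetch the whole file
        self.open_files[name] = bytearray(self.cache[name][1])
        return self.open_files[name]

    def close(self, name):
        working = self.open_files.pop(name)
        if bytes(working) != self.cache[name][1]:
            self.server.store(name, working)             # write back on close
            self.cache[name] = self.server.fetch(name)


server = FileServer()
server.store("doc", b"v1")
a, b = WholeFileClient(server), WholeFileClient(server)

copy_a = a.open("doc")
copy_b = b.open("doc")
copy_a[:] = b"edited by A"
a.close("doc")                  # A's changes reach the server only now
stale = bytes(copy_b)           # B's open copy is unchanged
b.close("doc")
fresh = bytes(b.open("doc"))    # B sees A's update on its next open
```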
Another mechanism called “opportunistic locking” was employed by the Server Message Block (SMB) Protocol, now called CIFS, to provide consistency. In this approach, when a file is opened the client (or client agent) can request an opportunistic lock or oplock associated with the file. If the server grants the oplock, then the client can assume no modifications will be made to the file during the time the lock is held. If another client attempts to open the file for writing (i.e., concurrent write sharing arises), then the server breaks the oplock previously granted to the first client, then grants the second client write access to the file. Given this condition, the first client is forced to send all reads to the server for the files for which it does not hold an oplock. A similar mechanism was employed in the Sprite distributed file system, where the server would notify all relevant clients when it detected concurrent write sharing [Nelson, Welch, and Ousterhout, “Caching in the Sprite Network File System”, ACM Transactions on Computer Systems, 6(1), February, 1988].
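The grant-and-break sequence can be sketched as a small state machine. This is a deliberately simplified illustration and not the SMB/CIFS wire protocol: it models a single oplock per file, and the class and method names are hypothetical.

```python
class Client:
    def __init__(self, name):
        self.name = name
        self.oplocks = set()

    def on_oplock_break(self, path):
        """Called by the server when the oplock is revoked."""
        self.oplocks.discard(path)

    def can_read_from_cache(self, path):
        return path in self.oplocks


class Server:
    def __init__(self):
        self.oplock_holder = {}   # path -> client holding the oplock

    def open(self, client, path, for_write=False, want_oplock=True):
        holder = self.oplock_holder.get(path)
        if holder is not None and holder is not client:
            if for_write:
                # Concurrent write sharing: break the first client's oplock.
                holder.on_oplock_break(path)
                del self.oplock_holder[path]
            return   # no oplock is granted while the file is shared
        if want_oplock and holder is None:
            self.oplock_holder[path] = client
            client.oplocks.add(path)


server = Server()
a, b = Client("A"), Client("B")
server.open(a, "/share/plan.doc")
held_before = a.can_read_from_cache("/share/plan.doc")
server.open(b, "/share/plan.doc", for_write=True)
held_after = a.can_read_from_cache("/share/plan.doc")
```

After the break, client A no longer holds the oplock and would have to send its reads for that file to the server.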
When consistency mechanisms are combined with network caching, a great deal of complexity arises. For example, if a data caching architecture such as that used by DNS or the Web were applied to file systems, it would have to include a consistency protocol that could manage concurrent write sharing conditions when they arise. In this model, each node, or network cache, in the system contains a cache of file data that can be accessed by different clients. The file data in the cache is indexed by file identification information, relating the image of data in the cache to the server and file it came from. Just like NFS, a cache could enhance performance in certain cases by using read-ahead to retrieve file data ahead of a client's request and storing said retrieved data in the cache. Upon detecting concurrent write sharing, such a system could force all reads and writes to be synchronized at a single caching node, thereby assuring consistency. This approach is burdened by a great deal of complexity in managing consistency across all the caches in the system. Moreover, the system's concurrency model assumes that all file activity is managed by its caches; if a client modifies data directly on the server, consistency errors could arise. Also, its ability to overcome network latency for client accesses to data that is not resident in the cache is limited to performing file-based read-ahead. For example, in NFS, a client that opens a file must look up each component of the path (once per round-trip) to ultimately locate the desired file handle and file-based read-ahead does nothing to eliminate these round-trips. Finally, the system must perform complex protocol conversions between the native protocols that the clients and servers speak and the system's internal caching protocols, effectively requiring that the system replicate the functionality of a server (to interoperate with a client) and a client (to interoperate with a server).
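The cost of per-component path lookups can be estimated with a short calculation. The path, the 100 ms round-trip time and the function names below are assumed examples, and the model counts exactly one serialized round trip per path component.

```python
def path_components(path):
    """Split a slash-separated path into its non-empty components."""
    return [part for part in path.split("/") if part]


def lookup_latency_s(path, rtt_s):
    """One lookup round trip per path component, performed serially."""
    return len(path_components(path)) * rtt_s


# Assumed example: a four-component path over a 100 ms WAN round trip.
latency = lookup_latency_s("/projects/cad/engine/block.dwg", rtt_s=0.100)
print(f"{latency:.1f} s spent before the first data block is read")
```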
A different approach to dealing with network latency when clients access data that is not in the cache is to predict file access patterns. A number of research publications describe approaches that attempt to predict the next file (or files) a client might access based on the files it is currently accessing and has accessed in the past, see [Amer et al., “File Access Prediction with Adjustable Accuracy”, In Proc. of the International Performance Conference on Computers and Communication, April 2002], [Lei and Duchamp, “An Analytical Approach to File Prefetching”, In Proc. of the 1997 Annual USENIX Conference, January 1997], [Griffioen and Appleton, “Reducing File System Latency using a Predictive Approach”, In Proc. of the 1994 Summer USENIX Conference, June 1994], [Kroeger and Long, “The Case for Efficient File Access Pattern Modeling”, In Proc. of the Seventh Workshop on Hot Topics in Operating Systems, March 1999]. Based on these prediction models, these systems pre-fetch the predicted files by reading them into a cache. Unfortunately, this approach presumes the existence of a cache and thus entails the complexities and difficulties of cache coherency.
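One very simple form of access prediction remembers, for each file, which file was opened immediately after it the last time. The sketch below illustrates that idea only. It is not claimed to be the model used in any of the publications cited above, and the class name and file names are hypothetical.

```python
class LastSuccessorPredictor:
    """Predicts that a file will be followed by whatever followed it last time."""

    def __init__(self):
        self.successor = {}   # file -> file that most recently followed it
        self.previous = None

    def record(self, name):
        if self.previous is not None:
            self.successor[self.previous] = name
        self.previous = name

    def predict(self, name):
        """Return the predicted next file, or None if there is no history."""
        return self.successor.get(name)


predictor = LastSuccessorPredictor()
for name in ["Makefile", "main.c", "util.h", "Makefile", "main.c"]:
    predictor.record(name)

guess = predictor.predict("main.c")   # util.h followed main.c last time
```

A prefetching system built on such a predictor would read the predicted file into its cache, which is where the coherency difficulties noted above come in.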
In the context of the World-wide Web, other research has applied this prediction concept to Web objects [Padmanabhan and Mogul, “Using Predictive Prefetching to Improve World Wide Web Latency”, ACM SIGCOMM, Computer Communication Review 26(3), July 1996]. In this approach, the server keeps track of client access patterns and passes this information as a hint to the client. The client in turn can choose to pre-fetch into its cache the URLs that correspond to the hinted objects. Again, this approach presumes the existence of a cache, and can be deployed without disrupting the semantics of the Web protocols only because the Web is generally read-only and does not require strong consistency.
Unfortunately, while many of the above techniques solve some aspects of WAN performance problems, they still have some shortcomings.