Compressing data streams by utilizing previously stored data is a known technique for reducing the size of data streams transmitted between two devices. In broad terms, a typical compression method entails two devices each storing copies of data that is sent between the devices. These stored copies of the data can be referred to as compression histories, as they represent a history of previously transmitted data that is then used to compress future data streams. When one of the devices is transmitting data to the other device, it searches its compression history for matches to the input data, and replaces the matched portions with references to the stored data in the transmission stream, reducing the size of the transmitted stream. The receiving device then uses the references in combination with its own compression history to reconstruct the uncompressed data stream. However, this general technique presents a number of challenges.
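The general scheme described above (matching input against a stored history, sending references, and reconstructing at the receiver) can be illustrated with a minimal sketch. It is not drawn from any particular implementation: the token format, the names `compress` and `decompress`, the `MIN_MATCH` threshold, and the naive linear search are all assumptions chosen for brevity. Both devices are assumed to hold identical copies of the history.

```python
MIN_MATCH = 4  # hypothetical minimum match length worth replacing with a reference


def compress(data: bytes, history: bytes) -> list:
    """Replace runs of `data` that appear in `history` with (offset, length) references."""
    tokens = []
    literal = bytearray()
    i = 0
    while i < len(data):
        # Naive search: locate a short seed in the history, then extend the match.
        seed = data[i:i + MIN_MATCH]
        pos = history.find(seed) if len(seed) == MIN_MATCH else -1
        if pos < 0:
            literal.append(data[i])  # no match: send this byte as-is
            i += 1
            continue
        length = MIN_MATCH
        while (i + length < len(data)
               and pos + length < len(history)
               and data[i + length] == history[pos + length]):
            length += 1
        if literal:
            tokens.append(("lit", bytes(literal)))
            literal.clear()
        tokens.append(("ref", pos, length))  # reference into the shared history
        i += length
    if literal:
        tokens.append(("lit", bytes(literal)))
    return tokens


def decompress(tokens: list, history: bytes) -> bytes:
    """Rebuild the original stream from literals and history references."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out += token[1]
        else:
            _, pos, length = token
            out += history[pos:pos + length]
    return bytes(out)
```

In this sketch, a receiver whose history differs from the sender's would silently reconstruct the wrong bytes, which is one reason the two histories must be kept in agreement.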
First, insufficiently long matches between input streams and compression histories can result in poor compression ratios, and can increase both the processing overhead and the number of times that a compression history must be accessed. These problems can be exacerbated in cases where a device is transmitting multiple data streams simultaneously, and thus may have several processes attempting to access a compression history at once. These problems may also be accentuated in devices using a compression history stored on a medium, such as a disk, with long potential access latencies. To give a concrete example, a device sending a 2K file may find 40 matching references scattered across its compression history, each reference matching a different 50 bytes of the file. This may require 40 separate iterations of a potentially complex matching algorithm, and 40 separate disk accesses to a compression history. By contrast, if a device finds a single matching reference for the entire 2K file, only a single disk access may be needed. Thus, there is a need for systems and methods for efficiently locating long matches between an input stream and a compression history.
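The arithmetic in the example above can be made explicit with a small cost model. This is only a sketch under simplifying assumptions: every match is taken to have the same length, each match is taken to cost exactly one history access, and the "2K file" is treated as 2000 bytes so that it agrees with 40 matches of 50 bytes. The function name `history_accesses` is hypothetical.

```python
import math


def history_accesses(file_size: int, match_length: int) -> int:
    """Number of history lookups needed if every match covers `match_length` bytes."""
    return math.ceil(file_size / match_length)


short_matches = history_accesses(2000, 50)    # many short matches
single_match = history_accesses(2000, 2000)   # one match covering the whole file
```

Under these assumptions the number of accesses falls in proportion to the match length, so any per-access latency is multiplied by 40 in the short-match case.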
Second, when one device has sequences in its compression history that are not in a corresponding compression history on another device, inefficiencies may result. The device may replace portions of data streams with references to the sequences, and then be forced to retransmit the data stream as it discovers the other device does not have the referenced sequences. Further, the unshared sequences may occupy space in a compression history that could be used for other data. A number of methods may be used to synchronize compression histories with respect to data currently being transmitted between two devices. For example, each device may transmit information corresponding to the total number of bytes transmitted, received, and stored, as well as location identifiers identifying where the data has been stored. However, even if the compression histories are synchronized immediately following transmission of data, a number of events may cause the compression histories to subsequently diverge. For example, one device may run out of storage and be forced to overwrite one or more previously stored portions. Or one device may have a disk error or other hardware or software glitch which corrupts or removes one or more previously stored portions. Thus, there exists a need for improved systems and methods for efficiently synchronizing shared compression histories.
Third, in many implementations, compression histories and caching only provide benefits if the same data is repeatedly sent between the same two devices. This can be especially problematic in situations where two sites, each having a cluster of devices, may repeatedly communicate similar information, but there is no guarantee the information will pass through the same pair of devices. For example, two sites may each maintain a cluster of devices to accelerate communications between the sites. Cluster 1 may contain the devices A, B, and C, and cluster 2 may contain the devices X, Y, and Z. Devices A and Z may each maintain a compression history of a file sent between A and Z, but the next time the file is requested the request and response may pass through devices A and Y. Similarly, the time after that, the request and response may pass through devices B and X. In neither case can the history shared by A and Z be used. One potential solution is to organize the device clusters in a hierarchy so that all requests to a given cluster, network, or region pass through a gateway device. However, this solution may involve additional configuration and create network bottlenecks. Thus, there exists a need for leveraging data previously transmitted between two devices to compress data streams transmitted between devices other than the original transmitters, without necessarily requiring explicit hierarchies.