The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
In general, compression refers to a process of encoding data with fewer bytes or bits than an un-encoded representation of the data would include. In network communications, data compression is typically performed through use of compression mechanisms, which take the data to be encoded as input and produce an encoded, or compressed, representation of the data as output. Typically, data transmitted in a network is compressed in order to reduce the network bandwidth that is consumed during the transmission of the data, which in turn increases the network throughput.
In order to achieve better network throughput, some network communication systems may use two or more compression mechanisms in succession to compress data that is to be transmitted over a network. For example, in one network compression system, data in a message is compressed by two compression mechanisms that are run one after the other before the message is transmitted to a receiver over the network. The first compression mechanism is a data redundancy elimination mechanism that uses a chunking algorithm to break the message into one or more data chunks. The data redundancy elimination mechanism then tries to match the one or more data chunks in the message to data chunks that have previously been sent to the same receiver. Each data chunk in the message that has previously been transmitted to the receiver is replaced in the message by a chunk identifier, where the chunk identifier is a short value from which the receiver is capable of determining the data in the data chunk upon receipt of the message. Data chunks in the message that are not matched to previously transmitted data chunks are left in the message with their original data. The resulting compressed message is then further compressed by a standard compression mechanism, such as a Lempel-Ziv (LZ) compression mechanism, before being transmitted to the receiver.
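The two-stage arrangement described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular system. The fixed 64-byte chunk size, the 8-byte truncated SHA-256 chunk identifier, the "R"/"L" record tags, and all function names are assumptions made for illustration only, and the DEFLATE-based zlib module stands in for the LZ compression mechanism. For simplicity, the sketch also matches a chunk against earlier chunks of the same message.

```python
import hashlib
import zlib

CHUNK_SIZE = 64  # assumed fixed chunk size; a real chunking algorithm may differ
ID_SIZE = 8      # assumed length, in bytes, of a chunk identifier


def chunk_id(chunk: bytes) -> bytes:
    """Derive a short identifier for a chunk (hypothetical scheme)."""
    return hashlib.sha256(chunk).digest()[:ID_SIZE]


def dre_encode(message: bytes, sent: dict) -> bytes:
    """Stage 1: replace chunks already sent to this receiver with identifiers."""
    out = bytearray()
    for i in range(0, len(message), CHUNK_SIZE):
        chunk = message[i:i + CHUNK_SIZE]
        cid = chunk_id(chunk)
        if cid in sent:
            out += b"R" + cid  # reference to a previously sent chunk
        else:
            sent[cid] = chunk  # remember the chunk for later messages
            out += b"L" + len(chunk).to_bytes(2, "big") + chunk  # literal data
    return bytes(out)


def dre_decode(encoded: bytes, received: dict) -> bytes:
    """Reverse stage 1 using the receiver's store of earlier chunks."""
    out = bytearray()
    pos = 0
    while pos < len(encoded):
        tag = encoded[pos:pos + 1]
        pos += 1
        if tag == b"R":
            out += received[encoded[pos:pos + ID_SIZE]]
            pos += ID_SIZE
        else:
            n = int.from_bytes(encoded[pos:pos + 2], "big")
            pos += 2
            chunk = encoded[pos:pos + n]
            pos += n
            received[chunk_id(chunk)] = chunk
            out += chunk
    return bytes(out)


def send(message: bytes, sent: dict) -> bytes:
    """Apply stage 1, then stage 2 (zlib as a stand-in LZ mechanism)."""
    return zlib.compress(dre_encode(message, sent))


def receive(wire: bytes, received: dict) -> bytes:
    """Undo stage 2, then stage 1."""
    return dre_decode(zlib.decompress(wire), received)
```

In this sketch the sender and the receiver each keep their own chunk store (`sent` and `received`), and the two stores stay consistent only if every message is delivered in order. Note that both stages run on every message, in both directions, regardless of how much each stage actually contributes to the compression.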
One of the disadvantages of using multiple compression mechanisms to compress the same data is that applying multiple compression mechanisms to the data is computationally expensive for the computer system on which the data compression mechanisms are executed. Further, applying multiple compression mechanisms to transmitted data at the sender computer system typically requires the receiver computer system to also apply multiple decompression mechanisms to the transmitted data, which is similarly computationally expensive. In order to compress and decompress the data, the data compression and decompression mechanisms use various computer system resources, such as, for example, CPU cycles, memory, and storage space. Typically, in order to achieve higher compression, the data compression mechanisms need to use more of these computing resources, and the data decompression mechanisms similarly need to use more computing resources to decompress highly compressed data. High usage of computing resources, however, impedes the performance of the computer systems performing the compression and decompression and may cause significant latency in response times from the perspective of a user that uses the computer systems to transmit the data over a network. Thus, while usage of multiple compression mechanisms may increase the compression of transmitted data and may improve network throughput, it may also impede computer system performance and may increase the latency of user response times because of a significant usage of computing resources.
For example, the LZ compression mechanism used in the above-described network compression system is based on an algorithm that exploits the redundancy of the bytes in a message that is transmitted to the receiver. Since the LZ compression mechanism is byte-oriented, the amount of computing resources that the LZ compression mechanism consumes is dependent on the number of bytes received as an input. For example, the LZ compression mechanism would consume substantially the same amount of computing resources for some pairs of different messages that have the same number of bytes, where one of the messages in a given pair includes a significant number of redundant bytes while the other message does not. However, while applying the LZ compression mechanism to the former message would be beneficial in that it would produce a better compression for the message (and hence improve network throughput), applying the LZ compression mechanism to the latter message would result in wasted computing resources, since the latter message does not include many redundant bytes and thus cannot be significantly compressed.
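The difference in benefit between two messages of equal length can be illustrated with a short Python sketch. It uses the DEFLATE-based zlib module as a stand-in for the LZ compression mechanism. The message contents, the message length, and the helper name `compression_ratio` are assumptions chosen for illustration, and the sketch measures only the compressed size, not the computing resources consumed.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


MESSAGE_LENGTH = 100_000  # both messages have the same number of bytes

# A message with a significant number of redundant bytes.
redundant_message = b"abcd" * (MESSAGE_LENGTH // 4)

# A message with essentially no redundancy (random bytes).
random_message = os.urandom(MESSAGE_LENGTH)

redundant_ratio = compression_ratio(redundant_message)
random_ratio = compression_ratio(random_message)

print(f"redundant message ratio: {redundant_ratio:.4f}")
print(f"random message ratio:    {random_ratio:.4f}")
```

The compressor must process all of the bytes of both messages, yet only the redundant message shrinks appreciably. The random message is transmitted at essentially its original size, so the work spent compressing it yields no throughput benefit.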
Although the disadvantage of using multiple compression mechanisms to compress data is presented above with respect to a network communication system, it is noted that this disadvantage is not unique to the area of network communications. Rather, this disadvantage is common to any system that employs multiple compression mechanisms to compress and process data, such as, for example, archiving systems and backup systems.
Based on the foregoing, there is a clear need for techniques for balancing throughput and compression in systems that use multiple compression mechanisms.