The present invention relates to data processing, and more specifically, to data compression.
A data center of an enterprise may include numerous processing elements, data storage devices, network adapters, and other computational resources coupled to one or more internal and/or external data networks. The resources of the data center can be utilized to service many different types of workloads, including customer workloads, which may originate from clients of the enterprise, as well as organizational workloads, which support the business processes of the enterprise. Frequently, the processing of customer and organizational workloads requires the communication of a substantial volume of data and messages across the internal and/or external data networks of the data center, for example, to or from processing elements and/or data storage devices.
In data center environments, and more generally, in many data processing environments, network bandwidth is a scarce resource that limits the amount of useful work that can be performed utilizing the resources of the data processing environment. Consequently, a variety of techniques have been developed to reduce the bandwidth and storage required to communicate and/or store messages and/or data files.
These techniques include data compression, which represents data (e.g., a message or data file) in a more compact form than its original uncompressed form. Data compression techniques can be broadly classified as either lossy or lossless, depending on whether the original data can be decoded from the compressed data without any data loss. Although lossy compression can often achieve a greater compression ratio for certain types of data, the inherent loss of data generally limits its application to multimedia images, video, audio, and other data types for which such data loss is acceptable. For other data types, such as data files, executable files and application messages, such data loss is often unacceptable, and lossless compression techniques are therefore commonly employed. Common lossless compression techniques include run length encoding (RLE), arithmetic encoding, Huffman coding, dictionary-based encoding including Lempel-Ziv encoding and its variants (e.g., LZ77, LZ78, LZW (Lempel-Ziv-Welch), etc.), and delta encoding.
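By way of illustration only, the lossless property described above can be sketched with run length encoding, the simplest of the listed techniques. The function names `rle_encode` and `rle_decode` and the sample string are assumptions introduced for this example and do not appear elsewhere in this description; the sketch is not intended as a definitive implementation.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(data: str) -> List[Tuple[str, int]]:
    """Collapse each run of identical characters into a (character, run length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(data)]


def rle_decode(pairs: List[Tuple[str, int]]) -> str:
    """Expand each (character, run length) pair back into the run it represents."""
    return "".join(ch * count for ch, count in pairs)


original = "AAAABBBCCD"  # hypothetical sample data
encoded = rle_encode(original)

# Lossless: decoding the compressed form recovers the original exactly.
assert rle_decode(encoded) == original
```

In this sketch ten characters are represented by four pairs. Data containing few repeated runs would not shrink under this scheme, which is one reason the other listed techniques exist.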
Delta encoding expresses data as differences between reference data and the data to be encoded. The differences between the reference data and the data to be encoded can then be stored or transmitted in lieu of the data to be encoded, where such differences are commonly referred to as “diffs” based on the name of the Unix® file comparison utility diff. Like the diff file comparison utility, delta encoding techniques are commonly based on detection of the longest common subsequence between the reference data and the data to be encoded. The term “longest common subsequence,” which refers to commonality between sequential portions of a dataset and reference data regardless of whether the matching portions are consecutive, should not be confused with the similar term “longest common substring,” which refers to commonality between consecutive sequential portions of a dataset and reference data. Thus, a “substring” of a string is always a subsequence of the string, but a “subsequence” of the string is not always a substring of the string.
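As a minimal sketch of the distinction drawn above, the following example computes both the longest common subsequence and the longest common substring of two strings using textbook dynamic programming, and then derives a simple delta from the subsequence alignment. All function names, the operation labels "keep", "delete", and "insert", and the sample strings are assumptions introduced for illustration; the sketch does not purport to reproduce the algorithm of the diff utility or of any particular delta encoder.

```python
def lcs_table(a, b):
    """table[i][j] holds the length of the longest common subsequence of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def longest_common_subsequence(a, b):
    """Matching characters must appear in order but need not be consecutive."""
    table, out, i, j = lcs_table(a, b), [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def longest_common_substring(a, b):
    """Matching characters must be consecutive in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def _append(ops, kind, value):
    """Merge an operation into the previous one when both are of the same kind."""
    if ops and ops[-1][0] == kind:
        ops[-1] = (kind, ops[-1][1] + value)
    else:
        ops.append((kind, value))


def encode_delta(reference, target):
    """Describe target as keep/delete/insert operations relative to reference."""
    table, ops, i, j = lcs_table(reference, target), [], 0, 0
    while i < len(reference) or j < len(target):
        if i < len(reference) and j < len(target) and reference[i] == target[j]:
            _append(ops, "keep", 1)
            i, j = i + 1, j + 1
        elif j < len(target) and (i == len(reference) or table[i][j + 1] >= table[i + 1][j]):
            _append(ops, "insert", target[j])
            j += 1
        else:
            _append(ops, "delete", 1)
            i += 1
    return ops


def apply_delta(reference, ops):
    """Rebuild the target from the reference data and the delta alone."""
    out, pos = [], 0
    for kind, value in ops:
        if kind == "keep":
            out.append(reference[pos:pos + value])
            pos += value
        elif kind == "delete":
            pos += value
        else:
            out.append(value)
    return "".join(out)


reference = "the quick brown fox"  # hypothetical reference data
target = "the quick red fox"       # hypothetical data to be encoded

print(repr(longest_common_subsequence(reference, target)))  # 'the quick r fox'
print(repr(longest_common_substring(reference, target)))    # 'the quick '
assert apply_delta(reference, encode_delta(reference, target)) == target
```

In this example the subsequence spans non-adjacent matches on both sides of the changed word, whereas the substring is limited to the single longest consecutive match. Only the inserted characters need to be carried in the delta; retained and deleted portions are represented by counts against the reference data.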