A data stream, such as one that represents an image, a video, a text document, a spreadsheet, etc., is an ordered sequence of data that can require a significant amount of space to store. A compression algorithm may be used to produce a compressed version of a data stream. The compressed version usually takes up less space than the uncompressed data stream, and may be used to recreate the original data stream, either with some loss of data (lossy compression) or with no loss of data (lossless compression).
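As a minimal illustration of lossless compression, the following sketch uses Python's standard `zlib` module to compress a byte stream and recreate it exactly. The sample data is a made-up placeholder chosen to be highly repetitive; real data streams will compress to different degrees.

```python
import zlib

# Hypothetical sample stream: a short CSV-like record repeated many times.
original = b"id,name,value\n" * 1000

# Compress the stream, then recreate the original from the compressed version.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # compressed size is much smaller here
print(restored == original)            # lossless: the round trip is exact
```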
Some compression algorithms, such as Snappy and GZip, are based on the Lempel-Ziv family of compression algorithms. The Lempel-Ziv search traverses a stream of data to be compressed and identifies first instances of unique sequences of data (called literals). Often, a compression application that implements a compression algorithm uses a history buffer to store a limited amount of the data stream to be compressed, and the identification of literals is based on the content of the history buffer. Literals are generally an arbitrary number of bytes, where the number of bytes is selected to optimize identification of matches within the history buffer.
A compression application uses these sequences of data to compress the data stream: it assigns numerical representations (called references) to the literals and then, in the resulting compressed data stream, represents repeated instances of the sequences using those references. Often, a reference requires less storage than the literal it represents.
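The literal-and-reference scheme described above can be sketched as a toy LZ77-style compressor. This is an illustrative sketch only, not the actual implementation used by Snappy or GZip. The token layout, the names `compress` and `decompress`, and the constants `WINDOW` and `MIN_MATCH` are all assumptions made for this example, and the brute-force search stands in for the hash-based matching that practical compressors typically use.

```python
from typing import List, Tuple, Union

# Hypothetical token layout for this sketch:
#   ("lit", raw_bytes)          first occurrence of a sequence
#   ("ref", distance, length)   repeat of bytes found `distance` back
Token = Union[Tuple[str, bytes], Tuple[str, int, int]]

WINDOW = 4096   # assumed history-buffer size, in bytes
MIN_MATCH = 3   # assumed shortest match worth encoding as a reference


def compress(data: bytes) -> List[Token]:
    tokens: List[Token] = []
    pending = bytearray()  # literal bytes not yet emitted
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the history buffer for the longest match.
        for cand in range(max(0, pos - WINDOW), pos):
            length = 0
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= MIN_MATCH:
            if pending:
                tokens.append(("lit", bytes(pending)))
                pending.clear()
            tokens.append(("ref", best_dist, best_len))
            pos += best_len
        else:
            pending.append(data[pos])
            pos += 1
    if pending:
        tokens.append(("lit", bytes(pending)))
    return tokens


def decompress(tokens: List[Token]) -> bytes:
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out += token[1]
        else:
            _, dist, length = token
            # Copy byte by byte so a reference may overlap its own output.
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)


tokens = compress(b"abcabcabcabc")
print(tokens)  # [('lit', b'abc'), ('ref', 3, 9)]
print(decompress(tokens) == b"abcabcabcabc")  # True
```

In this example the first `abc` is stored as a literal, and the nine bytes that follow are replaced by a single reference pointing three bytes back into the history buffer.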
Compression is generally invoked and performed separately for each data stream, with each invocation having no information about other compression invocations. Separate invocation is generally performed even when the data streams being compressed are related, i.e., differing in their actual contents, but representing the same or very similar substantive data. For example, a typical enterprise data processing system often includes many different applications and processes that work on the same data. These different applications utilize and/or produce different data stream formats that are optimized for use with the respective applications. To illustrate, for optimal performance of relational database management system (RDBMS) processing, the data being processed should be in RDBMS native formats. Further, analytics processing in a Big Data system, such as a Hadoop environment, might mandate binary encodings, and ad-hoc queries might require textual representations. Thus, utilization of multiple data analytics systems on a single set of data often requires multiple formats of the same logical data to be physically materialized in storage.
It is usually not practicable to simply derive a given data stream from a different, but related, data stream because conversion algorithms can be computationally expensive and generally demand a full data stream read. Thus, when particular data is required to be stored in different encoded formats, the different data streams representing the particular data with different respective encodings are generally compressed and stored independently. Nevertheless, it would be beneficial to leverage similar information within multiple related data streams to reduce the overall size of the compressed versions of the related data streams.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.