1. Field of the Invention
The present invention relates to a computer program product, system, method, and data structure for generating a code alphabet of symbols to generate codewords for words used with a program.
2. Description of the Related Art
Data compression involves converting symbols, including data symbols and control symbols, in an input data stream into a compressed output data stream comprising less data than the input data stream. Control symbols are encoded into the compressed data stream and provide decoding instructions to allow decompression, and may be created by the application program which is compressing data. Examples of control symbols created by the application include an end of record control signal, a file mark, and a dataset boundary. There are other events or controls determined by the compression device, such as when to swap to a given compression scheme, and when to reset the history buffer used to perform compression in a given scheme. The compressing device may decide to create an access point at which compression begins with a reset history buffer using a particular scheme after a dataset boundary is encountered.
One type of encoding technique, Huffman coding, is a lossless compression algorithm that encodes source symbols using a variable length code table. The code table is derived from the estimated probability of occurrence of each possible source symbol value, so that source symbols that occur more frequently are assigned shorter encodings.
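By way of illustration only, the derivation of a variable length code table from symbol frequencies may be sketched as follows. The sketch is in Python and is not taken from any particular implementation; the function names build_huffman_table and encode are hypothetical, and observed symbol counts stand in for the estimated probabilities.

```python
import heapq
from collections import Counter


def build_huffman_table(data):
    """Return a dict mapping each symbol in data to a bit string.

    Assumption: observed counts are used as the probability estimate.
    """
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {next(iter(freq)): "0"}

    # Heap entries are (weight, tie_breaker, {symbol: partial_code}).
    # The unique integer tie_breaker keeps the dicts from being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)

    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codes
        # with 0 and 1 respectively.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1

    return heap[0][2]


def encode(data, table):
    """Concatenate the codeword of every symbol in data."""
    return "".join(table[s] for s in data)
```

For the input "aaaabbc", the most frequent symbol "a" receives a one-bit codeword, "b" and "c" each receive two-bit codewords, and the encoded message occupies 10 bits rather than the 56 bits of seven 8-bit characters.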
A streaming lossless data compression algorithm (SLDC) receives an input data stream of control symbols and data symbols and converts strings of consecutive bytes of data symbols into copy pointers and literal data symbols that have fewer bits than the data symbols in the input data stream. The SLDC algorithm is used to compress and decompress data in Linear Tape Open (LTO) magnetic tape cartridges. Details of the SLDC algorithm are described in the Standard ECMA-321 publication “Streaming Lossless Data Compression Algorithm - (SLDC)”, dated Jun. 1, 2001.
The SLDC algorithm outputs a literal data symbol into the compressed data stream if a data byte and the bytes following it do not match consecutive bytes in the history buffer, so that the literal data byte is output uncompressed. The SLDC algorithm outputs a copy pointer symbol into the compressed data stream to represent multiple consecutive data bytes that match the same number of consecutive data bytes in the history buffer. The copy pointer symbol includes one bit indicating that the symbol is a copy pointer, a match count field indicating the number of matching consecutive bytes, and a displacement field indicating the absolute memory address of the history buffer entry that contains the first byte of the matching consecutive bytes.
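By way of illustration only, the selection between literal data symbols and copy pointer symbols may be sketched as follows. The sketch is in Python and is a simplified model, not the ECMA-321 bit-level format: the names tokenize and detokenize are hypothetical, the constants HISTORY_SIZE, MIN_MATCH and MAX_MATCH are assumed values chosen for the sketch, symbols are represented as tuples rather than packed bits, and the history buffer is modeled as a circular buffer in which the byte at input position p occupies entry p modulo HISTORY_SIZE.

```python
HISTORY_SIZE = 1024  # assumed history buffer size for this sketch
MIN_MATCH = 2        # assumed shortest match worth a copy pointer
MAX_MATCH = 271      # assumed longest match a copy pointer can carry


def tokenize(data: bytes):
    """Greedy sketch: emit ('copy', count, displacement) or ('literal', byte)."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_start = 0, 0
        # Only bytes still held in the history buffer are candidates.
        for start in range(max(0, pos - HISTORY_SIZE), pos):
            length = 0
            while (length < MAX_MATCH
                   and pos + length < len(data)
                   and start + length < pos  # match lies wholly in history
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_start = length, start
        if best_len >= MIN_MATCH:
            # Displacement is the absolute entry index in the history buffer.
            tokens.append(("copy", best_len, best_start % HISTORY_SIZE))
            pos += best_len
        else:
            tokens.append(("literal", data[pos]))
            pos += 1
    return tokens


def detokenize(tokens) -> bytes:
    """Rebuild the data by replaying the tokens against a history buffer."""
    history = bytearray(HISTORY_SIZE)
    out = bytearray()

    def emit(byte):
        # Every output byte is also written into the history buffer.
        history[len(out) % HISTORY_SIZE] = byte
        out.append(byte)

    for token in tokens:
        if token[0] == "literal":
            emit(token[1])
        else:
            _, count, displacement = token
            for k in range(count):
                emit(history[(displacement + k) % HISTORY_SIZE])
    return bytes(out)
```

For the input b"abcabcabcx", this sketch produces three literals, two copy pointers each with a match count of 3 and a displacement of 0, and a final literal for "x"; replaying the tokens through detokenize restores the original bytes.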
There is a need in the art to provide techniques for producing code alphabets of symbols used to represent word values in a program, including, but not limited to, compression programs.