This invention relates generally to systems for digital data processing, and, more particularly, relates to highly parallel methods and apparatus for dynamic, high speed, on-line compression and decompression of digital data.
Data compression methods, most of which exploit the redundancy that is characteristic of data streams, are becoming an essential element of many high speed data communications and storage systems. In communications systems, a transmitter can compress data before transmitting it, and a receiver can decompress the data after receiving it, thus increasing the effective data rate of the communications channel. In storage applications, data can be compressed before storage and decompressed after retrieval, thereby increasing the effective capacity of the storage device.
Particular applications of data compression include magnetic and optical disk interfaces, satellite communications, computer network communications, interconnections between computers and attached devices, tape drive interfaces, and write-once media, where storage space reduction is critical.
Compression of data streams can be lossless or lossy. Lossless data compression transforms a body of data into a smaller body of data from which it is possible to exactly and uniquely recover the original data. Lossy data compression transforms a body of data into a smaller one from which an acceptable approximation of the original, as defined by a selected fidelity criterion, can be constructed.
Lossless compression is appropriate for applications in which it is unacceptable to lose even a single bit of information. These include transmission or storage of textual data, such as printed human language, programming language source code or object code, database information, numerical data, and electronic mail. Lossless compression is also used for devices such as disk controllers that must provide exact preservation and retrieval of uncorrupted data.
Lossy compression is useful in specialized applications such as the transmission and storage of digitally sampled analog data, including speech, music, images, and video. Lossy compression ratios are typically much higher than those attainable by purely lossless compression, depending upon the nature of the data and the degree of fidelity required. For example, digitized speech that has been sampled 8,000 times per second, with 8 bits per sample, typically compresses by less than a factor of 2 with any lossless algorithm. However, by first quantizing each sample, which preserves acceptable quality for many applications, compression ratios exceeding 20 to 1 can be achieved. In contrast, typical lossless compression ratios are 3 to 1 for English text, 5 to 1 for programming language source code, and 10 to 1 for spreadsheets or other easily compressible data. The difference is even greater for highly compressible sources such as video. In such cases, quantization and dithering techniques may be applied in combination with otherwise lossless compression to achieve high ratios of lossy compression.
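The speech figures above can be worked through numerically. The following is a minimal sketch of that arithmetic only; the function name compressed_rate is hypothetical, and the ratios of 2 and 20 are the bounds quoted in the text rather than measured results.

```python
# Raw bit rate of the digitized speech example: 8,000 samples/s at 8 bits each.
SAMPLE_RATE_HZ = 8_000
BITS_PER_SAMPLE = 8

raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # 64,000 bits per second


def compressed_rate(raw_bits_per_second, ratio):
    """Return the bit rate after compressing by the given ratio (e.g. 20 for 20:1)."""
    return raw_bits_per_second / ratio


# Lossless compression by less than a factor of 2 leaves more than this rate.
lossless_floor_bps = compressed_rate(raw_bps, 2)

# Lossy compression exceeding 20:1 brings the rate below this value.
lossy_ceiling_bps = compressed_rate(raw_bps, 20)

print(raw_bps, lossless_floor_bps, lossy_ceiling_bps)
```

Under these assumptions, the 64,000 bit-per-second stream stays above 32,000 bits per second with lossless coding, but falls below 3,200 bits per second once quantization is applied.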
Among the most powerful approaches to lossless data compression are textual substitution methods, in which frequently-appearing data strings are replaced by shorter indexes or pointers stored in correspondence with the data strings in a dictionary. Typically, an encoder module and a decoder module maintain identical copies of a dictionary containing data strings that have appeared in the input stream. The encoder finds matches between portions of the input stream and the previously-encountered strings stored in the dictionary. The encoder then transmits, in place of the matched string, the dictionary index or pointer corresponding to the string.
The encoder can also update the dictionary with an entry based on the current match and the current contents of the dictionary. If insufficient space is available in the dictionary, space is created by deleting strings from the dictionary.
The decoder, operating in a converse manner, receives at its input a set of indexes, retrieves each corresponding dictionary entry as a "current match", and updates its dictionary. Because the encoder and decoder work in a "lock-step" fashion to maintain identical dictionaries, no additional communication is necessary between the encoder and decoder.
Thus, the input to the encoder of an on-line textual substitution data compressor is a stream of bytes or characters, and the output is a sequence of pointers. Conversely, the input to the decoder is a stream of pointers and the output is a stream of bytes or characters.
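The lock-step behavior described above can be illustrated with a short sketch. The code below uses the well-known LZW variant of textual substitution, chosen here only for brevity; it is not the method of any particular reference discussed in this section. The function names encode and decode are hypothetical, and the sketch omits dictionary deletion, so the dictionary simply grows.

```python
def encode(data):
    """Replace matched strings in a byte stream with dictionary pointers."""
    # Both sides start from the same dictionary: one entry per single byte.
    table = {bytes([i]): i for i in range(256)}
    pointers = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate            # keep extending the current match
        else:
            pointers.append(table[current])
            table[candidate] = len(table)  # learn a new string
            current = bytes([byte])
    if current:
        pointers.append(table[current])
    return pointers


def decode(pointers):
    """Rebuild the byte stream, growing an identical dictionary on the fly."""
    if not pointers:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[pointers[0]]
    output = [previous]
    for pointer in pointers[1:]:
        if pointer in table:
            entry = table[pointer]
        elif pointer == len(table):
            # The encoder used the entry it had only just created.
            entry = previous + previous[:1]
        else:
            raise ValueError("pointer not in dictionary")
        output.append(entry)
        table[len(table)] = previous + entry[:1]  # mirror the encoder's update
        previous = entry
    return b"".join(output)


message = b"abababababab"
pointers = encode(message)
assert decode(pointers) == message
```

The decoder never receives the dictionary itself; it reconstructs each entry from the pointers alone, which is why no side channel between encoder and decoder is required.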
A common example of a textual substitution method is the publicly available COMPRESS command of the UNIX system, which implements the method developed by Lempel and Ziv. See J. Ziv, A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Vol. IT-23, No. 5, pp. 337-343, 1977; and J. Ziv, A. Lempel, "Compression of Individual Sequences Via Variable Rate Coding," IEEE Transactions on Information Theory, Vol. IT-24, No. 5, pp. 530-536, 1978.
Textual substitution methods are generally superior to conventional methods such as Huffman coding. For example, the COMPRESS command of the UNIX system easily out-performs the COMPACT command of UNIX, which is based on Huffman coding.
Further examples of data compression methods and apparatus are disclosed in the following:
U.S. Pat. No. 4,876,541.
U.S. Pat. No. 4,814,746.
S. Henriques, N. Ranganathan, "A Parallel Architecture for Data Compression," Proceedings of the IEEE Symposium on Parallel and Distributed Processing, pp. 260-266, December 1990.
R. Zito-Wolf, "A Broadcast/Reduce Architecture for High-Speed Data Compression," Proceedings of the IEEE Symposium on Parallel and Distributed Processing, pp. 174-181, December 1990.
S. Bunton, G. Borriello, "Practical Dictionary Management for Hardware Data Compression," Apr. 2, 1990.
R. Zito-Wolf, "A Systolic Architecture for Sliding-Window Data Compression," Proceedings of the IEEE Workshop on VLSI Signal Processing, pp. 339-351, 1990.
C. Thomborson, B. Wei, "Systolic Implementations of a Move-to-Front Text Compressor," Journal of the Association for Computing Machinery, pp. 283-290, 1989.
J. Storer, Data Compression, Computer Science Press, pp. 163-166, 1988.
J. Storer, "Textual Substitution Techniques for Data Compression," Combinatorial Algorithms on Words, Springer-Verlag (Apostolico and Galil, ed.), pp. 111-130, 1985.
V. Miller, M. Wegman, "Variations on a Theme by Ziv and Lempel," Combinatorial Algorithms on Words, Springer-Verlag (Apostolico and Galil, ed.), pp. 131-140, 1985.
M. Gonzalez Smith, J. Storer, "Parallel Algorithms for Data Compression," Journal of the Association for Computing Machinery, Vol. 32, No. 2, pp. 344-373, April 1985.
U.S. Pat. No. 4,876,541 discloses a data compression system including an encoder for compressing an input stream and a decoder for decompressing the compressed data. The encoder and decoder maintain dictionaries for storing strings of frequently appearing characters. Each string is stored in association with a corresponding pointer. The encoder matches portions of the input stream with strings stored in the encoder dictionary, and transmits the corresponding pointers in place of the strings, thereby providing data compression. The decoder decompresses the compressed data in a converse manner. The system utilizes selected matching, dictionary update, and deletion methods to maintain processing efficiency.
U.S. Pat. No. 4,814,746 discloses a data compression method including the steps of establishing a dictionary of strings of frequently appearing characters, determining a longest string of the input stream that matches a string in the dictionary, transmitting the pointer associated with that string in place of the string, and adding a new string to the dictionary. The new string is a concatenation or linking of a previous matched string and the current matched string. The method also includes the step of deleting a least recently used string from the dictionary if the dictionary is full.
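The steps summarized above (longest match, an update formed by concatenating the previous and current matches, and least-recently-used deletion) can be sketched as follows. This is an illustrative reading of that summary, not the patented procedure: the class LruDictionary, the capacity parameter (assumed to be at least 1), the pointer numbering, and the brute-force longest-match search are all assumptions made for the example.

```python
from collections import OrderedDict


class LruDictionary:
    """Single bytes are implicit (pointers 0-255); learned strings use 256 and up."""

    def __init__(self, capacity):
        self.strings = OrderedDict()  # learned string -> pointer, least recent first
        self.pointers = {}            # pointer -> learned string
        self.free = list(range(256 + capacity - 1, 255, -1))  # unused pointers
        self.max_len = 1              # longest string ever learned

    def pointer_of(self, s):
        return s[0] if len(s) == 1 else self.strings.get(s)

    def string_of(self, pointer):
        return bytes([pointer]) if pointer < 256 else self.pointers[pointer]

    def touch(self, s):
        if s in self.strings:
            self.strings.move_to_end(s)  # mark as most recently used

    def add(self, s):
        if len(s) < 2 or s in self.strings:
            return
        if not self.free:                # dictionary full: delete the LRU entry
            _, old = self.strings.popitem(last=False)
            del self.pointers[old]
            self.free.append(old)
        pointer = self.free.pop()
        self.strings[s] = pointer
        self.pointers[pointer] = s
        self.max_len = max(self.max_len, len(s))


def encode(data, capacity=4096):
    d, out, prev, i = LruDictionary(capacity), [], b"", 0
    while i < len(data):
        # Longest match: try the longest candidate first; one byte always matches.
        for length in range(min(d.max_len, len(data) - i), 0, -1):
            match = data[i:i + length]
            pointer = d.pointer_of(match)
            if pointer is not None:
                break
        out.append(pointer)
        d.touch(match)
        if prev:
            d.add(prev + match)  # new entry: previous match + current match
        prev, i = match, i + len(match)
    return out


def decode(pointers, capacity=4096):
    d, out, prev = LruDictionary(capacity), [], b""
    for pointer in pointers:
        match = d.string_of(pointer)
        out.append(match)
        d.touch(match)           # same touch/add sequence as the encoder
        if prev:
            d.add(prev + match)
        prev = match
    return b"".join(out)
```

Because the encoder and decoder apply the same sequence of touch and add operations to the same strings, their dictionaries, including the choice of entry to delete, remain identical even when the capacity is small.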
Henriques et al. discloses a systolic array of processors for executing sliding window data compression in accordance with the Ziv and Lempel method. Data compression is divided into parsing and coding. During parsing, the input string of symbols is split into substrings. During coding, each substring is sequentially coded into a fixed length code.
Zito-Wolf ("A Broadcast/Reduce Architecture for High-Speed Data Compression") discusses a data compression system utilizing a sliding window dictionary and a combination of a systolic array and pipelined trees for broadcast of input characters and reduction of results. In this system, for every character position of the input stream, the processor computes a pair (length, offset) identifying the longest matched string ending at that character.
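The (length, offset) pair described above can be illustrated with a sequential, brute-force sketch. The referenced architecture computes these pairs in parallel hardware; the code below only shows what is being computed. The function name, the window parameter, and the convention of reporting (0, 0) when no match exists are assumptions of this example.

```python
def longest_matches_ending_at(data, window):
    """For each position i, return (length, offset) of the longest string that
    ends at i and also ends at an earlier position i - offset in the window."""
    result = []
    for i in range(len(data)):
        best = (0, 0)  # (0, 0) means no earlier occurrence was found
        for j in range(max(0, i - window), i):
            # Compare backwards from the two end positions j and i.
            length = 0
            while length <= j and data[j - length] == data[i - length]:
                length += 1
            if length > best[0]:
                best = (length, i - j)
        result.append(best)
    return result


print(longest_matches_ending_at(b"abcabc", window=4))
# [(0, 0), (0, 0), (0, 0), (1, 3), (2, 3), (3, 3)]
```

In the printed result, the final pair (3, 3) indicates that the three characters ending at the last position repeat a string that ended three positions earlier.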
Bunton et al. discusses data compression methods utilizing a dictionary tagging technique for deleting selected entries from the data compression dictionary. The TAG dictionary management scheme disclosed therein employs a trie data structure with tagged nodes.
Zito-Wolf ("A Systolic Architecture for Sliding-Window Data Compression") discusses a systolic pipe implementation of textual substitution data compression utilizing a sliding window. The systolic-array architecture codes substrings of the input by reference to identical sequences occurring in a bounded window of preceding characters of the input, wherein the contents of the window form a dictionary of referenceable strings.
Thomborson et al. discloses systolic implementations of data compression utilizing move-to-front encoding. A move-to-front encoder maintains an ordered list of symbols. For each symbol in the input stream, the encoder finds the current ordinal position of that symbol in the list, transmits that ordinal position, and moves the symbol to the front of the list. A characteristic of this encoder is that more frequently occurring input symbols tend to be near the front of the list.
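Move-to-front encoding as described above can be sketched sequentially as follows. This is a plain software illustration over byte values, not the systolic implementation of the reference; the function names and the initial list order (byte values 0 through 255) are assumptions of the example.

```python
def mtf_encode(data):
    """Transmit each byte's current position in the list, then move it to the front."""
    symbols = list(range(256))  # assumed initial order, shared with the decoder
    positions = []
    for byte in data:
        position = symbols.index(byte)
        positions.append(position)
        symbols.pop(position)
        symbols.insert(0, byte)
    return positions


def mtf_decode(positions):
    """Invert mtf_encode by replaying the same list updates."""
    symbols = list(range(256))
    output = bytearray()
    for position in positions:
        byte = symbols.pop(position)
        output.append(byte)
        symbols.insert(0, byte)
    return bytes(output)


print(mtf_encode(b"aaab"))  # [97, 0, 0, 98]
```

Repeated symbols produce small position values, which a subsequent variable-length code can represent in fewer bits.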
Storer (1988) discusses parallel processing implementations utilizing the dynamic dictionary model of data compression. An ID heuristic is implemented for updating the dictionary, and a SWAP heuristic is executed for deleting entries from the dictionary to create space for new entries. The SWAP heuristic is implemented by doubling the storage elements and adding a controller to each end of the pipe, for switching input/output lines appropriately as the dictionaries are switched. A systolic pipe implementation utilizes a "bottom up" matching technique. In particular, as a stream of characters or pointers flows through the systolic pipe, longer matched strings are constructed from pairs of smaller matched strings. Each processor supports a FLAG bit that can be set during data compression to designate a "learning" processor in the pipe. The learning processor is the first processor along the pipe that contains a dictionary entry.
Storer (1985) discusses off-line and on-line data compression methods utilizing textual substitution in which pointers are transmitted in place of character strings. An on-line implementation described therein utilizes an encoder and decoder, each having a fixed amount of local memory forming a local dictionary. Another implementation utilizes a systolic array of processing elements utilizing a sliding dictionary to match strings in the input stream. The sliding dictionary stores the last N characters processed in the systolic pipeline, one character per processing element. The dictionary is updated by sliding old characters to the left in order to bring in new characters.
Miller et al. (1985) discusses enhancements to the data compression methods of Ziv and Lempel, including a dictionary replacement strategy that enables the use of a fixed size dictionary, methods for using a larger class of strings that can be added to the dictionary, and techniques that avoid uncompressed output.
Gonzalez-Smith et al. discloses systolic arrays for parallel implementation of data compression by textual substitution. The systems described therein implement a sliding dictionary method, in which characters being read in are compared with each of the elements of a dictionary that spans the N characters preceding the current character.
These publications accordingly disclose various systems for data compression. However, one deficiency shared by conventional data compression methods and systems, such as those described above, is the tradeoff between the benefits of data compression and the computational costs of the encoding and subsequent decoding. There exists a need for data compression systems that provide high compression ratios and high speed operation, without necessitating complex encoding and decoding modules.
It is thus an object of the invention to provide lossless data compression systems utilizing massively parallel architectures to compress and decompress data streams at high rates, and with high compression ratios.
It is a further object of the invention to provide such methods and apparatus that can be embodied in relatively simple, low-cost digital hardware.
A further object of the invention is to provide data compression methods and apparatus that dynamically adapt to changing data streams.
Other general and specific objects of the invention will in part be obvious and will in part appear hereinafter.