1. Field of Invention
The present invention pertains generally to data processing and more particularly to data compression.
2. Description of the Background
Various methods of compressing data have been developed over the past few years. Because of the increased use of computer systems, requirements for storage of data have consistently increased. Consequently, it has been desirable to compress data for the purpose of speeding both transmission and storage of the data. Additionally, data compression reduces the physical space required to store data.
Of the data compression techniques developed in the prior art, two techniques appear to be of particular importance to the present invention. The first technique is known as run length encoding, wherein a series of repetitive data symbols is compressed into a shorter code which indicates the length of the run and the data being repeated. A large number of different ways of run length encoding have been developed. However, most methods require buffering of data to look ahead in the data stream to determine when a run (i.e., a series of repetitive characters) exists.
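By way of illustration only, run length encoding of the kind described above may be sketched as follows. The function names and the (count, symbol) output format are assumptions made for this example and are not drawn from any cited reference. The sketch operates on a complete in-memory string, so it does not address the buffering concern noted above.

```python
def run_length_encode(data: str) -> list[tuple[int, str]]:
    """Collapse each run of identical symbols into a (count, symbol) pair."""
    runs: list[tuple[int, str]] = []
    for symbol in data:
        if runs and runs[-1][1] == symbol:
            # Same symbol as the current run: extend the run by one.
            count, _ = runs[-1]
            runs[-1] = (count + 1, symbol)
        else:
            # A different symbol starts a new run of length one.
            runs.append((1, symbol))
    return runs


def run_length_decode(runs: list[tuple[int, str]]) -> str:
    """Expand (count, symbol) pairs back into the original string."""
    return "".join(symbol * count for count, symbol in runs)


encoded = run_length_encode("AAAABBBCCD")
print(encoded)  # [(4, 'A'), (3, 'B'), (2, 'C'), (1, 'D')]
print(run_length_decode(encoded))  # AAAABBBCCD
```

Note that data containing no runs yields one pair per symbol, so this naive format can expand such data rather than compress it.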
The second technique is statistical encoding, which comprises techniques for characterizing data according to its statistical probability of occurrence. Data with a higher probability of occurrence is encoded with a shorter code than data having a lesser probability of occurrence. For example, the American National Standard Code for Information Interchange (ASCII) and the Extended Binary Coded Decimal Interchange Code (EBCDIC) comprise standard formatting schemes in which numbers, letters, punctuation, carriage control statements and other data are assigned various hexadecimal positions in a data formatting scheme using 8-bit bytes. These alphanumeric symbols, which are assigned different positions depending upon the standard used, have differing probabilities of occurrence. Since a "space" or an "e" has a much higher probability of occurrence than a "y" or a "z" or other nonfrequently occurring hexadecimal numbers, the "space" or "e" is encoded into a code of a lesser number of bits, e.g., 3 or 4 bits, rather than the standard 8 bit per byte code for these alphanumeric symbols. On the other hand, alphanumeric symbols such as "y" and "z" that have a much lower probability of occurrence are encoded into a code having more bits than the standard 8 bit byte code used in ASCII and EBCDIC standards, e.g., "y" and "z" may have 11 bits.
This concept of statistical encoding was first introduced by David A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, Volume 40, Pages 1098-1101; September, 1952. This article describes a method of obtaining maximum entropy for a given database by examining the probability of occurrence of data in the database.
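For purposes of illustration, the general idea of building a minimum-redundancy code from symbol frequencies may be sketched as follows. This is a minimal sketch of the widely known Huffman construction using a priority queue; the function name, the tie-breaking rule and the sample text are assumptions made for this example and are not taken from either cited article.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict[str, str]:
    """Return a {symbol: bit string} table built from symbol frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code word.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, partial code table). The unique
    # integer tie-breaker ensures the dictionaries are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least probable groups, prefixing one bit to each.
        w0, _, codes0 = heapq.heappop(heap)
        w1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


sample = "e" * 8 + "t" * 4 + "a" * 2 + "z" + "y"
codes = huffman_code(sample)
for symbol in sorted(codes, key=lambda s: len(codes[s])):
    print(repr(symbol), codes[symbol])
```

In this sample the most frequent symbol, "e", receives a 1-bit code word while the rare "y" and "z" each receive 4-bit code words, so the 16 symbols occupy 30 bits in total.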
Huffman statistical encoding techniques are also disclosed by George Grosskopf, Jr. "Generating Huffman Codes," Computer Design, June 1983, pages 137-140. Both of these citations are specifically incorporated herein by reference for all that they disclose.
The "Huffman Code" generated as a result of the statistical encoding employed is a code which can be uniquely identified as it is read in a serial fashion. In other words, the encoded data is uniquely arranged so that no ambiguity exists in identifying a particular encoded word as the bits of the code are read in a serial fashion. Consequently, flagging signals and other extraneous data are not required in the encoded database.
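The serial, unambiguous reading described above may be illustrated with the following sketch. The four-entry code table is hypothetical and chosen only so that no code word is the beginning of another; the function names are likewise assumptions for this example.

```python
# Hypothetical prefix-free table: no code word is a prefix of another.
CODE_TABLE = {" ": "0", "e": "10", "t": "110", "z": "111"}


def encode(text: str) -> str:
    """Concatenate code words with no separators or flag bits between them."""
    return "".join(CODE_TABLE[ch] for ch in text)


def decode(bits: str) -> str:
    """Read bits serially, emitting a symbol as soon as a code word matches."""
    inverse = {code: symbol for symbol, code in CODE_TABLE.items()}
    symbols = []
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(symbols)


bits = encode("tez e")
print(bits)          # 11010111010
print(decode(bits))  # tez e
```

Because a match can never be the start of a longer code word, the decoder needs no delimiters between code words.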
A problem with the Huffman statistical encoding technique is that the statistical probability of occurrence of particular alphanumeric symbols in any database will be different depending upon the data in the database, the formatting technique used (e.g., ASCII, EBCDIC, or other formatting technique), the nature of the database and various other factors. Several techniques have been used to overcome this problem. For example, one technique which has been used is to study the particular database to be encoded and generate a statistical encoding table for each particular database. The disadvantage of this technique is that the database must be read and studied prior to statistical encoding and cannot, therefore, be encoded as the data is received for the first time.
Another technique which has been used is to study large quantities of data to produce a statistical encoding table which is generally applicable to most databases. Although compression of data can be achieved to some extent, in many cases the data is expanded because the particular database does not match the statistical probability set forth in the generic table used to encode the data. Additionally, maximum compression and maximum entropy of the data encoded are not achieved with this sort of generic table.
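The expansion effect noted above may be illustrated with simple arithmetic. The sketch below assumes a hypothetical generic table whose code lengths follow the example figures given earlier (3 or 4 bits for frequent symbols, 11 bits for "y" and "z"); the table contents and function names are illustrative assumptions only.

```python
# Hypothetical code lengths, in bits, for a generic statistical table.
GENERIC_CODE_BITS = {" ": 3, "e": 3, "t": 4, "a": 4, "y": 11, "z": 11}
FIXED_BITS_PER_SYMBOL = 8  # standard 8-bit byte per symbol


def statistical_size(text: str) -> int:
    """Total bits when each symbol uses its generic-table code length."""
    return sum(GENERIC_CODE_BITS[ch] for ch in text)


def fixed_size(text: str) -> int:
    """Total bits when each symbol occupies one 8-bit byte."""
    return FIXED_BITS_PER_SYMBOL * len(text)


for text in ("eat tea", "zyzzy"):
    print(repr(text), statistical_size(text), "bits vs", fixed_size(text), "bits")
# 'eat tea' 25 bits vs 56 bits  (data matches the table: compressed)
# 'zyzzy' 55 bits vs 40 bits    (data does not match the table: expanded)
```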
A pre-examination search was performed for the present invention. Several references, set forth below, were uncovered which have particular pertinence to the present invention:
______________________________________
U.S. Pat. No.   Inventor           Date
______________________________________
3,587,088       Franaszek          Jun. 22, 1971
4,420,771       Pirsch             Dec. 13, 1983
4,316,222       Subramaniam        Feb. 16, 1982
4,494,150       Brickman et al.    Jan. 15, 1985
3,394,352       Wernikoff et al.   Jul. 23, 1968
______________________________________
The Franaszek patent discloses a multilevel pulse transmission system which employs codes having three or more alphabets. In accordance with the Franaszek disclosure, a binary pulse signal is converted for transmission into a pulse signal having n possible levels in accordance with a code having three or more alphabets. The input signal is divided into 4-bit words and converted to a multilevel signal using the first alphabet. The DC value of the multilevel signal is then measured. The DC sum value constitutes the average value of the data. If the DC sum value is equal to 1, the data is transmitted as encoded in the first alphabet. If the DC sum value is 4, the binary data is converted to the second alphabet. If the DC sum value of the first alphabet is 2 or 3, the binary data is converted to the third alphabet.
Although the Franaszek reference uses multiple tables for encoding, Franaszek requires data to first be encoded with a first alphabet to determine the proper alphabet to use for encoding. In other words, each byte must first be studied to determine its DC sum value prior to selecting the proper alphabet to be used for encoding.
The Pirsch patent discloses a run length encoding technique for multilevel signals. The Pirsch technique is particularly well-suited for video encoder applications wherein error values are produced based upon a picture element predictive technique. The frequently occurring values comprise a zero error signal. Pirsch divides the input data into 9-bit words and then divides these 9-bit words into two groups comprising frequently occurring signals and nonfrequently occurring signals. Frequently occurring signals comprise 9 zero bits. Nonfrequently occurring signals comprise anything other than 9 zero bits. Pirsch then determines the number of times the frequently occurring and nonfrequently occurring signals are produced to provide a run length signal. Statistical encoding techniques are also used to encode the run length number, both for frequently occurring signals and for nonfrequently occurring signals.
The Pirsch patent uses statistical encoding of run length encoded data and uses statistical encoding with two different tables depending upon whether the data consists of frequently or nonfrequently occurring data. As in Franaszek, the presently occurring data is analyzed to perform grouping into frequently and nonfrequently occurring data. Consequently, Pirsch studies and analyzes the data, as does Franaszek, prior to statistically encoding the data. The process of studying data requires extra hardware implementation and is time-consuming because of the decision process which must take place during the statistical encoding process.
The Subramaniam patent discloses compression and decompression of digital image data using run length encoding and Huffman statistical encoding. The data is grouped into WB and BW runs. Symbols are generated indicating the length of each of the runs. The symbols are then statistically encoded using statistical data stored in a PROM. The binary data of the symbol constitutes an address in the PROM which stores the statistical data. A special symbol is generated to indicate a change from a WB to a BW run, and vice versa.
Subramaniam is similar to Franaszek and Pirsch in that the data is studied and grouped into WB and BW runs prior to statistical encoding. Again, this is a slow process and requires additional hardware implementation.
The Brickman et al. patent discloses methods of compressing data for text processing. Brickman discloses a system wherein each word received is compared with a word library. If the word is found in the library, only the word address is transmitted. If the word is not found, it is added to the library.
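The word library approach summarized above may be sketched as follows, for illustration only. The summary does not state what is transmitted when a word is not found, so this sketch assumes the literal word is sent; the tuple format, the sequential address assignment and the function name are also assumptions and are not drawn from the Brickman et al. disclosure.

```python
def compress_words(words: list[str]) -> list[tuple[str, object]]:
    """Replace words already in the library with their library address."""
    library: dict[str, int] = {}
    output: list[tuple[str, object]] = []
    for word in words:
        if word in library:
            # Word found in the library: transmit only its address.
            output.append(("addr", library[word]))
        else:
            # Word not found: add it to the library.
            # ASSUMPTION: the literal word itself is transmitted this time.
            library[word] = len(library)
            output.append(("literal", word))
    return output


print(compress_words("the cat saw the cat".split()))
# [('literal', 'the'), ('literal', 'cat'), ('literal', 'saw'), ('addr', 0), ('addr', 1)]
```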
The Wernikoff patent discloses a data compression technique wherein data words are encoded by a plurality of encoders. The Wernikoff system then determines the encoder that provides the most compression of the signal to be transmitted. Tagging symbols are transmitted to identify the type of encoding used. This technique is implemented in a facsimile transmission run length encoding scheme.
Wernikoff et al. requires studying of the data to determine which table has produced maximum compression. Additionally, Wernikoff requires the use of tagging symbols as part of the data to indicate the encoding table used so that the data can be decoded.
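The general scheme of encoding with several encoders, keeping the most compact result and tagging it, may be sketched as follows. The two encoders, the single-character tags "R" and "L", and the function names are hypothetical choices for this example and do not represent the encoders of the Wernikoff et al. disclosure.

```python
from itertools import groupby


def encode_raw(data: str) -> str:
    """Pass the data through unchanged."""
    return data


def encode_rle(data: str) -> str:
    """Write each run as its length followed by the repeated symbol."""
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(data))


# Hypothetical one-character tags identifying each encoder.
ENCODERS = {"R": encode_raw, "L": encode_rle}


def encode_best(data: str) -> str:
    """Try every encoder, keep the shortest payload, and prepend its tag."""
    candidates = [(tag, encoder(data)) for tag, encoder in ENCODERS.items()]
    tag, payload = min(candidates, key=lambda pair: len(pair[1]))
    return tag + payload


print(encode_best("AAAAAAAABBBB"))  # L8A4B
print(encode_best("ABCD"))          # RABCD
```

The sketch shows both costs noted above: every encoder must be run over the data before a choice can be made, and the tag adds one symbol of overhead to every block.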
Consequently, the prior art has failed to show a system for compression of data using both run length encoding and statistical encoding which minimizes implementation of hardware, maximizes compression and does not require analysis of the current data to determine the statistical encoding technique to be used to statistically encode the data.