1. Field of the Invention
This invention relates to data compression, applicable to both data communications and data storage. Considerable efforts have been directed to the compression of data for transmission purposes because of the restrictions imposed on data communications by channel bandwidth limitations and a finite number of channels. Most efforts to increase storage capacities have been directed toward devising new ways of recording more data on the selected media, i.e., increasing the number of tracks and the data density. Relatively less attention has been given to the use of data compression to reduce the amount of space required to store data. The scheme for compressing data described herein is equally applicable to data transmission and to data storage.
The invention relates particularly to bit-serial data compression schemes and to improving the compression ratio, i.e., the number of bits needed to transmit a given message divided by the total number of bits in the given message, by incorporating a repeat-character mode where appropriate.
2. Description of Related Art
Information Theory provides a measure of the amount of information in a message. The minimum number of characters, m, required to send a message of n characters is the sum of the products of the number of occurrences of each character times the logarithm to the base k of the inverse probability of the occurrence of such character, where k is the number of different characters. For binary characters, k is equal to 2 and
m = n(1) log(1/P[1]) + n(0) log(1/P[0])   (1)
where
m = the minimum number of bits required,
n(1) = number of "1" bits in the message,
n(0) = number of "0" bits in the message,
P[1] = probability of occurrence of "1" bit, and
P[0] = probability of occurrence of "0" bit.
The total number of bits in the message, i.e., the bit stream to be transmitted or stored, is
n(t) = n(1) + n(0)
so that
P[1] = n(1)/n(t) and P[0] = n(0)/n(t).
Logarithms in equation (1) are taken to the base 2. (The logarithm of the inverse of a probability is usually taken since probabilities are fractions which have negative logarithms. The negative logarithm of a number is the logarithm of its inverse.)
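As an illustration only, equation (1) can be evaluated directly from the bit counts of a message. The sketch below assumes the message is given as a string of "0" and "1" characters; the function name min_bits is hypothetical and does not appear in the specification.

```python
import math

def min_bits(bits: str) -> float:
    """Evaluate equation (1) for a message given as a string of '0'/'1'."""
    n1 = bits.count("1")          # n(1): number of "1" bits
    n0 = bits.count("0")          # n(0): number of "0" bits
    nt = n1 + n0                  # n(t) = n(1) + n(0)
    m = 0.0
    for n in (n1, n0):
        if n:                     # a symbol that never occurs contributes nothing
            # 1/P = n(t)/n, so each term is n * log2(n(t)/n)
            m += n * math.log2(nt / n)
    return m
```

For a message with equal numbers of "1" and "0" bits, such as "01010101", the result equals the message length, so no compression is possible on this measure. For a skewed message such as "0001", the result is roughly 3.25 bits against 4 bits in the message.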
The informational content (entropy) can always be calculated knowing the entire message. Compression using Huffman codes is based on a priori knowledge of the information parameters. For example, the telegraphic Morse code is closely correlated to a Huffman code of the English language. That is, the shorter codes are assigned to the more frequently occurring letters of the alphabet. A Huffman code using binary digits (bits) can also be assigned to the letters of the English language. For example,
______________________________________
11     = E
10     = T
01     = O
001    = A
0001   = N
000010 = I
000011 = R
. . .
______________________________________
A basic requirement of a Huffman-type code is that each encoding of a character be distinct so that it can be uniquely decoded without delimiters between characters.
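The unique-decodability requirement can be illustrated with the seven codes listed in the example above. The following is a minimal sketch, not part of the specification: the names CODE, encode, and decode are hypothetical, and only the letters actually listed are supported because the remainder of the table is elided.

```python
# Codes copied from the example table; the rest of the alphabet is elided there.
CODE = {
    "11": "E", "10": "T", "01": "O", "001": "A",
    "0001": "N", "000010": "I", "000011": "R",
}
LETTER_TO_CODE = {letter: code for code, letter in CODE.items()}

def encode(text: str) -> str:
    """Concatenate the code of each letter with no delimiters."""
    return "".join(LETTER_TO_CODE[ch] for ch in text)

def decode(bits: str) -> str:
    """Read bits one at a time and emit a letter as soon as a code matches.

    This works because no code in the table is a prefix of another.
    """
    out = []
    buf = ""
    for b in bits:
        buf += b
        if buf in CODE:
            out.append(CODE[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)
```

With this table, encode("TEN") gives "10110001", and decode recovers "TEN" from that string without any separators between the characters.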
There are, of course, many possible encoding schemes but they depend on a priori knowledge of the information such as, in the above example, the letter frequency of the English language. Such encoding schemes, also referred to as data compression systems, use the entire original message to extract the informational parameters for assigning an efficient code. For long messages, the procedure of using the total message to derive the encoding parameters is undesirable for many apparent reasons.
U.S. Pat. No. 4,633,490 describes compression and decompression using symmetrical adaptive models in the transmitting and receiving systems. The models predict the nth bit based on statistics derived from the preceding (n-1) bits.
The encoders described below are directed to encoding messages using the parameters derived from the bits that have been processed, i.e., the coding is based on the history created by the information already processed.
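The general idea of coding from history can be sketched with a simple counting model. This is an illustration under stated assumptions, not the encoder of this invention or the method of U.S. Pat. No. 4,633,490: the probability of each bit is estimated only from the bits already processed, the counts start at one for each symbol (an assumption made here to avoid zero probabilities), and the function name adaptive_cost is hypothetical. The function returns the ideal total code length, the sum of log(1/P) over the bits, rather than an actual encoded bit stream.

```python
import math

def adaptive_cost(bits: str) -> float:
    """Ideal code length, in bits, when each bit is priced by a model
    built only from the bits that precede it."""
    counts = {"0": 1, "1": 1}     # assumption: one pseudo-count per symbol
    total = 0.0
    for b in bits:
        # Probability of this bit, estimated from history alone.
        p = counts[b] / (counts["0"] + counts["1"])
        total += math.log2(1 / p)
        counts[b] += 1            # update the history after coding the bit
    return total
```

Because a decoder that has recovered the first (n-1) bits can rebuild the same counts, it can form the same estimate for the nth bit without receiving any parameters derived from the whole message.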