Source coding, or data compression, techniques reduce the volume of data generated by a source for transmission over a channel or for storage. Economies are realized when the volume of data to be transmitted or stored is decreased. There are two kinds of data compression techniques: information preserving and information degrading. A process is information preserving if all the information present before coding can be regenerated after coding. Conversely, a process is information degrading if it is irreversible in the sense that the original information cannot be regenerated exactly after the process is performed. An information degrading process can be used when the user of the information is not interested in all the information generated by the source. If the user is only selectively interested in the information, the information degrading source coding process can reflect that selectivity. Also, if the user is willing to accept a fidelity criterion, the data may be degraded within certain limits and still remain useful. In general, more data compression can be attained with information degrading processes than with information preserving processes.
Source coding systems can be classified as statistical or ad hoc, depending on whether or not the system is in some sense optimal in terms of Shannon's noiseless coding theorem. A final classification of source coding systems is the following: if the system accepts digits from the source in fixed length blocks and then generates a variable length sequence of encoded digits, the system is "block-to-variable"; if the system accepts a variable length sequence of digits from the source and generates a fixed length block of digits, the system is "variable-to-block."
The first optimal coding scheme to be developed was Huffman coding. This technique is block-to-variable. Huffman coding specifies the optimal coding for M messages with probabilities p.sub.1, p.sub.2, . . . , p.sub.M. The least probable message is assigned a code word containing the longest sequence of bits and more probable messages are assigned code words of shorter length. When Huffman coding is applied to a binary memoryless source, where p is the probability that a "1" is emitted, the source sequence must be broken up into blocks N bits long. Each block then contains one of M = 2.sup.N possible messages. As N gets large it becomes impractical to calculate the probability of each message and to generate the corresponding code word. However, Huffman coding is only optimal as N approaches infinity. Furthermore, this coding scheme requires knowledge of the source statistics and the code words must be computed a priori. The code words must be stored in the memory of the encoder and a table look-up performed on each block. The memory requirements grow exponentially with N. In short, complexity and required knowledge of the source statistics detract from the desirability of Huffman coding.
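The growth of the code word table with N can be illustrated by a minimal Python sketch that builds a Huffman code over all 2.sup.N blocks of a binary memoryless source. The function name huffman_block_code and the example values p = 0.1 and N = 3 are assumptions chosen for illustration and do not appear in the foregoing description.

```python
import heapq
from itertools import product

def huffman_block_code(p, n):
    """Return {block: code word} for all 2**n blocks of a binary
    memoryless source that emits a "1" with probability p."""
    heap = []
    for i, bits in enumerate(product("01", repeat=n)):
        block = "".join(bits)
        w = block.count("1")
        # The probability of a block depends only on its number of ones.
        prob = (p ** w) * ((1 - p) ** (n - w))
        # Entry: (probability, unique tie-breaker, partial code table).
        heap.append((prob, i, {block: ""}))
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prepending one bit to each.
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {b: "0" + c for b, c in c0.items()}
        merged.update({b: "1" + c for b, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

code = huffman_block_code(0.1, 3)
for block in sorted(code, key=lambda b: (len(code[b]), b)):
    print(block, code[block])
```

With these assumed values the table holds 2.sup.3 = 8 code words; the most probable block 000 receives a 1-bit code word and the least probable block 111 receives a 5-bit code word. Every unit increase in N doubles the number of entries that must be computed and stored.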
Run length coding is an ad hoc coding scheme that works well when p is either very small or very large. In its simplest implementation the number of consecutive zeros is counted. When the first "1" is encountered, the number of zeros is transmitted as a block of binary digits. Thus, run length coding is variable-to-block and this fact simplifies the hardware considerably. Run length coding is, however, limited by the fact that it is reasonably efficient only when p is very small or very large and even in these cases it is not optimal.
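The simplest implementation described above can be sketched in Python as follows. The function names, the block width k, and in particular the handling of runs longer than a block can express (an all-ones block meaning that the run continues) are assumptions added here for illustration; the foregoing description does not specify how long runs are handled.

```python
def run_length_encode(bits, k):
    """Encode a string of '0'/'1' characters as k-bit blocks, each block
    holding the number of zeros counted before a '1'."""
    max_run = (1 << k) - 1          # assumed escape value: run continues
    out = []
    run = 0
    for b in bits:
        if b == "0":
            run += 1
            if run == max_run:      # run too long for one block
                out.append(format(max_run, f"0{k}b"))
                run = 0
        else:
            out.append(format(run, f"0{k}b"))
            run = 0
    # Zeros after the final '1' are not flushed in this sketch.
    return "".join(out)

def run_length_decode(code, k):
    """Invert run_length_encode for inputs that end in a '1'."""
    max_run = (1 << k) - 1
    out = []
    for i in range(0, len(code), k):
        run = int(code[i:i + k], 2)
        out.append("0" * run)
        if run != max_run:          # a normal block is closed by a '1'
            out.append("1")
    return "".join(out)

source = "0001" + "00000000001" + "1" + "01"
coded = run_length_encode(source, 3)
print(len(source), len(coded), run_length_decode(coded, 3) == source)
```

Every output block has the same length k, which is the variable-to-block property noted above. When ones are frequent, each one still costs a full k-bit block, which is why the scheme is efficient only for extreme values of p.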
The Schalkwijk algorithm is based on the ranking of binary sequences of length N and weight W, where the weight of a sequence is the number of ones it contains. For a binary memoryless source characterized by p, as N approaches infinity, we expect to get a weight W = pN with a probability equal to 1. The basic idea is to rank all sequences of length N containing W = pN ones. The binary encoding of a sequence is then its rank in binary form. Schalkwijk has shown that as N approaches infinity this technique is optimal. Since there are (.sub.W.sup.N) possible sequences, i.e., ##EQU1## each member can be encoded in a word of log.sub.2 (.sub.W.sup.N) bits. Thus, the compression ratio achieved is ##EQU2## which approaches the theoretical limit H(W/N).sup.-1 as N approaches infinity. Mathematically, the rank of the sequence t is given by ##EQU3## where T(N,W) is the set of all binary sequences of length N and weight W, where t.sub.K .epsilon. {0,1} is the K.sup.th member of t, and where ##EQU4##
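The equations marked by the placeholders above are not reproduced in this text. A reconstruction that is consistent with the ranks listed in Table 1 and with the worked example for the sequence 010100 is sketched below in LaTeX. It should be read as an assumption, not as the original equations; in particular the auxiliary symbol W_k (the number of ones not yet accounted for when bit k is reached) is introduced here for illustration.

```latex
% Number of sequences of length N and weight W
\[ \binom{N}{W} = \frac{N!}{W!\,(N-W)!} \]

% Compression ratio: N source bits per code word of \log_2 \binom{N}{W} bits
\[ \frac{N}{\log_2 \binom{N}{W}} \]

% Rank of t = (t_1, \dots, t_N) \in T(N,W)
\[ i(t) = \sum_{k=1}^{N} t_k \binom{N-k}{W_k},
   \qquad W_k = W - \sum_{j=1}^{k-1} t_j,
   \qquad \binom{n}{m} = 0 \ \text{for } m > n \]

% Check against the worked example, t = 010100 (N = 6, W = 2)
\[ i(010100) = \binom{4}{2} + \binom{2}{1} = 6 + 2 = 8 \]
```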
An understanding of the above formula can be facilitated by visualizing it as a random walk through Pascal's triangle determined by the bits in the input sequence, starting at the W.sup.th position of the N.sup.th row, i.e., at the entry corresponding to (.sub.W.sup.N) and terminating at the apex. FIG. 1 illustrates the computation for the sequence 010100.
Given the rank i(t) = 8, the original sequence t can be reconstructed by using the following algorithm. Start at (.sub.2.sup.6) = 15 in FIG. 1. The rank 8 of the sequence is less than the number 10, at a single step in the X-direction from the current starting point 15. Move one step in the X-direction, toward 10, and record a 0. The rank 8 of the sequence t is not less than the number 6, at a single step in the X-direction from the current starting point 10. Move one step in the Y-direction, toward 4; subtract the number 6 used in the comparison from the rank 8 of the sequence t, giving a new rank 8 - 6 = 2, and record a 1, giving 01. The current rank 2 of the sequence t is less than the number 3, at a single step in the X-direction from the current starting point 4. Move one step in the X-direction, toward 3, and record a 0, giving 010. The current rank 2 of the sequence is not less than the number 2, at a single step in the X-direction from the current starting point 3. Move one step in the Y-direction, toward 1; subtract the number 2 used in the comparison from the current rank 2 of the sequence t, giving a new rank 2 - 2 = 0, and record a 1, giving 0101. The current rank of the sequence t is now 0. Thus the last two steps are taken in the X-direction, resulting in the desired sequence 010100.
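The ranking computation and the reconstruction procedure just described can be sketched in Python as follows. The function names rank and unrank are assumptions made for illustration, as is the use of math.comb to compute each entry of Pascal's triangle on demand instead of reading it from a stored table.

```python
from math import comb

def rank(t):
    """Rank of the binary string t among all strings of the same
    length and weight."""
    n, w = len(t), t.count("1")
    r = 0
    for bit in t:
        n -= 1                  # every bit moves one row toward the apex
        if bit == "1":
            r += comb(n, w)     # entry one X-step away (0 when w > n)
            w -= 1              # a 1 is a step in the Y-direction
    return r

def unrank(r, n, w):
    """Rebuild the string of length n and weight w whose rank is r."""
    out = []
    while n > 0:
        n -= 1
        c = comb(n, w)          # entry one X-step from the current position
        if r < c:
            out.append("0")     # step in the X-direction
        else:
            out.append("1")     # step in the Y-direction
            r -= c
            w -= 1
    return "".join(out)

print(rank("010100"), unrank(8, 6, 2))
```

Here rank("010100") evaluates to 8 and unrank(8, 6, 2) returns 010100, matching the walk described above. Applying unrank to the ranks 0 through 14 with N = 6 and W = 2 yields the sequences of the set T(6,2) in rank order.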
Table 1 below gives the complete set T(6,2) with the corresponding ranks of the sequences.
TABLE 1.
______________________________________
SET T(6,2) WITH CORRESPONDING RANKING
Rank    Sequence
______________________________________
 0      000011
 1      000101
 2      000110
 3      001001
 4      001010
 5      001100
 6      010001
 7      010010
 8      010100
 9      011000
10      100001
11      100010
12      100100
13      101000
14      110000
______________________________________
Schalkwijk developed two schemes for implementation of this ranking procedure: (1) Variable-to-block coding and (2) Block-to-variable coding.
The variable-to-block coding scheme accounts in the following way for the fact that an actual source sequence of N bits may not contain exactly W ones. The starting point for the coding process is the entry in Pascal's triangle which equals (.sub.W.sup.N). Bits are accepted from the source one at a time. Let n be the number of bits that have been accepted at any point and w the weight of the accepted sequence. If for n < N, the weight of the accepted sequence equals W, source bits are no longer accepted and the accepted sequence is filled out with N - n dummy zeros. If for n < N, W - w = N - n, source bits are no longer accepted and the sequence is filled out with W - w dummy ones. In any case at most N - 1 source digits are accepted followed by a dummy digit.
FIG. 2 illustrates how the algorithm works when visualized as a random walk in Pascal's triangle. The walk starts, for example, at (.sub.2.sup.6) = 15. Assume that the incoming sequence is 001101. . . . For each 0 encountered a step in the X-direction is taken and for each 1 encountered a step in the Y-direction is taken. Thus, two steps in the X-direction are taken to arrive at 6, the second element in the fourth row and then two steps in the Y-direction are taken to arrive at 1, the 0.sup.th element of the second row. Since this last element is on a boundary, the sequence is filled out with dummy zeros. Thus i(001100) = 5 = 0101 is transmitted. If the right hand boundary had been reached the sequence would have been filled out with dummy ones. The decoder looks at the decoded sequence and, if the last bit is a zero, strips off all zeros until it comes to a one. If the last bit is a 1, it strips off all ones until it comes to a zero.
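A minimal Python sketch of this variable-to-block procedure is given below. The function names encode_block and decode_block, the representation of the source as an iterator of '0'/'1' characters, and the choice of code word width (the smallest number of bits able to hold every rank from 0 to (.sub.W.sup.N) - 1) are assumptions made for illustration. The sketch also assumes 0 < W < N.

```python
from math import comb

def rank(t):
    n, w, r = len(t), t.count("1"), 0
    for bit in t:
        n -= 1
        if bit == "1":
            r += comb(n, w)
            w -= 1
    return r

def unrank(r, n, w):
    out = []
    while n > 0:
        n -= 1
        c = comb(n, w)
        if r < c:
            out.append("0")
        else:
            out.append("1")
            r -= c
            w -= 1
    return "".join(out)

def encode_block(source, N, W):
    """Accept bits until a boundary is reached; return the fixed length
    code word and the number of source bits accepted."""
    accepted = []
    n = w = 0
    # Stop when w == W (left boundary) or W - w == N - n (right boundary).
    while w < W and W - w < N - n:
        bit = next(source)
        accepted.append(bit)
        n += 1
        if bit == "1":
            w += 1
    filler = "0" if w == W else "1"          # dummy zeros or dummy ones
    block = "".join(accepted) + filler * (N - n)
    width = (comb(N, W) - 1).bit_length()    # assumed code word width
    return format(rank(block), f"0{width}b"), n

def decode_block(codeword, N, W):
    """Recover the accepted source bits by stripping the dummy digits."""
    block = unrank(int(codeword, 2), N, W)
    return block.rstrip(block[-1])           # drop the trailing run

codeword, used = encode_block(iter("001101"), 6, 2)
print(codeword, used, decode_block(codeword, 6, 2))
```

For the incoming sequence 001101 with N = 6 and W = 2, four source bits are accepted, the code word 0101 is produced, and the decoder recovers 0011, as in the example above. The remaining source bits 01 would start the next code word.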
The main limitation of Schalkwijk's variable-to-block scheme is the fact that the source statistics must be known a priori. For sources of unknown, nonstationary, or time-varying statistics, this scheme is unrealistic. Also, even for sources of known statistics, Schalkwijk's theory gives no reasonable way to decide upon the block length as a function of source statistics, equipment complexity, and compression ratio for finite block length.
Schalkwijk's second scheme, block-to-variable coding, accounts in the following way for the fact that an actual source sequence of N bits may not contain exactly W ones. A block of N source bits is accepted. The weight W of the block is measured. The starting point in Pascal's triangle is determined by W and is the point on the N.sup.th line whose entry is equal to (.sub.W.sup.N). The encoded block will be of variable length depending on the weight W. A prefix of log.sub.2 (N + 1) bits is attached to the encoded word to distinguish among the N + 1 positions on the N.sup.th line and communicate this starting point information to the decoder. The prefix contains the weight W.
The process can be visualized as a random walk in Pascal's triangle, starting at the apex, by referring to FIG. 3. Let the block length N = 6. Assume that the input sequence is 0100101100. . . . The coder looks at the first six bits, notices that there are two ones, and sets the starting point at (.sub.2.sup.6) = 15. The coding process terminates when the boundary corresponding to the block length is reached, in this case the sixth row. The walk starts with a step in the X-direction corresponding to the first zero, and proceeds with a step in the Y-direction corresponding to the first 1, and 1 is added to the running suffix giving a total of 1. Next, two steps in the X-direction followed by one step in the Y-direction are taken and 6 is added to the running suffix giving 7. Next one step in the X-direction is taken, arriving at (.sub.2.sup.6) = 15, and here the process terminates. The coded suffix equals 7.
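The prefix and suffix construction can be sketched in Python as follows. The function names encode_block and decode_block are assumptions made for illustration. The sketch computes the suffix by the direct ranking sum and does not reproduce the step-by-step walk of FIG. 3; the suffix width, taken as the smallest number of bits able to hold every rank for the measured weight, is likewise an assumption.

```python
from math import comb

def rank(t):
    n, w, r = len(t), t.count("1"), 0
    for bit in t:
        n -= 1
        if bit == "1":
            r += comb(n, w)
            w -= 1
    return r

def unrank(r, n, w):
    out = []
    while n > 0:
        n -= 1
        c = comb(n, w)
        if r < c:
            out.append("0")
        else:
            out.append("1")
            r -= c
            w -= 1
    return "".join(out)

def encode_block(block):
    """Return prefix (the weight W) followed by suffix (the rank)."""
    N = len(block)
    W = block.count("1")
    prefix_len = N.bit_length()                 # bits needed for 0..N
    suffix_len = (comb(N, W) - 1).bit_length()  # varies with W
    prefix = format(W, f"0{prefix_len}b")
    suffix = format(rank(block), f"0{suffix_len}b") if suffix_len else ""
    return prefix + suffix

def decode_block(code, N):
    """Return the decoded block and the number of code bits consumed."""
    prefix_len = N.bit_length()
    W = int(code[:prefix_len], 2)
    suffix_len = (comb(N, W) - 1).bit_length()
    suffix = code[prefix_len:prefix_len + suffix_len]
    r = int(suffix, 2) if suffix_len else 0
    return unrank(r, N, W), prefix_len + suffix_len

code = encode_block("010010")
print(code, decode_block(code, 6))
```

For the block 010010 the prefix is 010 (W = 2) and the suffix is 0111 (rank 7), in agreement with the coded suffix obtained above. The decoder reads the prefix first and from it determines how many suffix bits follow, which is what allows the encoded blocks to be of variable length.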
The limitation of this scheme is complexity. Taking as a measure of complexity the number of bits required to store Pascal's triangle in memory, we have the following result: Complexity = N(N + 1)(N + 2)/2 = (N.sup.3 + 3N.sup.2 + 2N)/2. There are (N + 1)(N + 2)/2 words and the maximum word length is approximately N bits. In a practical system all words would have to be standardized at N bits even though the transmitted word length might be less than N. From the above result it can be seen that the complexity increases as the cube of the block length N. The most serious drawback to the Schalkwijk block-to-variable scheme is that the complexity of the hardware required to approach optimum increases as the entropy decreases. This is the opposite of what should be the case. The reason for this perverse behavior of the Schalkwijk block-to-variable scheme is the following. The length of the prefix sets a limit on the maximum obtainable compression. This limit is equal to the input block length divided by the length of the prefix, where the prefix length is log.sub.2 (N + 1) rounded up to a whole number of bits, i.e., log.sub.2 N + 1 when N is a power of 2.
Table 2 below gives the length of the prefix as a function of input block length.
TABLE 2.
______________________________________
PREFIX AS A FUNCTION OF INPUT BLOCK LENGTH
Prefix Length    Input Block Length
______________________________________
 2                  2
 3                  4
 4                  8
 5                 16
 6                 32
 7                 64
 8                128
 9                256
10                512
11               1024
______________________________________
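The relationship between block length, prefix length, compression limit, and storage can be tabulated with a short Python sketch. The function names prefix_length, max_compression, and triangle_storage_bits are assumptions made for illustration; the formulas are those stated above, with the prefix taken as the smallest whole number of bits able to represent the weights 0 through N.

```python
def prefix_length(N):
    """Bits needed to distinguish the N + 1 possible weights 0..N."""
    return N.bit_length()       # equals log2(N + 1) rounded up

def max_compression(N):
    """Upper limit on compression: block length over prefix length."""
    return N / prefix_length(N)

def triangle_storage_bits(N):
    """Bits to store Pascal's triangle: (N + 1)(N + 2)/2 words of N bits."""
    return N * (N + 1) * (N + 2) // 2

for k in range(1, 11):
    N = 2 ** k
    print(N, prefix_length(N), round(max_compression(N), 1),
          triangle_storage_bits(N))
```

For the block lengths listed in Table 2 the computed prefix lengths run from 2 to 11. The compression limit grows only as N divided by the prefix length while the storage grows as the cube of N, so a source of very low entropy demands a large and costly block length before its compressibility can be exploited.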
A further drawback to the block-to-variable scheme is that for any particular input block length N, the word length within the hardware implementation must also be N, although for low entropy blocks the length of the coded output will be much less than N. There is inefficient use of space within the hardware since the full word length is utilized only when W = N/2. In other cases of low entropy, the additional complexity implied by the longer word length is wasted.