1. Field of the Invention
The present invention relates to the implementation of lossless and near-lossless source coding for multiple access networks.
2. Background Art
Source Coding
Source coding, also known as data compression, treats the problem of efficiently representing information for data transmission or storage.
Data compression has a wide variety of applications. In the area of data transmission, compression is used to reduce the amount of data transferred between the sources and the destinations. The reduction in data transmitted decreases the time needed for transmission and increases the overall amount of data that can be sent. For example, fax machines and modems all use compression algorithms so that we can transmit data many times faster than otherwise possible. The Internet uses many compression schemes for fast transmission; the images and videos we download from some bulletin boards are usually in a compressed format.
In the area of data storage, data compression allows us to store more information on our limited storage space by efficiently representing the data. For example, digital cameras use image compression schemes to store more photos on their memory cards, and DVDs use video and audio compression schemes to store movies on portable disks. We could also utilize text compression schemes to reduce the size of text files on computer hard disks.
In many electronic and computer applications, data is represented by a stream of binary digits called bits (e.g., 0 and 1). Here is an example overview of the steps involved in compressing data for transmission. The compression begins with the data itself at the sender. An encoder encodes the data into a stream with a smaller number of bits. For example, an image file to be sent across a computer network may originally be represented by 40,000 bits. After the encoding the number of bits is reduced to 10,000. In the next step, the encoded data is sent to the destination where a decoder decodes the data. In the example, the 10,000 bits are received and decoded to give a reconstructed image. The reconstructed image may be identical to or different from the original image.
Here is another example of the steps involved in compressing data for storage. In making MP3 audio files, people use special audio compression schemes to compress the music and store it on compact discs or in the memory of MP3 players. For example, 700 minutes of MP3 music could be stored on a 650 MB CD that normally stores 74 minutes of music without MP3 compression. To listen to the music, we use MP3 players or MP3 software to decode the compressed music files and get the reconstructed music, which usually has worse quality than the original music.
When transmitting digital data from one part of a computer network to another, it is often useful to compress the data to make the transmission faster. In certain networks, known as multiple access networks, current compression schemes have limitations. The issues associated with such systems can be understood by a review of data transmission, compression schemes, and multiple access networks.
Lossless and Lossy Compression
There are two types of compression, lossless and lossy. Lossless compression techniques involve no loss of information. The original data can be recovered exactly from the losslessly compressed data. For example, text compression usually requires the reconstruction to be identical to the original text, since very small differences may result in very different meanings. Similarly, computer files, medical images, bank records, military data, etc., all need lossless compression.
Lossy compression techniques involve some loss of information. If data have been compressed using lossy compression, the original data cannot be recovered exactly from the compressed data. Lossy compression is used where some sacrifice in reconstruction fidelity is acceptable in light of the higher compression ratios of lossy codes. For example, in transmitting or storing video, exact recovery of the video data is not necessary. Depending on the required quality of the reconstructed video, various amounts of information loss are acceptable. Lossy compression is widely used in Internet browsing, video, image and speech transmission or storage, personal communications, etc.
One way to measure the performance of a compression algorithm is to measure the rate (average length) required to represent a single sample, i.e., $R=\sum_x P(x)\,l(x)$, where l(x) is the length of the codeword for symbol x and P(x) is the probability of x. Another way is to measure the distortion, i.e., the average difference between the original data and the reconstruction.
Fixed-length Code
A fixed-length code uses the same number of bits to represent each symbol in the alphabet. For example, ASCII code is a fixed-length code: it uses 7 bits to represent each letter. The codeword for letter a is 1100001, that for letter A is 1000001, etc.
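The fixed-length property can be checked with a short Python sketch. The helper name `ascii_codeword` is introduced here for illustration only; it relies on the standard 7-bit ASCII table as exposed by Python's built-in `ord`.

```python
def ascii_codeword(ch: str) -> str:
    """Return the 7-bit ASCII codeword of a single character."""
    code_point = ord(ch)
    if code_point > 127:
        raise ValueError("not a 7-bit ASCII character")
    # Zero-pad to exactly 7 binary digits so every codeword has the same length.
    return format(code_point, "07b")


print(ascii_codeword("A"))  # 1000001
print(ascii_codeword("a"))  # 1100001
```

Every codeword produced this way has length 7, regardless of how frequent the letter is.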
Variable-length Code
A variable-length code does not require that all codewords have the same length; thus we may use different numbers of bits to represent different symbols. For example, we may use shorter codewords for more frequent symbols and longer codewords for less frequent symbols; thus on average we could use fewer bits per symbol. Morse code is an example of a variable-length code for the English alphabet. It uses a single dot (.) to represent the most frequent letter E, and four symbols: dash, dash, dot, dash (--.-) to represent the much less frequent letter Q.
Non-singular, Uniquely Decodable, Instantaneous, Prefix-free Code
TABLE 1
Classes of Codes

                          Non-singular,      Uniquely decodable,
                          but not uniquely   but not
Symbols  P(X)  Singular   decodable          instantaneous        Instantaneous
1        0.45  0          1                  1                    1
2        0.25  0          10                 10                   01
3        0.1   1          0                  100                  001
4        0.2   10         110                000                  000
A non-singular code assigns a distinct codeword to each symbol in the alphabet. A non-singular code provides us with an unambiguous description of each single symbol. However, if we wish to send a sequence of symbols, a non-singular code does not promise an unambiguous description. For the example given in Table 1, the first code assigns identical codewords to both symbol ‘1’ and symbol ‘2’, and thus is a singular code. The second code is a non-singular code; however, the binary description of the sequence ‘12’ is ‘110’, which is the same as the binary description of sequence ‘113’ and that of symbol ‘4’. Thus we cannot uniquely decode those sequences of symbols.
We define uniquely decodable codes as follows. A uniquely decodable code is one where no two sequences of symbols have the same binary description. That is to say, any encoded sequence in a uniquely decodable code has only one possible source sequence producing it. However, one may need to look at the entire encoded bit string before determining even the first symbol from the corresponding source sequence. The third code in Table 1 is an example of a uniquely decodable code for the source alphabet. On receiving encoded bit ‘1’, one cannot determine which of the three symbols ‘1’, ‘2’, ‘3’ is transmitted until future bits are received.
An instantaneous code is one that can be decoded without referring to future codewords. The third code is not instantaneous since the binary description of symbol ‘1’ is a prefix of the binary descriptions of symbols ‘2’ and ‘3’, and the description of symbol ‘2’ is also a prefix of the description of symbol ‘3’. We call a code a prefix code if no codeword is a prefix of any other codeword. A prefix code is always an instantaneous code; since the end of a codeword is always immediately recognizable, the decoder can separate the codewords without looking at future encoded symbols. An instantaneous code is also a prefix code, except in the case of multiple access source codes, where an instantaneous code does not need to be prefix free (this case is discussed later). The fourth code in Table 1 gives an example of an instantaneous code that has the prefix free property.
The nesting of these definitions is: the set of instantaneous codes is a subset of the set of uniquely decodable codes, which is a subset of the set of non-singular codes.
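The four classes can be told apart mechanically. The Python sketch below is an illustration, not part of the described invention: it tests non-singularity and the prefix property directly, and tests unique decodability with the Sardinas-Patterson procedure, a standard technique that the text above does not name. The codewords are those of the four codes in Table 1, and all function names are hypothetical. The classification applies to conventional single-source codes.

```python
def is_non_singular(code):
    """True if every symbol has a distinct codeword."""
    return len(set(code.values())) == len(code)


def is_prefix_free(code):
    """True if no codeword is a prefix of another codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


def dangling_suffixes(prefixes, words):
    """Non-empty strings w with p + w == word for some p in prefixes, word in words."""
    return {w[len(p):] for p in prefixes for w in words
            if len(w) > len(p) and w.startswith(p)}


def is_uniquely_decodable(code):
    """Sardinas-Patterson test: fail if a dangling suffix is itself a codeword."""
    words = set(code.values())
    if len(words) < len(code):
        return False  # a singular code cannot be uniquely decodable
    seen = set()
    frontier = dangling_suffixes(words, words)
    while frontier:
        if frontier & words:
            return False
        seen |= frontier
        nxt = dangling_suffixes(frontier, words) | dangling_suffixes(words, frontier)
        frontier = nxt - seen  # only examine suffixes not processed before
    return True


def classify(code):
    """Return the most specific class of Table 1 that the code belongs to."""
    if not is_non_singular(code):
        return "singular"
    if is_prefix_free(code):
        return "instantaneous"
    if is_uniquely_decodable(code):
        return "uniquely decodable, not instantaneous"
    return "non-singular, not uniquely decodable"


table1 = {
    "first":  {"1": "0", "2": "0",  "3": "1",   "4": "10"},
    "second": {"1": "1", "2": "10", "3": "0",   "4": "110"},
    "third":  {"1": "1", "2": "10", "3": "100", "4": "000"},
    "fourth": {"1": "1", "2": "01", "3": "001", "4": "000"},
}
for name, code in table1.items():
    print(name, "->", classify(code))
```

The loop terminates because every dangling suffix is a suffix of some codeword, so only finitely many can appear.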
Tree Representation
We can always construct a binary tree to represent a binary code. We draw a tree that starts from a single node (the root) and has a maximum of two branches at each node. The two branches correspond to ‘0’ and ‘1’ respectively. (Here, we adopt the convention that the left branch corresponds to ‘0’ and the right branch corresponds to ‘1’.) The binary trees for the second to the fourth code in Table 1 are shown in trees 100, 101 and 102 of FIG. 1 respectively.
The codeword of a symbol can be obtained by traversing from the root of the tree to the node representing that symbol. Each branch on the path contributes a bit (‘0’ from each left branch and ‘1’ from each right branch) to the codeword. In a prefix code, the codewords always reside at the leaves of the tree. In a non-prefix code, some codewords will reside at the internal nodes of the tree.
For prefix codes, the decoding process is made easier with the help of the tree representation. The decoder starts from the root of the tree. Upon receiving an encoded bit, the decoder chooses the left branch if the bit is ‘0’ or the right branch if the bit is ‘1’. This process continues until the decoder reaches a tree node representing a codeword. If the code is a prefix code, the decoder can then immediately determine the corresponding symbol.
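The tree-walking decoder described above can be sketched in a few lines of Python. The nested-dictionary tree and the names `build_tree` and `decode` are choices made for this illustration; the code used is the fourth (instantaneous) code of Table 1.

```python
def build_tree(code):
    """Build a binary code tree as nested dicts keyed by '0' and '1'."""
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["symbol"] = symbol  # in a prefix code this node is a leaf
    return root


def decode(bits, root):
    """Decode a bit string by walking from the root, restarting at each codeword."""
    symbols = []
    node = root
    for bit in bits:
        node = node[bit]  # raises KeyError if the bits match no codeword
        if "symbol" in node:
            symbols.append(node["symbol"])
            node = root  # codeword recognized; start the next one
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return symbols


prefix_code = {"1": "1", "2": "01", "3": "001", "4": "000"}
tree = build_tree(prefix_code)
encoded = "".join(prefix_code[s] for s in "132")
print(encoded)                # 100101
print(decode(encoded, tree))  # ['1', '3', '2']
```

Because the code is prefix free, the decoder never needs to look ahead: each symbol is emitted as soon as its last bit arrives.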
Block Code
In the example given in Table 1, each single symbol (‘1’, ‘2’, ‘3’, ‘4’) is assigned a codeword. We can also group the symbols into blocks of length n, treat each block as a super symbol in the extended alphabet, and assign each super symbol a codeword. This code is called a block code with block length n (or coding dimension n). Table 2 below gives an example of a block code with block length n=2 for the source alphabet given in Table 1.
TABLE 2

Block of Symbols  Probability  Code
11                0.2025       00
12                0.1125       010
13                0.045        10010
14                0.09         1000
21                0.1125       111
22                0.0625       1101
23                0.025        11001
24                0.05         0111
31                0.045        10110
32                0.025        101110
33                0.01         110001
34                0.02         110000
41                0.09         1010
42                0.05         0110
43                0.02         101111
44                0.04         10011
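As a check on Table 2, the following Python sketch rebuilds the block probabilities as products of the single-symbol probabilities from Table 1 (which assumes the symbols in a block are independent) and computes the rate per source symbol of the listed code. Exact fractions are used to avoid rounding; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Single-symbol probabilities from Table 1.
p = {
    "1": Fraction(45, 100),
    "2": Fraction(25, 100),
    "3": Fraction(10, 100),
    "4": Fraction(20, 100),
}

# Codewords from Table 2 (block length n = 2).
table2_code = {
    "11": "00",    "12": "010",    "13": "10010",  "14": "1000",
    "21": "111",   "22": "1101",   "23": "11001",  "24": "0111",
    "31": "10110", "32": "101110", "33": "110001", "34": "110000",
    "41": "1010",  "42": "0110",   "43": "101111", "44": "10011",
}

# Probability of each block under the independence assumption.
block_p = {a + b: p[a] * p[b] for a, b in product(p, repeat=2)}

# Average codeword length per block, divided by the block length.
rate = sum(block_p[blk] * len(word) for blk, word in table2_code.items()) / 2

# Kraft sum; a value of 1 means the code tree is complete.
kraft = sum(Fraction(1, 2 ** len(word)) for word in table2_code.values())

print(float(rate))  # 1.8375
print(kraft)        # 1
```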
Huffman Code
A Huffman code is an optimal (shortest average length) prefix code for a given distribution. It is widely used in many compression schemes. The Huffman procedure is based on the following two observations for optimal prefix codes. In an optimal prefix code:
1. Symbols with higher probabilities have codewords no longer than symbols with lower probabilities.
2. The two longest codewords have the same length and differ only in the last bit; they correspond to the two least probable symbols.
Thus the two leaves corresponding to the two least probable symbols are children of the same node.
The Huffman code design proceeds as follows. First, we sort the symbols in the alphabet according to their probabilities. Next we connect the two least probable symbols in the alphabet to a single node. This new node (representing a new symbol) and all the other symbols except for the two least probable symbols in the original alphabet form a reduced alphabet; the probability of the new symbol is the sum of the probabilities of its children (i.e., the two least probable symbols). Then we sort the nodes according to their probabilities in the reduced alphabet and apply the same rule to generate a parent node for the two least probable symbols in the reduced alphabet. This process continues until we get a single node (i.e., the root). The codeword of a symbol can be obtained by traversing from the root of the tree to the leaf representing that symbol. Each branch on the path contributes a bit (‘0’ from each left branch and ‘1’ from each right branch) to the codeword.
The fourth code in Table 1 is a Huffman code for the example alphabet. The procedure of how we build it is shown in FIG. 2A.
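The design procedure above can be sketched in Python using a priority queue in place of repeated sorting. This is a minimal illustration under stated assumptions: the function name `huffman_code` and the tie-breaking counter are not from the source, and the 0/1 labels on the branches may differ from those in Table 1 even though the codeword lengths agree.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Return a dict mapping each symbol (a string) to its Huffman codeword."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable nodes into a new parent node.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p0 + p1, next(tiebreak), (left, right)))
    _, _, tree = heap[0]

    code = {}

    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: (left child, right child)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                             # leaf: a source symbol
            code[node] = prefix or "0"    # single-symbol alphabet gets one bit

    walk(tree, "")
    return code


p = {"1": 0.45, "2": 0.25, "3": 0.1, "4": 0.2}
code = huffman_code(p)
lengths = {s: len(w) for s, w in code.items()}
rate = sum(p[s] * lengths[s] for s in p)
print(lengths)  # {'1': 1, '2': 2, '3': 3, '4': 3}
print(rate)     # about 1.85 bits per symbol
```

The codeword lengths 1, 2, 3, 3 match those of the fourth code in Table 1.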
Entropy Code
The entropy of source X is defined as: $H(X)=-\sum_x p(x)\log p(x)$. Given a probability model, the entropy is the lowest rate at which the source can be losslessly compressed.
The rate R of the Huffman code for source X is bounded below by the entropy H(X) of source X and bounded above by the entropy plus one bit, i.e., H(X)≦R<H(X)+1. Consider data sequence $X^n=(X_1,X_2,X_3,\ldots,X_n)$ where each element of the sequence is independently and identically generated. If we code sequence $X^n$ using a Huffman code, the resulting rate (average length per symbol) satisfies:
$$\frac{H(X^n)}{n}\le R<\frac{H(X^n)+1}{n}.$$
Because the elements are independent and identically distributed, $H(X^n)=nH(X)$, so the bound reads $H(X)\le R<H(X)+1/n$. Thus when the block length (or coding dimension) n is arbitrarily large, the achievable rate is arbitrarily close to the entropy H(X). We call this kind of code an ‘entropy code’, i.e., a code whose rate is arbitrarily close to the entropy when the coding dimension is arbitrarily large.
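For an independent and identically distributed source, the block-coding bound reduces to H(X) ≤ R < H(X) + 1/n. The Python sketch below computes the entropy of the Table 1 distribution (with base-2 logarithms, an assumption consistent with rates in bits) and checks this bound against the rates of 1.85 and 1.8375 bits per symbol that the text reports for block lengths one and two. The function name `entropy` is illustrative.

```python
import math


def entropy(probabilities):
    """Entropy in bits of a probability distribution given as an iterable."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


p = [0.45, 0.25, 0.1, 0.2]   # P(X) from Table 1
h = entropy(p)
print(round(h, 3))           # 1.815

# Rates reported in the text for Huffman codes with block length n.
reported_rates = {1: 1.85, 2: 1.8375}
for n, rate in reported_rates.items():
    within_bound = h <= rate < h + 1 / n
    print(n, rate, within_bound)
```

Both reported rates fall inside the bound, and the gap to the entropy shrinks from about 0.035 to about 0.023 bits per symbol when the block length goes from one to two.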
Arithmetic Code
Arithmetic code is another, increasingly popular, entropy code that is used widely in many compression schemes. For example, it is used in the compression standard JPEG 2000.
We can achieve efficient coding by using long blocks of source symbols. For example, for the alphabet given in Table 1, its Huffman code rate is 1.85 bits per symbol. Table 2 gives an example of a Huffman code for the corresponding extended alphabet with block length two; the resulting rate is 1.8375 bits per symbol, showing a performance improvement. However, Huffman coding is not a good choice for coding long blocks of symbols, since in order to assign a codeword to a particular sequence of length n, it requires calculating the probabilities of all sequences of length n and constructing the complete Huffman coding tree (equivalent to assigning codewords to all sequences of length n). Arithmetic coding is a better scheme for block coding; it assigns a codeword to a particular sequence of length n without having to generate codewords for all sequences of length n. Thus it is a low complexity, high dimensional coding scheme.
In arithmetic coding, a unique identifier is generated for each source sequence. This identifier is then assigned a unique binary code. In particular, data sequence $X^n$ is represented by an interval of the [0,1) line. We describe $X^n$ by describing the mid-point of the corresponding interval to sufficient accuracy to avoid confusion with neighboring intervals. This mid-point is the identifier for $X^n$. We find the interval for $x^n$ recursively, by first breaking [0,1) into intervals corresponding to all possible values of $x_1$, then breaking the interval for the observed $X_1$ into subintervals corresponding to all possible values of $X_1x_2$, and so on. Given the interval $A\subseteq[0,1)$ for $X^k$ for some $0\le k<n$ (the interval for $X^0$ is [0,1)), the subintervals for $\{X^kx_{k+1}\}$ are ordered subintervals of A with lengths proportional to $p(x_{k+1})$.
For the alphabet given in Table 1, FIG. 2B shows how to determine the interval for sequence ‘132’. Once the interval [0.33525, 0.3465) is determined for ‘132’, we can use binary code to describe the mid-point 0.340875 to sufficient accuracy as the binary representation for sequence ‘132’.
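The recursive interval refinement can be sketched in Python as follows. The sketch assumes that the subintervals are laid out along [0,1) in the symbol order 1, 2, 3, 4 (FIG. 2B is not reproduced here, so this ordering is an assumption), and it uses exact fractions so that no rounding enters the result. The function name `sequence_interval` is illustrative.

```python
from fractions import Fraction


def sequence_interval(sequence, probabilities):
    """Return (low, high) of the interval assigned to a symbol sequence.

    `probabilities` maps symbols to probabilities; its insertion order is
    taken as the order of the subintervals along [0, 1).
    """
    symbols = list(probabilities)
    low, width = Fraction(0), Fraction(1)
    for x in sequence:
        # Total probability of the symbols that precede x in the ordering.
        offset = sum((probabilities[s] for s in symbols[:symbols.index(x)]),
                     Fraction(0))
        low += width * offset          # move to the start of x's subinterval
        width *= probabilities[x]      # shrink to the length of x's subinterval
    return low, low + width


p = {
    "1": Fraction(45, 100),
    "2": Fraction(25, 100),
    "3": Fraction(10, 100),
    "4": Fraction(20, 100),
}
low, high = sequence_interval("132", p)
midpoint = (low + high) / 2
print(float(low), float(high))  # 0.33525 0.3465
print(float(midpoint))          # 0.340875
```

The width of the final interval equals the probability of the sequence, here 0.45 × 0.1 × 0.25 = 0.01125.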
In arithmetic coding, the description length of data sequence $x^n$ is $l(x^n)=\lceil-\log p_X(x^n)\rceil+1$, where $p_X(x^n)$ is the probability of $x^n$; this ensures that the intervals corresponding to different codewords are disjoint and the code is prefix free. Thus the average rate per symbol for arithmetic code is
$$R=\frac{1}{n}\sum_{x^n}p_X(x^n)\,l(x^n)=\frac{1}{n}\sum_{x^n}p_X(x^n)\left(\lceil-\log p_X(x^n)\rceil+1\right).$$
Rate R is then bounded as:
$$\frac{H(X^n)}{n}\le R<\frac{H(X^n)+2}{n},$$
which, since $H(X^n)=nH(X)$ for an independent and identically distributed source, shows R is arbitrarily close to the source entropy H(X) when coding dimension n is arbitrarily large.
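The description length formula can be turned into a codeword by truncating the binary expansion of the interval mid-point to l bits. The Python sketch below illustrates this for the sequence ‘132’, whose interval runs from 1341/4000 to 693/2000 (0.33525 to 0.3465) when the Table 1 symbols are ordered 1, 2, 3, 4 along [0,1). Truncation of the mid-point is one common way to realize the description; the function names are hypothetical and logarithms are taken base 2.

```python
from fractions import Fraction
from math import floor


def description_length(p):
    """Return ceil(-log2(p)) + 1 for a probability p, using exact arithmetic."""
    k = 0
    while Fraction(1, 2 ** k) > p:   # smallest k with 2**-k <= p
        k += 1
    return k + 1


def codeword(low, high):
    """Binary codeword: the mid-point of [low, high) truncated to l bits."""
    length = description_length(high - low)
    midpoint = (low + high) / 2
    truncated = floor(midpoint * 2 ** length)
    return format(truncated, f"0{length}b")


low, high = Fraction(1341, 4000), Fraction(693, 2000)  # interval for '132'
word = codeword(low, high)
print(word)  # 01010111

# Every number whose binary expansion starts with `word` lies in [low, high),
# which is why codewords of different sequences cannot be prefixes of each other.
start = Fraction(int(word, 2), 2 ** len(word))
end = start + Fraction(1, 2 ** len(word))
print(low <= start and end <= high)  # True
```

Here the sequence probability is 0.01125, so the description length is ⌈6.47…⌉ + 1 = 8 bits.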
Multiple Access Networks
A multiple access network is a system with several transmitters sending information to a single receiver. One example of a multiple access system is a sensor network, where a collection of separately located sensors sends correlated information to a central processing unit. Multiple access source codes (MASCs) yield efficient data representation for multiple access systems when cooperation among the transmitters is not possible. An MASC can also be used in data storage systems, for example, archive storage systems where information stored at different times is independently encoded but all information can be decoded together if this yields greater efficiency.
In the MASC configuration (also known as the Slepian-Wolf configuration) depicted in FIG. 3A, two correlated information sequences $\{X_i\}_{i=1}^{\infty}$ and $\{Y_i\}_{i=1}^{\infty}$ are drawn i.i.d. (independently and identically distributed) according to joint probability mass function (p.m.f.) p(x,y). The encoder for each source operates without knowledge of the other source. The decoder receives the encoded bit streams from both sources. The rate region for this configuration is plotted in FIG. 3B. This region describes the rates achievable in this scenario for sufficiently large coding dimension and decoding error probability $P_e^{(n)}$ approaching zero as the coding dimension grows. Making these ideas applicable in practical network communications scenarios requires MASC design algorithms for finite dimensions. We consider two coding scenarios: first, we consider lossless ($P_e^{(n)}=0$) MASC design for applications where perfect data reconstruction is required; second, we consider near-lossless ($P_e^{(n)}$ is small but non-zero) code design for use in lossy MASCs.
The interest in near-lossless MASCs is inspired by the discontinuity in the achievable rate region associated with going from near-lossless to truly lossless coding. For example, if p(x,y)>0 for all (x,y) pairs in the product alphabet, then the optimal instantaneous lossless MASC achieves rates bounded below by H(X) and H(Y) in its descriptions of X and Y, giving a total rate bounded below by H(X)+H(Y). In contrast, the rate of a near-lossless MASC is bounded below by H(X,Y), which may be much smaller than H(X)+H(Y). This example demonstrates that the move from lossless coding to near-lossless coding can give very large rate benefits. While nonzero error probabilities are unacceptable for some applications, they are acceptable on their own for some applications and within lossy MASCs in general (assuming a suitably small error probability). In lossy MASCs, a small increase in the error probability increases the code's expected distortion without causing catastrophic failure.
MASC Versus Traditional Compression
To compress the data used in a multiple access network using conventional methods, the sources are coded independently, i.e., the two sources X and Y are independently encoded by the two senders and independently decoded at the receiver. This approach is convenient, since it allows for direct application of traditional compression techniques to a wide variety of multiple access system applications. However, this approach is inherently flawed because it disregards the correlation between the two sources.
An MASC, on the contrary, takes advantage of the correlation among the sources; it uses independent encoding and joint decoding for the sources. (Joint encoding is prohibited because the source encoders are at isolated locations or for other reasons.)
For lossless coding, the rates achieved by the traditional approach (independent encoding and decoding) are bounded below by H(X) and H(Y) for the two sources respectively, i.e., $R_X\ge H(X)$, $R_Y\ge H(Y)$, and $R_X+R_Y\ge H(X)+H(Y)$. The rates achieved by MASC are bounded as follows:
$$R_X\ge H(X|Y)$$
$$R_Y\ge H(Y|X)$$
and
$$R_X+R_Y\ge H(X,Y).$$
When X and Y are correlated, H(X)>H(X|Y), H(Y)>H(Y|X) and H(X)+H(Y)>H(X,Y). Thus, MASCs can generally achieve better performance than the traditional independent coding approach.
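The gap between the two sets of bounds can be made concrete with a small numerical example. The joint p.m.f. in the Python sketch below is hypothetical and chosen only for illustration (two binary sources that agree with probability 0.8); it is not taken from the source. Entropies are in bits.

```python
import math
from collections import defaultdict


def entropy(pmf):
    """Entropy in bits of a p.m.f. given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


# Hypothetical joint p.m.f. p(x, y) for two correlated binary sources.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p.m.f.s of X and Y.
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint)
h_x_given_y = h_xy - h_y   # chain rule: H(X,Y) = H(Y) + H(X|Y)
h_y_given_x = h_xy - h_x

print(f"independent coding: R_X >= {h_x:.3f}, R_Y >= {h_y:.3f}, "
      f"sum >= {h_x + h_y:.3f}")
print(f"MASC bounds:        R_X >= {h_x_given_y:.3f}, R_Y >= {h_y_given_x:.3f}, "
      f"sum >= {h_xy:.3f}")
```

For this example the sum-rate bound drops from 2 bits to about 1.72 bits per pair of symbols when the correlation is exploited.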
Prior Attempts
A number of prior art attempts have been made to provide optimal codes for multiple access networks. Examples include H. S. Witsenhausen. “The Zero-Error Side Information Problem And Chromatic Numbers.” IEEE Transactions on Information Theory, 22:592–593, 1976; A. Kh. Al Jabri and S. Al-Issa. “Zero-Error Codes For Correlated Information Sources”. In Proceedings of Cryptography, pages 17–22, Cirencester, UK, December 1997; S. S. Pradhan and K. Ramchandran. “Distributed Source Coding Using Syndromes (DISCUS) Design And Construction”. In Proceedings of the Data Compression Conference, pages 158–167, Snowbird, Utah, March 1999. IEEE; and Y. Yan and T. Berger. “On Instantaneous Codes For Zero-Error Coding Of Two Correlated Sources”. In Proceedings of the IEEE International Symposium on Information Theory, page 344, Sorrento, Italy, June 2000. IEEE.
Witsenhausen, Al Jabri, and Yan treat the problem as a side information problem, where both encoder and decoder know X, and the goal is to describe Y using the smallest average rate possible while maintaining the unique decodability of Y given the known value of X. Neither Witsenhausen nor Al Jabri is optimal in this scenario, as shown in Yan. Yan and Berger find a necessary and sufficient condition for the existence of a lossless instantaneous code with a given set of codeword lengths for Y when the alphabet size of X is two. Unfortunately, their approach fails to yield a necessary and sufficient condition for the existence of a lossless instantaneous code when the alphabet size for X is greater than two. Pradhan and Ramchandran tackle the lossless MASC code design problem when source Y is guaranteed to be at most a prescribed Hamming distance from source X. Methods for extending this approach to design good codes for more general p.m.f.s p(x,y) are unknown.