1. Field of the Invention
This invention generally relates to data compression and, more specifically, to methods and apparatus for losslessly compressing data. Even more specifically, the preferred embodiment of the invention relates to rateless source coding, with or without decoder side information, using a small number of codes, each of which can be adapted to accommodate a wide range of compression rates and input data lengths.
2. Background Art
The state-of-the-art approach to lossless data compression with decoder-only side information (also called Slepian-Wolf coding) is to use low-density parity-check (LDPC) codes. To see how an LDPC code can be applied to compress a sequence X1X2 . . . XN, let us look at an example. Without loss of generality, suppose that the sequence X1X2 . . . XN is a binary sequence. On the encoder side, since an LDPC code can be conveniently represented by a bi-partite graph, we can regard X1X2 . . . XN as the input to the variable nodes (circle-shaped) of the bi-partite graph in FIG. 1. The output S1S2 . . . Sm of the LDPC code is taken from the check nodes (square-shaped) of the bi-partite graph in FIG. 1 in response to the input X1X2 . . . XN, where each Sj is a linear combination of a subset of X1X2 . . . XN. For example, if S1 is connected to X1, X3, and X5, then S1=X1⊕X3⊕X5, where ⊕ denotes addition in the binary field GF(2). In the literature, the sequence S1S2 . . . Sm is referred to as the syndrome of X1X2 . . . XN. Since S1S2 . . . Sm is typically a much shorter binary sequence than X1X2 . . . XN, compression is achieved, and the compression rate is equal to m/N.
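The syndrome computation described above can be sketched in a few lines. This is a minimal illustration only: the graph `CHECKS` (three check nodes over six variable nodes) and the function name `syndrome` are hypothetical and are not taken from FIG. 1; only the first check mirrors the example in the text, in which S1 is connected to X1, X3, and X5.

```python
# Hypothetical toy bi-partite graph: entry j lists the 0-based indices of the
# variable nodes connected to check node j. Only the first row follows the
# example in the text (S1 connected to X1, X3, X5); the rest is assumed.
CHECKS = [
    [0, 2, 4],  # S1 = X1 xor X3 xor X5
    [1, 2, 5],  # S2 = X2 xor X3 xor X6
    [0, 3, 5],  # S3 = X1 xor X4 xor X6
]


def syndrome(x, checks):
    """Return one syndrome bit per check node: the GF(2) sum of its inputs."""
    out = []
    for neighbors in checks:
        bit = 0
        for i in neighbors:
            bit ^= x[i]  # addition in GF(2) is XOR
        out.append(bit)
    return out


x = [1, 0, 1, 1, 0, 0]      # source sequence, N = 6
s = syndrome(x, CHECKS)     # compressed representation, m = 3
rate = len(s) / len(x)      # compression rate m/N
print(s, rate)              # [0, 1, 0] 0.5
```

In this toy case six source bits are represented by three syndrome bits, so m/N = 1/2; a practical LDPC code would be far longer and sparser.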
On the decoder side, the decoder uses the same bi-partite graph to decode X1X2 . . . XN from the side information Y1Y2 . . . YN and the received sequence S1S2 . . . Sm. Note that in the case where the side information Y1Y2 . . . YN is not available, we can regard Y1Y2 . . . YN as erasures at the decoder input. This convention is used throughout this document. There are many algorithms that can be used in the decoding process. One practically important low-complexity decoding algorithm is the belief propagation (BP) based iterative “message passing” decoding algorithm. BP decoding was first discussed by Robert Gallager in his dissertation “Low Density Parity Check Codes” in 1963. A collection of papers on LDPC codes and BP decoding algorithms can be found in the “Special Issue on Codes on Graphs and Iterative Algorithms,” IEEE Transactions on Information Theory, 47(2), February 2001. In the source coding with decoder side information setup, a typical decoder using BP decoding works as follows. From Y1Y2 . . . YN and the statistical correlation between Y1Y2 . . . YN and X1X2 . . . XN, the decoder gets the initial soft information about X1X2 . . . XN, that is, the likelihood that each Xi is equal to 0 or 1 in the binary case. This initial soft information is injected into the variable nodes of the bi-partite graph, and propagates along the edges in the bi-partite graph to the check nodes. At the check nodes, this information is updated according to the constraints set by the received syndrome values S1S2 . . . Sm, and sent back to the variable nodes. Combining the information from the check nodes and the initial soft information, the variable nodes get a new iteration of the soft information about X1X2 . . . XN. From this new information, the decoder can form an estimate of X1X2 . . . XN.
The above process of exchanging information between variable nodes and check nodes continues until one of the following conditions is satisfied: 1) the number of iterations exceeds a pre-set threshold; or 2) the decoder has obtained an estimated sequence whose output in the bi-partite graph is equal to S1S2 . . . Sm. If condition 2) is met and the estimated sequence is equal to X1X2 . . . XN, the decoding is successful. If decoding is successful with probability close to 1, the compression efficiency of the LDPC code is determined by the gap between the rate m/N and the theoretical limit H(X1X2 . . . XN|Y1Y2 . . . YN)/N.
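The decoding loop described above can be sketched as follows. This is a minimal sum-product sketch under stated assumptions, not a definitive implementation: the correlation between X and Y is assumed to be a binary symmetric channel with known crossover probability `p`, soft information is carried as log-likelihood ratios, unavailable side information is passed as `None` (treated as an erasure), and the graph `CHECKS` and all function names are hypothetical.

```python
import math

# Hypothetical toy graph: check node j -> 0-based indices of its variable nodes.
CHECKS = [[0, 2, 4], [1, 2, 5], [0, 3, 5]]


def syndrome(x, checks):
    """GF(2) sum of the connected inputs at each check node."""
    return [sum(x[i] for i in nb) % 2 for nb in checks]


def bp_decode(y, s, checks, p, max_iters=50):
    """Return (estimate, success) for side information y and syndrome s."""
    n = len(y)
    # Initial soft information as LLRs log(P(x=0)/P(x=1)); an erasure is 0.
    l0 = math.log((1 - p) / p)
    prior = [0.0 if yi is None else (l0 if yi == 0 else -l0) for yi in y]
    # One message per edge (check j, variable i), in each direction.
    v2c = {(j, i): prior[i] for j, nb in enumerate(checks) for i in nb}
    c2v = {edge: 0.0 for edge in v2c}
    var_checks = [[j for j, nb in enumerate(checks) if i in nb] for i in range(n)]
    x_hat = [0 if l >= 0 else 1 for l in prior]
    limit = 1 - 1e-12  # keeps atanh finite
    for _ in range(max_iters):  # stopping condition 1: iteration threshold
        # Check-node update, constrained by the received syndrome bit s[j].
        for j, nb in enumerate(checks):
            sign = 1 - 2 * s[j]
            for i in nb:
                prod = 1.0
                for k in nb:
                    if k != i:
                        prod *= math.tanh(v2c[(j, k)] / 2)
                prod = max(-limit, min(limit, prod))
                c2v[(j, i)] = sign * 2 * math.atanh(prod)
        # Variable-node update: combine check messages with the prior.
        total = [prior[i] + sum(c2v[(j, i)] for j in var_checks[i])
                 for i in range(n)]
        for (j, i) in v2c:
            v2c[(j, i)] = total[i] - c2v[(j, i)]  # exclude own contribution
        x_hat = [0 if t >= 0 else 1 for t in total]
        # Stopping condition 2: the estimate reproduces the syndrome.
        if syndrome(x_hat, checks) == list(s):
            return x_hat, True
    return x_hat, False


x = [1, 0, 1, 1, 0, 0]
s = syndrome(x, CHECKS)
y = [0, 0, 1, 1, 0, 0]  # side information differing from x in the first bit
print(bp_decode(y, s, CHECKS, p=0.1))  # ([1, 0, 1, 1, 0, 0], True)
```

Note that the success flag only reports that the estimate matches the syndrome; as the text states, decoding is successful only if that estimate also equals the original sequence, which a decoder cannot verify on its own.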
It may be observed that in the above approach, once the LDPC code, or equivalently the bi-partite graph, is fixed, the compression rate is fixed (in the above example, m/N). Therefore, if the side information changes, quite often the LDPC code has to be changed in order to maintain compression efficiency. This implies that one has to design a distinct LDPC code for each distinct design rate, which incurs significant storage complexity.
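To make the cost of a fixed rate concrete, the gap between m/N and the theoretical limit can be evaluated for a simple correlation model. The sketch below assumes, purely for illustration, that X is uniform and i.i.d. and that Y equals X passed through a binary symmetric channel with crossover probability `p`, in which case H(X1X2 . . . XN|Y1Y2 . . . YN)/N equals the binary entropy of `p`. The model, the function names, and the sample values are assumptions and do not come from the text.

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in bits; h(0) = h(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def rate_gap(m, n, p):
    """Gap between the fixed rate m/N and the limit h(p) under the assumed model."""
    return m / n - binary_entropy(p)


# A single rate-1/2 code evaluated against side information of varying quality.
for p in (0.02, 0.05, 0.10):
    print(f"p={p:.2f}  limit={binary_entropy(p):.3f}  gap={rate_gap(1, 2, p):.3f}")
```

Under this model the same rate-1/2 code sits close to the limit when `p` is near 0.10 but spends far more bits than necessary when `p` is near 0.02, which is why a fixed code is efficient only for a narrow range of side-information quality.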
When the quality of the decoder side information (evaluated by the quantity H(X1X2 . . . XN|Y1Y2 . . . YN)) is unknown, it is clearly desirable to have a code that can be easily adapted to a wide range of design rates while maintaining compression efficiency. Such a code is called a rateless code for brevity, and the property of being able to adapt to a range of design rates is called rate adaptivity. In order to achieve rate adaptivity, prior art procedures often choose to modify the code (or equivalently the bi-partite graph) in a fixed way and thus sacrifice compression efficiency. Specifically, in “Universal variable length data compression of binary sources using fountain codes,” in Proc. ITW 2004, G. Caire, S. Shamai, A. Shokrollahi, and S. Verdu used so-called fountain codes with iterative doping at the encoder side to achieve rate adaptivity. However, fountain codes are designed specifically for binary erasure channels (BEC), and might perform poorly in data compression, where the channel involved is often not a BEC. Another drawback of the fountain code approach is that the encoder uses a computationally complex procedure called iterative doping.
FIG. 2 shows a naive approach to rate adaptivity, in which a sequence X1X2 . . . XN is input to an LDPC code, which then outputs the syndrome S1 . . . Sm. In the naive approach, rate adaptivity is achieved by puncturing the syndrome sequence S1 . . . Sm. For example, if one would like to achieve rate m1/N, where m1 is less than m, the encoder simply selects m1 symbols from S1 . . . Sm as the output, and discards (“punctures”) the rest of the sequence. It is easy to see that as m1 gets farther away from m, more and more syndrome symbols in S1 . . . Sm are punctured, or equivalently, more and more check nodes in the bi-partite graph of the LDPC code become unused. Consequently, the performance of the naive approach degrades rapidly as the distance between m and m1 increases.
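The effect of puncturing on the graph can be illustrated with a short sketch. The text does not say which m1 symbols the encoder selects, so keeping the first m1 is an assumption made here for simplicity; the graph `CHECKS` and the function names are likewise hypothetical.

```python
# Hypothetical toy graph: check node j -> 0-based indices of its variable nodes.
CHECKS = [[0, 2, 4], [1, 2, 5], [0, 3, 5]]


def puncture(s, m1):
    """Keep the first m1 syndrome symbols and discard the rest (assumed choice)."""
    if not 0 <= m1 <= len(s):
        raise ValueError("m1 must lie between 0 and len(s)")
    return s[:m1]


def unconstrained_variables(kept_checks, n):
    """Variable nodes that are no longer connected to any used check node."""
    covered = {i for nb in kept_checks for i in nb}
    return [i for i in range(n) if i not in covered]


s = [0, 1, 0]  # full syndrome, m = 3, N = 6
for m1 in (3, 2, 1):
    kept = puncture(s, m1)
    loose = unconstrained_variables(CHECKS[:m1], 6)
    print(f"m1={m1}  rate={m1}/6  sent={kept}  unconstrained={loose}")
```

In this toy graph, dropping one check leaves one variable node without any constraint and dropping two leaves three, which gives an intuition for why performance falls off as m1 moves away from m.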
In “Distributed source coding using serially-concatenated-accumulate codes,” in Proc. ITW 2004, J. Chen, A. Khisti, D. M. Malioutov, and J. S. Yedidia proposed a rate adaptive scheme based on syndrome splitting using product accumulate (PA) codes and accumulate extended Hamming (e-Hamming) codes. The codes based on this scheme are semi-regular, and their degree profiles are difficult to optimize by standard techniques. FIG. 3 illustrates a procedure for achieving rate adaptivity by syndrome splitting. A similar approach was considered by D. Varodayan, A. Aaron, and B. Girod in “Rate-adaptive distributed source coding using low-density parity-check codes,” in Proc. Asilomar, Pacific Grove, Calif., 2005, where rate adaptivity is achieved by sending part of the accumulated syndromes. However, the performance of these codes is still far away from the theoretical compression limit.
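The idea of sending part of the accumulated syndromes can be illustrated as follows. This is a simplified sketch and not the exact construction of the cited paper: it assumes the accumulated syndrome is the running GF(2) sum of the syndrome bits and that the encoder transmits an evenly spaced subset of it. The function names and sample data are hypothetical.

```python
from itertools import accumulate
from operator import xor


def accumulate_syndrome(s):
    """Running GF(2) sum: a[j] = s[0] xor s[1] xor ... xor s[j]."""
    return list(accumulate(s, xor))


def run_sums(received):
    """XOR consecutive received accumulated bits.

    Each result equals the GF(2) sum of the syndrome bits lying between two
    consecutive transmitted positions, that is, a merged check constraint.
    """
    out, prev = [], 0
    for a in received:
        out.append(a ^ prev)
        prev = a
    return out


s = [0, 1, 0, 1, 1, 0]              # full syndrome, m = 6
acc = accumulate_syndrome(s)        # [0, 1, 1, 0, 1, 1]
sent = acc[1::2]                    # transmit every second accumulated bit
print(sent, run_sums(sent))         # [1, 0, 1] [1, 1, 1]
```

Here the three transmitted bits let the decoder recover S1⊕S2, S3⊕S4, and S5⊕S6, so every original check still contributes to some constraint even though only half of the symbols are sent, in contrast with plain puncturing.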
Another approach to designing rateless codes was recently proposed by A. Eckford and W. Yu in “Rateless Slepian-Wolf codes,” in Proc. Asilomar, Pacific Grove, Calif., 2005. In this approach, a dual-purpose large bi-partite graph, along with an additional small bi-partite graph, was designed as a single rateless code. The main drawback of this approach is that it becomes daunting to design for more than two coding rates.