Such compression is rendered possible by the high degree of intercorrelation and nonuniform statistical distribution associated with the symbols usually characterizing large data bases. When the source data is represented in digital form, the object of compression is to produce an output string of data having fewer data bits than the original source data string. When a statistically significant sample of source data strings is compressed, the expected length of the compressed output data string corresponds to the "entropy" of the source data string.
One way to achieve compression of source data is by means of the arithmetic coder developed by Jorma Rissanen and first published in an article entitled "Generalized Kraft Inequality Arithmetic Coding", IBM Journal of Research and Development, Vol. 20, No. 3, May 1976. The arithmetic coding procedure introduced by Rissanen permits the compression of multi-alphabet data, i.e., data each of whose symbols may be found within a multi-symbol alphabet.
In order to employ an arithmetic coding procedure, it is first necessary to determine the probability of occurrence for each symbol within the source alphabet. Typically, the probabilities will vary from one source data string to another, so that if, for example, an arithmetic coding procedure is being used to compress the data corresponding to a television image, the probabilities of occurrence for each picture element must first be determined. These probabilities may be determined in real time or, alternatively, may be predetermined statistically. However, the actual method for determining these probabilities is not a feature of the present invention, which is equally applicable regardless of how the probabilities are determined.
Having determined the probabilities of occurrence for each different symbol within the source alphabet, the cumulative probability for each symbol may then be determined. Thus, the cumulative probability S(1) for the first symbol of the data string will be equal to zero and, in general, the cumulative probability S(n) for the nth symbol will be equal to the sum of the probabilities of occurrence for each of the preceding n-1 symbols.
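As a concrete illustration, the cumulative probabilities can be computed from a probability table in a few lines (the alphabet size and probability values below are hypothetical, chosen only to show the calculation):

```python
# Compute cumulative probabilities S(1)..S(m) from per-symbol
# probabilities: S(n) is the sum of the preceding n-1 probabilities,
# so S(1) = 0 for the first symbol.
from itertools import accumulate

def cumulative_probabilities(probs):
    """Return [S(1), ..., S(m)] for probabilities [p(0), ..., p(m-1)]."""
    # Prepend 0 and drop the final total (which would be 1.0).
    return [0.0] + list(accumulate(probs))[:-1]

probs = [0.5, 0.25, 0.125, 0.125]        # hypothetical p(0)..p(3)
print(cumulative_probabilities(probs))   # [0.0, 0.5, 0.75, 0.875]
```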
Arithmetic coding procedures normally represent the output data string as a binary fraction within the unit interval (0,1). As explained by G. Langdon in "An Introduction to Arithmetic Coding", IBM Journal of Research and Development, Vol. 28, No. 2, Mar. 1984, arithmetic coding is related to the process of subdividing the unit interval. This subdivision is achieved by marking along the unit interval code points C.sub.n for each symbol within the source alphabet, each code point being equal to the sum of the probabilities of occurrence of the preceding symbols. The width or size A.sub.n of the subinterval to the right of each code point represents the probability of occurrence of the source data string up to the corresponding symbol (FIG. 1).
Consider, for example, a source data string whose alphabet comprises symbols a.sub.0 to a.sub.m, having probabilities of occurrence equal to p(0) to p(m), respectively. If the source data string is a.sub.0 a.sub.5 a.sub.3 . . . , then the first symbol a.sub.0 will be encoded within the subinterval (0,p(0)). This represents a first subinterval within the original unit interval whose width A.sub.1 is equal to p(0), corresponding simply to the probability of occurrence of symbol a.sub.0. In order to encode the second symbol a.sub.5 of the source data string, its probability of occurrence, conditional on the occurrence of symbol a.sub.0, must be determined. Furthermore, the cumulative probability S(5) associated with the second symbol a.sub.5 must be calculated. Thus, the subinterval corresponding to the second symbol a.sub.5 is a second subinterval within the first subinterval corresponding to a.sub.0. Mathematically, the width A.sub.2 of the second subinterval is equal to p(0)*p(5), i.e., the product of the probabilities of occurrence of both symbols a.sub.0 and a.sub.5. The starting point of the second subinterval within the unit interval depends on the width A.sub.1 of the first subinterval and the cumulative probability S(5) associated with the second symbol a.sub.5, being equal to their product A.sub.1 *S(5).
Thus, as each symbol of the source data string is successively encoded within the unit interval, a succession of subintervals is generated, each of which may be specified in terms of a specific code point and width. The code point for the current subinterval corresponds to the start of the current subinterval within the previous interval or subinterval. As explained above, this is determined by the cumulative probability associated with the current symbol. Thus, the offset of the nth code point from the preceding code point will be equal to the width of the (n-1)th subinterval multiplied by the cumulative probability S(n) associated with the current symbol, i.e., A.sub.n-1 *S(n). The width of the new subinterval itself will be equal to the product of the probabilities of all symbols (including the current one) so far encoded, i.e., p(0)*p(5)*p(3) . . . , for the above source data string. The data corresponding to the width A.sub.n and code point C.sub.n of the nth subinterval thus encode the first n symbols in the source data string. Arithmetic coders therefore require two memory registers, usually called the A and C registers, for storing these data.
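The register recursion described above can be sketched as a short loop; floating point arithmetic stands in here for the fixed-width A and C registers of a real coder, and the three-symbol alphabet and its probabilities are hypothetical:

```python
# Sketch of the C/A register recursion: for each symbol s,
#   C <- C + A * S(s)   (new code point)
#   A <- A * p(s)       (new subinterval width)
def encode(symbols, p, S):
    """Return the code point C and width A after encoding the given
    symbol indices; any number in [C, C + A) identifies the string."""
    C, A = 0.0, 1.0                 # start with the unit interval (0,1)
    for s in symbols:
        C += A * S[s]
        A *= p[s]
    return C, A

p = [0.5, 0.25, 0.25]               # hypothetical probabilities
S = [0.0, 0.5, 0.75]                # their cumulative probabilities
C, A = encode([0, 2, 1], p, S)      # encode the string a0 a2 a1
print(C, A)                         # 0.4375 0.03125
```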
Since the width of the subinterval is equal to a product of probabilities, two factors emerge. First, as more symbols of the source data string are encoded, the width of the subinterval defining the arithmetic code representation will decrease (since each individual probability must necessarily be smaller than 1); and, furthermore, the process of arithmetic coding requires a succession of multiplication operations for its effective implementation.
Although arithmetic coders produce optimum compression, corresponding to the entropy of the source data string, when based on the exact probabilities of occurrence of the symbols constituting the data string, prior implementations of arithmetic coding procedures have tended to introduce approximations on account of the difficulty in determining the exact probabilities. Such approximations reduce the efficiency of the arithmetic coding operation and result in the generation of an output data string having more symbols than the theoretical minimum, or entropy. Moreover, further approximations have been introduced in order to eliminate the multiplication operation required for determining the width of each successive subinterval.
Arithmetic coders are implemented on computers whose memory registers contain a finite number of bits. One problem associated with multiplication results from the fact that, since the successive multiplication of probabilities always produces smaller and smaller intervals, after only a few such multiplications the resulting subinterval may be too small to be satisfactorily stored in the computer register. For example, if each register has 16 bits and the multiplication of successive probabilities results in a number smaller than 2.sup.-16, this number will underflow the register. In other words, the register will be full of zeros, the significant bits of the probability product being lost. A further problem associated with successive multiplication operations is the time taken for these to be implemented.
The first of the above drawbacks has been solved using a technique called normalization, whereby the probability product is stored in floating point notation. In order to do this, a further bit register is employed for storing the exponent (to base 2) for the width of the subinterval when the most significant 1 of the binary fraction is shifted to the extreme leftmost position. Thus, the binary fraction 1.0101.times.2.sup.-20, which clearly cannot be stored in a 16-bit register, may satisfactorily be stored as 1.0101E-20, wherein the mantissa and the exponent are stored in separate registers. Since the most significant bit of the mantissa is thus arranged always to be 1, the actual number stored in the mantissa register will always be greater than or equal to 1.0.
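A minimal sketch of this normalization, with a Python integer standing in for the 16-bit register and a separate exponent count (the register width and the sample value are illustrative):

```python
# Normalize a register value: shift the mantissa left until its most
# significant bit is 1, decrementing the base-2 exponent once per shift
# so that the represented quantity mantissa * 2**exponent is preserved.
REG_BITS = 16
TOP_BIT = 1 << (REG_BITS - 1)
REG_MASK = (1 << REG_BITS) - 1

def normalize(mantissa, exponent):
    """Left-justify the mantissa within the register, tracking the
    shift count in the exponent register."""
    while mantissa and not (mantissa & TOP_BIT):
        mantissa = (mantissa << 1) & REG_MASK
        exponent -= 1
    return mantissa, exponent

# A tiny width whose significant bits would otherwise be lost:
m, e = normalize(0b10101, -16)
print(bin(m), e)   # 0b1010100000000000 -27
```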
The second problem described above, related to the time taken to perform multiplication operations, has to some extent been solved by the implementation of so-called "multiplication-free" arithmetic coders. In computers which employ binary arithmetic, multiplication is implemented as a series of SHIFT and ADD operations. The term "multiplication-free" has variously been used in the prior art to imply either a single SHIFT or a single SHIFT and ADD operation for each coding step. Strictly speaking, even a single SHIFT operation mathematically constitutes multiplication. However, it is many orders of magnitude less time consuming than the multiplication associated with precise implementations of arithmetic coders, which involves multiple SHIFT and ADD operations. Thus, "multiplication-free" denotes that the multiplication operations are significantly simplified or reduced, and it is in this sense that the term is used in the prior art and in the present invention.
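To make the cost difference concrete, the following sketch replaces a probability by its nearest power of two, so that the multiply A*p collapses into a single SHIFT. This illustrates the general idea only, and is not the scheme of any particular prior-art coder:

```python
# Approximating a multiply with one SHIFT: if p is replaced by 2**-k,
# then A * p becomes A >> k.
import math

def shift_only_multiply(A, p):
    """Approximate A * p (0 < p < 1) with a single right shift,
    using the closest power-of-two approximation to p."""
    k = round(-math.log2(p))   # choose k so that p ~ 2**-k
    return A >> k              # one SHIFT replaces the full multiply

A = 0b1100000000000000                   # a 16-bit width register value
print(shift_only_multiply(A, 0.23))      # 0.23 ~ 2**-2: shift right by 2
```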
Langdon et al., U.S. Pat. No. 4,286,256, "Method and Means for Arithmetic Coding Utilizing a Reduced Number of Operations", issued Aug. 25, 1981, simplify the multiplication operation by truncating one of the inner products corresponding to the width of the subinterval prior to encoding the current codepoint. However, Langdon's method is suitable only for binary sources (i.e., alphabets containing only two symbols), wherein it is possible to encode each symbol of the source data string either as a more probable or less probable event. This procedure is unsuitable for multi-alphabet codes.
Mohiuddin et al., U.S. Pat. No. 4,652,856, "Multiplication-free Multi-alphabet Arithmetic Code", issued Mar. 24, 1987, disclose an arithmetic code in which each subinterval is stored in floating point format, as explained above, such that the mantissa stored within the A register is a binary fraction greater than or equal to 0.1 (binary). In accordance with the approximation proposed by Mohiuddin, a variable criterion is adopted which either truncates the mantissa of the subinterval to exactly 0.1 (binary) or, alternatively, rounds it up to 1.0. Such an approximation still achieves the desired compression, but at a loss of efficiency. In other words, more bits than the minimum are required for representing the compressed data string. The inefficiency associated with Mohiuddin's procedure depends on the nature of the source data being compressed.
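The truncate-or-round idea can be sketched as follows; note that the midpoint threshold used here is an assumption adopted for illustration and is not claimed to be Mohiuddin's exact variable criterion:

```python
def snap_mantissa(mantissa):
    """Snap a normalized mantissa in [0.5, 1.0) -- i.e. a binary
    fraction in [0.1, 1.0) (binary) -- to either 0.5 or 1.0, so that
    multiplying by the A register degenerates into a single shift or
    a no-op.  The 0.75 midpoint test is a hypothetical criterion."""
    return 0.5 if mantissa < 0.75 else 1.0

print(snap_mantissa(0.6), snap_mantissa(0.9))   # 0.5 1.0
```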
The major drawback with Mohiuddin's approach results from the fact that rounding up the contents of the A register approximates the probability of the corresponding symbol to a value greater than its actual value. In order to ensure that the sum of the probabilities for all symbols in the alphabet cannot exceed 1.0, Mohiuddin approximates the last subinterval to ##EQU1##
While this approximation ensures that the sum of the probabilities of all symbols in the alphabet is equal to 1, it can achieve this at the expense of rendering the last subinterval so small that the coding is highly inefficient.