This invention concerns accurate modeling of a symbol source in a single pass in order to permit "on the fly" compression coding. The term "modeling" signifies the data processing measures necessary to obtain a profile of the source and to assist compression.
Langdon and Rissanen, "Compression of Black-White Images With Arithmetic Coding", IEEE Transactions on Communications, Vol. 29, No. 6, June 1981, describe a compression system having separate model and code units. The model approximates the statistical characteristics of the symbol source. Each symbol is simultaneously applied to the model and encoding units. Responsively, the model conditions the encoder with respect to encoding one or more subsequent symbols. Advantageously, a model of a source emulates a reduced number of symbols and strings and brings about an economy of internal states and memory size. This is brought about by "modeling" only the most popular symbols. Further, rarely occurring symbols can of course be transmitted in the clear without substantially affecting compression.
A model comprises a finite state machine (FSM) and statistics of the symbol source. An encoder is also a finite state machine, as described, for example, in Rissanen and Langdon, "Arithmetic Coding", IBM Journal of Research and Development, Vol. 23, No. 2, March 1979, at pp. 149-162. Together, the model and encoding units execute a coding function. In this regard, a coding function maps each string in a source alphabet to a counterpart string in a code alphabet. To reduce computational complexity, an arbitrarily long source string is not mapped to its image in the set of code strings in a single step. Rather, most coding functions are recursive in that the next function value depends upon the instantaneous function value, the source symbol, and other attributes. Typically, a function consists of a series of operations applied to each successive symbol of the source string from left to right. In order to be physically realizable, recursive functions must have a finite amount of memory.
Encoding and decoding functions are both performed by FSMs. For each encoding or decoding operation, the FSM accepts an input, delivers an output, and changes its internal state. The coder accepts a string of source symbols, one at a time, and performs an invertible transformation into a code string. Since the model and the coder are distinct FSMs, the state of the model is distinguished from the state of the encoder.
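The accept-input, deliver-output, change-state cycle described above can be illustrated with a minimal sketch. The class below is a hypothetical toy, not the arithmetic coder of the cited references: it assumes the state is simply the previous source byte and the output is the exclusive-OR of the input with that state. It achieves no compression, and is chosen only because the transformation is plainly invertible and uses a finite amount of memory.

```python
class XorDeltaCoder:
    """Toy finite state machine; the state is the previous source byte (an assumption)."""

    def __init__(self) -> None:
        self.state = 0  # initial state shared by encoder and decoder

    def encode_step(self, symbol: int) -> int:
        out = symbol ^ self.state  # deliver an output from input and state
        self.state = symbol        # change the internal state
        return out

    def decode_step(self, code: int) -> int:
        symbol = code ^ self.state  # invert the encoder's output
        self.state = symbol         # track the same state sequence as the encoder
        return symbol


def encode(data: bytes) -> bytes:
    """Apply the FSM to each source symbol, left to right."""
    fsm = XorDeltaCoder()
    return bytes(fsm.encode_step(b) for b in data)


def decode(code: bytes) -> bytes:
    """Recover the source string by running the inverse FSM."""
    fsm = XorDeltaCoder()
    return bytes(fsm.decode_step(c) for c in code)
```

Because the decoder reconstructs each source symbol before updating its state, it follows the same state sequence as the encoder, which is what makes the recursion invertible.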
In the design of a lossless data compression system, the source is initially "modeled" and then a code is devised for the modeled source. Illustratively, consider a natural alphabet in which strings are written as eight-bit binary symbols or "bytes". Rarely do all 256 possible values of an eight-bit byte appear in any given sequence. Depending upon the symbol source, only a minority, on the order of 40 to 80 characters, commonly occur. This permits an economy of internal states and memory size.
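As a rough illustration of how the portion of the byte alphabet actually in use might be measured for a given sample, the following sketch tallies the distinct byte values. The function name `alphabet_usage` is hypothetical and not part of any cited system.

```python
from collections import Counter


def alphabet_usage(data: bytes):
    """Return (number of distinct byte values, list of (value, count) pairs by frequency)."""
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    return len(counts), counts.most_common()
```

For the sample `b"abracadabra"`, the first element of the result is 5, since only the byte values for a, b, c, d and r occur out of the 256 possible.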
If the probability of generating any source symbol were completely independent of any previous source symbol, then the source is said to be a zero order MARKOV (memoryless) source. However, a memoryless source is rare, and more frequently the source symbols exercise an intersymbol influence. This is reflected in sets of conditional events and probabilities. A more general information source with q distinguishable symbols is one in which the occurrence of the source symbol is affected by a finite number m of preceding symbols. Such a source is termed an mth order MARKOV source. For an mth order MARKOV source, the conditional probability of emitting a given symbol is determined by the m preceding symbols. At any one time, therefore, the m preceding symbols define the state of the mth order MARKOV source at that time. Since there are q possible symbols, an mth order MARKOV source will have q^m possible states. As symbols are emitted from the source, the state changes. Thus, for a 256 symbol alphabet, a second order MARKOV source or model thereof would require 256^2 = 65,536 states. For this reason, higher order MARKOV modeling is frequently impracticable.
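The growth in the number of states can be checked with a short sketch. The helper names `markov_state_count` and `states_visited` are hypothetical; the second function assumes the state is represented as a tuple of the m most recently emitted symbols, as the paragraph above describes.

```python
from collections import deque


def markov_state_count(q: int, m: int) -> int:
    """Number of possible states of an mth order source over q symbols."""
    return q ** m


def states_visited(symbols, m: int):
    """List the state (the m preceding symbols) in effect as each symbol is emitted."""
    window = deque(maxlen=m)  # holds at most the m most recent symbols
    states = []
    for s in symbols:
        if len(window) == m:  # a full state exists only after m symbols
            states.append(tuple(window))
        window.append(s)      # emitting s changes the state
    return states
```

With these definitions, `markov_state_count(256, 2)` evaluates to 65536 and `markov_state_count(256, 3)` to 16777216, which makes the impracticability of higher orders concrete.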
The aforementioned Langdon, et al., reference describing black-white image compression discloses a two stage modeling in which the first stage is a neighborhood template for identifying the context, which in turn identifies the conditional probability distribution of the next pel to be encoded given the instantaneous pel. The notion of "context" or conditioning class is a generalization of "state". The actual statistics or counts determining the distribution in either case are derived on the fly from the input pel stream. That is, there exists a fixed model of the symbol source conditioning class or context, and an adaptive modeling of the conditional probability distribution within each context, both within one pass of the data.
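The combination of a fixed context rule with adaptive per-context counts may be sketched as follows. This is a simplified illustration and not the model of the cited reference: it assumes the context is formed from the preceding pels of a one-dimensional stream rather than from a two-dimensional neighborhood template, and it assumes every count starts at 1 so that no symbol has zero probability. The names `AdaptiveContextModel` and `single_pass` are hypothetical.

```python
from collections import defaultdict


class AdaptiveContextModel:
    """Per-context symbol counts gathered on the fly (initial count of 1 is an assumption)."""

    def __init__(self, alphabet_size: int = 2) -> None:
        self.counts = defaultdict(lambda: [1] * alphabet_size)

    def probability(self, context, symbol: int) -> float:
        c = self.counts[context]
        return c[symbol] / sum(c)  # estimate from the counts seen so far

    def update(self, context, symbol: int) -> None:
        self.counts[context][symbol] += 1  # adapt after the symbol is coded


def single_pass(pels, order: int = 2):
    """Return the probability the model assigned to each pel before seeing it."""
    model = AdaptiveContextModel(2)
    context = (0,) * order  # fixed rule: context is the last `order` pels
    probs = []
    for pel in pels:
        probs.append(model.probability(context, pel))  # would condition the encoder
        model.update(context, pel)
        context = (context + (pel,))[1:]  # slide the context window
    return probs
```

Since each probability is computed before the corresponding count is updated, a decoder running the same model could reproduce the estimates from the pels it has already decoded, so a single pass suffices.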
In the past, first order MARKOV model requirements have been managed in diverse ways. For instance, Mommens and Raviv, "Coding For Data Compaction", IBM Report RC5150, Nov. 26, 1974, describe the use of a first order MARKOV model for decomposing a higher order character stream into a multiplicity of lower order character streams. This involved multiple passes. The first pass ascertained the conditional probabilities of a full first order MARKOV model of the symbol stream. The number of states was reduced by forming equivalence classes having approximately equal conditional probability distributions. A compression code was then formed for each equivalence class. A second pass assigning the code to the symbols was required. Lastly, Arnold, et al., U.S. Pat. No. 4,099,257, used a fixed partial first order MARKOV FSM to context encode characters common to two alphabets. For instance, if a t was always lower case, it would be encoded as an upper case t if the symbol preceding it was a ".".