1. Field of the Invention
The present invention generally relates to data coding methods and devices, and particularly relates to a method and a device which encode input data according to a variable-length coding scheme based on the frequency of data-series occurrence in past data, and decode coded data based on a history of past decoded data.
2. Description of the Related Art
In recent years, data used in computers has come to include a wider variety of data types such as character codes, vector information, images, etc. With this widening variety of data types, a rapid increase has been seen in the amount of data treated in computers.
When a large amount of data is treated, redundancies included in the data are eliminated as much as possible to compress the data, which leads to a reduction in memory volumes and an increase in data-transmission speed. One method used for such data compression is the Universal Coding method, which can compress any type of data. Like the Universal Coding method, the present invention can be applied not only to the compression of character codes but also to the compression of a wide variety of data. Hereinafter, terminology used in information theory is employed, so that a character refers to a unit of one word in data, and a character string refers to a series of words.
Methods of compressing text data, files, etc. include a dictionary-type coding method which utilizes analogies in data series, and a statistical-type coding method which utilizes the occurrence frequency of data series. A typical method for the statistical-type coding is an arithmetic coding method. This method is regarded as being capable of coding data with the maximum efficiency when the occurrence frequency for data from an information source is known. This method does not encode each character one by one as does the Huffman coding method, but treats a character string in its entirety to improve the efficiency of data compression. The term "arithmetic coding" comes from the fact that coded words are derived through calculation and are represented by binary numbers which include fractional digits.
These coding (decoding) methods require a large number of steps, so that processing time is long and difficult to shorten. Thus, it is required to simplify the steps and make the processing faster.
In general, there are two types of the arithmetic coding method, i.e., a binary arithmetic coding method and a multi-level (more than two) arithmetic coding method. For details of the binary arithmetic coding method and the multi-level arithmetic coding method, reference may be made to "Arithmetic Coding for Data Compression" by Ian H. Witten et al. (Commun. of ACM, Vol. 30, No. 6, pp. 520-540) and "An Adaptive Dependency Source Model for Data Compression Scheme" by D. M. Abrahamson (Commun. of ACM, Vol. 32, No. 1, pp. 77-83).
An example of the multi-level arithmetic coding scheme is shown in FIGS. 1A and 1B. As shown in FIG. 1B, when events of characters (hereinafter called symbols) are to be encoded, a range P (0 ≤ P < 1) (hereinafter referred to as [0,1)) is divided into as many sub-ranges as there are symbols. Here, the length of the sub-range which corresponds to one of the symbols is in proportion to the frequency of occurrence of that symbol. The frequency of occurrence is shown in FIG. 1A for symbols "a", "b", "c", "d", "e".
Imagine that an incoming character string is "abe". First, the range corresponding to the first symbol in that character string is selected. Then, the selected range is divided into sub-ranges, one for each of the symbols, in the same manner as before. Here, the lengths of the newly divided ranges are also proportional to the frequency of occurrence of each symbol.
Further, the range corresponding to the second symbol in the character string is selected from the selected range corresponding to the first symbol. Finally, the range corresponding to the third symbol in the character string is selected from the selected range corresponding to the second symbol. The finally selected range represents the character string "abe". These divisions and selections of the ranges, as well as the finally selected range, are shown in FIG. 1B.
Compressed codes for the character string "abe" are binary codes which represent an arbitrary point within the finally selected range. Through the same procedure, compressed codes for any character string can be obtained.
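The selection of ranges described above can be sketched as follows. This is a minimal illustration and not the coding device itself. The frequency values are hypothetical, since the values of FIG. 1A are not reproduced in the text, and the names narrow_range and shortest_binary_point are introduced only for this sketch. Exact fractions are used so that no rounding is involved.

```python
import math
from fractions import Fraction


def narrow_range(message, freq):
    """Return the final range [low, high) which represents `message`.

    `freq` maps each symbol to its frequency; the order of its keys fixes
    the order in which the sub-ranges are laid out inside a range.
    """
    total = sum(freq.values())
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        # Sum of the frequencies of the symbols laid out before `symbol`.
        below = 0
        for s, f in freq.items():
            if s == symbol:
                break
            below += f
        # Select the sub-range of `symbol` inside the current range.
        low += width * Fraction(below, total)
        width *= Fraction(freq[symbol], total)
    return low, low + width


def shortest_binary_point(low, high):
    """Return the fractional bits of a shortest binary point in [low, high)."""
    k = 0
    while True:
        k += 1
        n = math.ceil(low * 2**k)  # smallest multiple of 2**-k that is >= low
        if Fraction(n, 2**k) < high:
            return format(n, "b").zfill(k)


# Hypothetical frequencies (assumption; not the values of FIG. 1A).
freq = {"a": 4, "b": 2, "c": 2, "d": 1, "e": 1}
low, high = narrow_range("abe", freq)
```

With these assumed frequencies, the character string "abe" is represented by the range [29/125, 6/25), and any binary point inside that range may serve as the compressed code.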
Methods of dividing ranges include a static coding method which divides ranges according to a frequency of occurrence prepared in advance. In this case, the frequency of occurrence used for dividing ranges does not reflect the actual frequency of occurrence in the incoming character string. The methods of dividing ranges also include a semi-adaptive coding method which divides ranges according to a frequency of occurrence derived by scanning the entire character string before coding. Furthermore, the methods of dividing ranges include an adaptive coding method which updates ranges by recalculating the frequency of occurrence for each incoming symbol. The present invention is categorized as one of the adaptive coding methods, and can encode data in one pass.
In the static or semi-adaptive coding methods, all the symbols appearing in incoming data are registered in a context tree as shown in FIG. 1C, and the range is divided beforehand by these symbols. In the adaptive coding method, however, symbols appearing in the incoming data are registered in the context tree as the coding of the data proceeds.
In the adaptive coding method, a code representing "not registered" (hereinafter referred to as ESC or an escape code) is listed in the context tree. When an incoming symbol is not registered in the context tree, ESC and raw data are output. Then, the symbol is added to the context tree, frequency of occurrence is recalculated, and ranges are newly divided according to the recalculated frequency of occurrence. In order to decode the codes, a range represented by the codes is obtained to reconstruct the symbol.
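The handling of an unregistered symbol can be sketched as follows, under stated assumptions. The class AdaptiveModel and its method encode are hypothetical names, a zero-order context is assumed for brevity, and the frequency of ESC is kept fixed at 1, which is a simplification not taken from the text. The sketch only lists what would be handed to the arithmetic coder together with the frequency and the total used for dividing the range; it does not perform the arithmetic coding itself.

```python
ESC = "<ESC>"  # stands for the "not registered" code


class AdaptiveModel:
    """Zero-order adaptive model with an escape code (illustrative only)."""

    def __init__(self):
        # Only ESC is registered at the start (assumption: fixed count of 1).
        self.freq = {ESC: 1}

    def encode(self, symbol):
        """Return the events for `symbol`, then update the model."""
        total = sum(self.freq.values())
        if symbol in self.freq:
            # Registered symbol: code it with its current frequency.
            events = [("coded", symbol, self.freq[symbol], total)]
            self.freq[symbol] += 1
        else:
            # Unregistered symbol: output ESC followed by the raw data,
            # then register the symbol so that ranges are newly divided.
            events = [("coded", ESC, self.freq[ESC], total), ("raw", symbol)]
            self.freq[symbol] = 1
        return events
```

A decoder which applies the same updates after each decoded symbol keeps an identical model, so that no frequency table needs to be transmitted.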
In order to further enhance the compression rate, a dependency of an incoming symbol on preceding symbols is taken into account. In this case, conditional probability of occurrence is coded through a dynamic variable-length coding scheme by taking into consideration an inter-symbol dependency (dependency of an incoming symbol on preceding symbols).
FIG. 2 is a block diagram of a device which carries out the statistical-type coding method by incorporating the inter-symbol dependency. Input data is supplied to a context collecting unit 300. The context collecting unit 300 collects contexts representing inter-symbol relationships in character strings of the input data, and obtains conditional probabilities. The conditional probabilities obtained by the context collecting unit 300 are supplied to a variable-length coding unit 301. The variable-length coding unit 301 encodes the conditional probabilities through the variable-length coding scheme.
FIGS. 3A and 3B are illustrative drawings for explaining the context tree when the inter-symbol dependency is taken into account. In this case, the context tree becomes that of one or more orders in contrast to that of zero order shown in FIG. 1C. Here, the term "order" refers to the number of symbols in a context.
As shown in FIG. 3A, a character string including "abc" is supplied. "c" is the symbol to be coded, and " . . . ab" is the context for coding the symbol "c". Such a context is represented by the context tree shown in FIG. 3B. As shown in FIG. 3B, the context is represented by a tree having a root and branches which represent each symbol appearing in the character string. In FIG. 3B, the symbol "c" to be coded is encircled. A branch above the symbol "c" to be coded is a preceding symbol "b", and another branch above the symbol "b" is a symbol "a" preceding the symbol "b". In this manner, the context " . . . ab" and the symbol "c" to be coded can be represented in the context tree. Here, each time a symbol is accessed, its number of occurrences is counted up. Then, the conditional probabilities are calculated based on the number of occurrences of each symbol.
Methods of collecting contexts for calculating the conditional probability include a method of using contexts of a fixed order and a method of using blending contexts.
In the method of using contexts of a fixed order, the number of symbols serving as a condition for the conditional probability is fixed. For example, when the order of a context is two, the two symbols preceding a symbol to be coded are considered. Then, contexts comprising these two symbols and the one following symbol are collected to calculate a conditional probability p(y | x_1, x_2). Here, y denotes the symbol to be coded, and x_1 and x_2 denote the two immediately preceding symbols.
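The collection of fixed-order contexts can be sketched as follows. The function names and the sample string are assumptions made for illustration, and a flat table indexed by the context is used in place of the context tree of FIG. 3B. The whole string is counted at once for brevity, whereas an adaptive coder would update the counts as coding proceeds.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def collect_contexts(data, order=2):
    """Count how often each symbol follows each context of length `order`."""
    counts = defaultdict(Counter)
    for i in range(order, len(data)):
        context = data[i - order:i]      # the `order` preceding symbols
        counts[context][data[i]] += 1    # the symbol which follows them
    return counts


def conditional_probability(counts, context, symbol):
    """Estimate p(symbol | context); None if the context was never seen."""
    seen = counts.get(context)
    if not seen:
        return None
    return Fraction(seen[symbol], sum(seen.values()))


counts = collect_contexts("abcabdabc", order=2)
p_c = conditional_probability(counts, "ab", "c")
```

In the sample string, the context "ab" is followed twice by "c" and once by "d", so that the estimate of p(c | a, b) is 2/3.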
In the method of using blending contexts, the number of preceding symbols to be taken into consideration is not fixed. When a particular context appears frequently, the order is raised to lengthen the context. Otherwise, the order is kept at a small number. In general, when a particular set of preceding conditional symbols does not frequently appear, an estimation of the conditional probability using that particular set tends to become inaccurate. On the other hand, when a particular set of preceding conditional symbols frequently appears, an estimation of the conditional probability is accurate, leaving room for raising the order. Also, in general, when correlations between symbols are large, a higher compression rate can be obtained by using a higher order of contexts. On the other hand, when correlations between symbols are small, using a higher order of contexts results in a lower compression rate. In the blending contexts, the order of the contexts is raised by adapting to the input data, so that a higher compression rate is obtained than in the case of the fixed-order contexts.
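One possible way of adapting the order to the input data is sketched below. The rule used here, namely choosing the longest context that has been observed at least min_count times, is an assumption made for illustration and is not taken from the text. The names count_contexts, choose_context, and min_count are likewise hypothetical.

```python
from collections import Counter, defaultdict


def count_contexts(data, max_order):
    """counts[order][context][symbol] = number of occurrences."""
    counts = {order: defaultdict(Counter) for order in range(1, max_order + 1)}
    for order in range(1, max_order + 1):
        for i in range(order, len(data)):
            counts[order][data[i - order:i]][data[i]] += 1
    return counts


def choose_context(counts, history, max_order, min_count):
    """Return the longest suffix of `history` seen at least `min_count` times.

    An empty string (order zero) is returned when no context qualifies.
    """
    for order in range(max_order, 0, -1):
        if len(history) < order:
            continue
        context = history[-order:]
        seen = counts[order].get(context)
        if seen is not None and sum(seen.values()) >= min_count:
            return context
    return ""


counts = count_contexts("abcabdabc", max_order=2)
```

In the sample string, the context "ab" has been observed three times and is used at order two, whereas the rarely observed context "da" falls back to the order-one context "a".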
In the multi-level arithmetic coding method described above, ranges are divided according to frequency of occurrence. Thus, a dictionary as shown in FIG. 4A is necessary for counting and recording an occurrence of each symbol.
FIG. 4A shows an example of a dictionary with no inter-symbol dependency for simplicity of explanation. When the inter-symbol dependency is taken into consideration, each symbol is categorized under each context.
In the dictionary of FIG. 4A, frequency and cumulative frequency are assigned to each symbol. Symbols are arranged from the top of the list in descending order of frequency. The cumulative frequency of a given symbol is a sum of the frequency from the rarest symbol at the bottom of the list to the given symbol, and is used when ranges are divided.
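The relationship between the frequency, the cumulative frequency, and the divided ranges can be sketched as follows. The frequency values are hypothetical and are not those of FIG. 4A. The sketch assumes that the cumulative frequency of a symbol includes the frequency of that symbol itself, which is one reading of the description above.

```python
from fractions import Fraction


def cumulative_frequencies(freq):
    """Cumulative frequency of each entry, summed from the bottom of the list.

    `freq` is listed from the top in descending order of frequency.
    """
    cum = [0] * len(freq)
    running = 0
    for i in range(len(freq) - 1, -1, -1):
        running += freq[i]
        cum[i] = running
    return cum


def symbol_range(freq, cum, i):
    """Sub-range [low, high) of [0,1) assigned to the entry at index i."""
    total = cum[0]  # the top entry accumulates every frequency
    return Fraction(cum[i] - freq[i], total), Fraction(cum[i], total)


freq = [5, 3, 3, 2, 1]               # hypothetical, descending order
cum = cumulative_frequencies(freq)   # [14, 9, 6, 3, 1]
```

Under this assumption, the second entry from the top is assigned the range [3/7, 9/14), whose length 3/14 is proportional to its frequency.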
When a given symbol appears in the input data, the list in the dictionary should be rearranged according to the updated frequency of occurrence after the encoding of the given symbol. For example, when a symbol A is encoded in FIG. 4A and the frequency of occurrence for the symbol A is incremented, the list should be rearranged as shown in FIG. 4B. This is done by exchanging the symbol A with a symbol N, which is located at the top of symbols having the same frequency as the post-updated frequency of the symbol A.
When the frequency of the symbol A is incremented by 1 as shown in FIG. 4B, cumulative frequency for all symbols listed above the symbol A has to be changed. This is done by incrementing the cumulative frequency of these symbols by 1, as shown in FIG. 4C.
FIG. 5 shows a block diagram of a related-art coding device which carries out such an updating process of rearranging the dictionary. FIG. 6 shows a flowchart of the updating process.
The coding device includes a context-tree holding unit 1, a coding unit 2, a cumulative-frequency holding unit 3, a frequency holding unit 4, a sorting unit 5, a frequency updating unit 6, a cumulative-frequency updating unit 7, and an update controlling unit 8.
The context-tree holding unit 1 stores a context tree which is generated based on input data of the past. The coding unit 2 receives symbols and encodes them based on the context tree stored in the context-tree holding unit 1. The cumulative-frequency holding unit 3 stores cumulative frequency of each symbol which makes up the context tree. The frequency holding unit 4 stores frequency of occurrence for each symbol. The sorting unit 5 reads the frequency of occurrence for each symbol from the frequency holding unit 4, and reshapes the context tree according to the frequency of occurrence. The frequency updating unit 6 updates the frequency of occurrence stored in the frequency holding unit 4. The cumulative-frequency updating unit 7 updates the cumulative frequency stored in the cumulative-frequency holding unit 3. The update controlling unit 8 controls the sorting unit 5, the frequency updating unit 6, and the cumulative-frequency updating unit 7 to carry out a respective updating process when a symbol is supplied to the coding unit 2.
In FIG. 6, at a step S1, a symbol K is coded by the coding unit 2. Arrival of the symbol K is detected by the update controlling unit 8, which then controls the sorting unit 5, the frequency updating unit 6, and the cumulative-frequency updating unit 7. At a step S2, the sorting unit 5 searches for a symbol N which is located at the top of the list among symbols having the same frequency as that of the symbol K. At a step S3, the symbol K is exchanged with the symbol N. At a step S4, the frequency updating unit 6 increments the frequency of the symbol K by 1.
At a step S5, the cumulative-frequency updating unit 7 shifts a pointer for pointing to a symbol such that the pointer points to an immediately upper symbol. At a step S6, the cumulative-frequency updating unit 7 increments by 1 cumulative frequency of a symbol pointed to by the pointer.
At a step S7, the update controlling unit 8 checks whether the symbol pointed to by the pointer is at the top of the list. If it is, the procedure ends. If it is not, the procedure goes back to the step S5, and repeats the steps S5, S6, and S7.
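The updating process of the steps S2 to S7 can be sketched as follows. The dictionary is assumed to be held as three parallel lists, and the values are hypothetical. The sketch further assumes that the cumulative frequency of a symbol includes the frequency of that symbol itself, so that the entry of the symbol K is incremented together with the entries above it; this starting position of the pointer is an assumption and is not stated in the text.

```python
def update_dictionary(symbols, freq, cum, k):
    """Update the dictionary in place after the symbol `k` has been coded.

    `symbols`, `freq`, and `cum` are parallel lists, listed from the top
    in descending order of frequency.
    """
    i = symbols.index(k)

    # S2: find the topmost entry having the same frequency as `k`.
    top = i
    while top > 0 and freq[top - 1] == freq[i]:
        top -= 1

    # S3: exchange `k` with that entry. Both frequencies are equal, so
    # only the symbols themselves need to move.
    symbols[i], symbols[top] = symbols[top], symbols[i]

    # S4: increment the frequency of `k`.
    freq[top] += 1

    # S5 to S7: move a pointer upward one entry at a time, incrementing
    # each cumulative frequency, until the top of the list is reached.
    for pointer in range(top, -1, -1):
        cum[pointer] += 1


# Hypothetical dictionary; "A" is the symbol which has just been coded.
symbols = ["X", "N", "B", "A", "C"]
freq = [5, 3, 3, 3, 1]
cum = [15, 10, 7, 4, 1]
update_dictionary(symbols, freq, cum, "A")
```

After the update, the symbol "A" occupies the former position of the symbol "N" with a frequency of 4, and the list remains in descending order of frequency. Both the search of the step S2 and the loop of the steps S5 to S7 may visit a number of entries proportional to the length of the list for every coded symbol.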
In this manner, the frequency and the cumulative frequency are updated through the processes of the steps S3 to S7 each time a symbol is coded.
Namely, in the coding/decoding methods of the related art, each time a symbol is supplied, the frequency of occurrence and the cumulative frequency are updated, and symbols are sorted to be arranged in a descending order of frequency. Thus, a long processing time is required for treating a large number of symbols. This leads to a lower processing speed.
Accordingly, there is a need in the field of coding/decoding schemes for a coding/decoding method and a device which have a faster processing speed.