1. Field of the Invention
The present invention relates to a data compression system and data restoration system that adopts probability statistical coding such as arithmetic coding in which data such as character codes or images is encoded byte by byte. More particularly, this invention is concerned with a data compression system and data restoration system enjoying improved processing performance due to pipelined processing.
2. Description of the Related Art
In recent years, various kinds of data such as character codes, vector information, and images have come to be handled by computers, and the amount of data to be handled is increasing rapidly. To handle a large amount of data, redundancies in the data are eliminated to reduce the amount of data. Consequently, the required storage capacity can be reduced and data can be transmitted to remote places.
One method of compressing various kinds of data with a single algorithm is universal coding. Universal coding encompasses various techniques, of which probability statistical coding such as arithmetic coding is a typical one. The present invention can be adapted to various kinds of data and is not limited to character codes. Hereinafter, following the terminology of information theory, a unit of data or one word shall be referred to as a character, and data composed of any number of consecutive words shall be referred to as a character string.
Universal coding has two typical techniques: dictionary coding and statistical coding.
1. Dictionary coding or Lempel-Ziv coding
Typical technique: LZW, LZSS
2. Statistical coding
Typical technique: multi-value arithmetic coding, dynamic Huffman coding
In the dictionary coding, past character strings (whose length is variable) are registered in a table called a dictionary, and a subsequent character string is encoded according to information (whose code length is constant) on the location in the dictionary of the longest matching character string registered there. This technique is based on variable-fixed (VF) coding, in which a long variable-length character string of, for example, several tens of characters is encoded according to fixed-length information of, for example, 12 bits.
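The variable-fixed idea described above can be illustrated with a minimal LZW-style encoder. This is only a sketch under stated assumptions: the function name lzw_encode, the 12-bit default code width, and the policy of simply stopping dictionary growth once the table is full are illustrative choices, not details taken from the cited literature.

```python
from typing import List


def lzw_encode(data: bytes, code_bits: int = 12) -> List[int]:
    """Encode bytes as a list of fixed-width dictionary indices (LZW-style sketch)."""
    max_entries = 1 << code_bits  # a 12-bit code can address 4096 entries
    # Assumption: the dictionary starts with all 256 single-byte strings.
    dictionary = {bytes([i]): i for i in range(256)}
    codes: List[int] = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            # Keep extending the match: longer strings map to one code.
            current = candidate
        else:
            # Emit the index of the longest registered string.
            codes.append(dictionary[current])
            # Register the new string while the table still has room.
            if len(dictionary) < max_entries:
                dictionary[candidate] = len(dictionary)
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


codes = lzw_encode(b"abababab")
print(codes)
```

Each emitted index has the same width regardless of how many input characters it stands for, which is the variable-to-fixed property the paragraph describes.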
For the details of dictionary coding algorithms, refer to Section 8, Dictionary Techniques, in "Text Compression" (Prentice-Hall, 1990) written by T. C. Bell et al. (1). By contrast, in the probability statistical coding, the probability of occurrence of each past individual character (since one character is concerned, the length is fixed), including a conditional probability subordinate to an immediately preceding character string, is calculated, and a succeeding character is encoded according to statistical (entropy) information (whose code length is variable) reflecting the calculated probability of occurrence. This technique is based on fixed-variable (FV) coding, in which characters (with fixed lengths) are encoded one by one according to statistical information (of variable length) reflecting their probabilities of occurrence.
For the details of probability statistical coding algorithms, refer to Section 5, From Probabilities to Bits, in "Text Compression" (Prentice-Hall, 1990) written by T. C. Bell et al. (2). Typical statistical coding techniques include Huffman coding and arithmetic coding.
For the details of context modeling for obtaining the subordinate relationship to an immediately preceding character string, refer to Section 6, Context Modeling, in "Text Compression" (Prentice-Hall, 1990) written by T. C. Bell et al. (3). Herein, the subordinate relationship to an immediately preceding character string is expressed with several characters at most, whereas an infinitely long character string is used in the dictionary coding.
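Consequently, the dictionary coding and probability statistical coding differ from each other in the following points:
1. the dictionary coding handles past data in the form of a character string itself, while the probability statistical coding handles it in the form of a probability of occurrence; and
2. the dictionary coding handles fixed-length data as an object of encoding, while the probability statistical coding handles variable-length (in principle) data.
Thus, the dictionary coding and probability statistical coding are fundamentally different from each other in terms of compression mechanism. Herein, multi-value arithmetic coding that handles mainly a data stream or byte stream of an English text or the like is taken as an example of universal coding.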
Arithmetic coding is broadly divided into two techniques that have been proposed: binary arithmetic coding and multi-value arithmetic coding. The two differ in that the binary arithmetic coding handles two values corresponding to bits 0 and 1 as the unit of data to be encoded, while the multi-value arithmetic coding handles many values corresponding to, for example, one byte of 8 bits as the unit of data to be encoded. A typical implementation of the binary arithmetic coding is the QM coder employed in JBIG entropy coding, which is a standard technique of binary image compression recommended by the CCITT and ISO. For details, refer, for example, to Chapter 3, Arithmetic Coding, in "International Standards of Multiprocessor Media Coding" (Maruzen, p.68-82, June 1991) (4).
Typical examples of the multi-value arithmetic coding are Witten coding and Abrahamson coding. For details, refer, for example, to "Arithmetic Coding for Data Compression" (Communications of Association for Computing Machinery, Vol. 30(6), p.520-540, July 1987) written by I. H. Witten et al. (5) and "An Adaptive Dependency Source Model for Data Compression" (Communications of Association for Computing Machinery, Vol. 32(1), p.77-83, January 1989) written by D. M. Abrahamson (6). At present, only the binary arithmetic coding is utilized in practice, because it is suitable for images. However, since the multi-value arithmetic coding offers the highest compression performance, its practical utilization is expected.
The probability statistical coding requires, as shown in FIG. 1, an occurrence frequency modeling unit 400 and an entropy coding unit 402. The occurrence frequency modeling unit 400 fetches an input character and the immediately preceding character string (context), and calculates the occurrence frequency of the input character in terms of its subordinate relationship to the immediately preceding character string. The entropy coding unit 402 carries out variable-length encoding to produce a code dynamically on the basis of the occurrence frequency calculated by the occurrence frequency modeling unit 400.
This will be described in further detail. Take for instance a character string abc composed of three characters a, b, and c as shown in FIG. 2A. The relationship to the immediately preceding character string is expressed in the form of the tree structure shown in FIG. 2B. Every time a character string linking the characters at nodes of the tree structure shown in FIG. 2B occurs, the occurrence frequency modeling unit 400 counts up the number of occurrences, and thus obtains the subordinate relationship to the immediately preceding character string, for example, a conditional probability.
Context acquisition methods for obtaining the subordinate relationship of an input character to an immediately preceding character string fall into two classes: a method of acquiring a fixed-order context and a method of acquiring a blend context. Herein, the number of characters in a context is referred to as its order. The method of acquiring a fixed-order context fixes the number of characters in a context. Taking a two-order context for instance, the occurrence frequency modeling unit 400 acquires the context of a character linked to the two immediately preceding characters x_2 and x_1, obtains the subordinate relationship of a character y succeeding the immediately preceding characters x_2 and x_1, for example, a conditional probability ρ(y | x_1, x_2), and hands the obtained probability to the entropy coding unit 402. Here, y is the input character concerned, and x_1 and x_2 are the first and second immediately preceding characters.
The method of acquiring a blend context mixes contexts of different orders. In the case of a fixed-order context, if an immediately preceding character string rarely appears, the estimate of the subordinate relationship to that character string becomes uncertain. By contrast, if the immediately preceding character string appears frequently, the estimate becomes more accurate, which offers the possibility of increasing the order of the context. In general, the larger the order of the context, that is, the longer the immediately preceding character string that is used, the more easily a bias of characters can be grasped and the higher the compression efficiency that can be provided. However, when a large-order context is applied to data whose characters are only weakly correlated, the compression efficiency is low. An attempted solution to this kind of problem is a blend context made by mixing contexts having different orders. In the method of acquiring a blend context, the order of the immediately preceding context is not fixed: when a context appears frequently, the subordinate relationship to a large-order context is drawn out, and when a context rarely appears, the subordinate relationship to a small-order context is drawn out.
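The fixed-order context statistics described above can be sketched as a table of counts keyed by the preceding characters. The class name FixedOrderModel, its methods, and the sample string are hypothetical names introduced only for illustration; the sketch estimates a conditional probability by simple counting and does not attempt to model a blend context.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from typing import Optional


class FixedOrderModel:
    """Counts how often each character follows each fixed-length context."""

    def __init__(self, order: int = 2) -> None:
        self.order = order
        # context string -> Counter of the characters that followed it
        self.counts = defaultdict(Counter)

    def train(self, text: str) -> None:
        for i in range(self.order, len(text)):
            context = text[i - self.order:i]
            self.counts[context][text[i]] += 1

    def probability(self, context: str, symbol: str) -> Optional[Fraction]:
        """Return the estimated conditional probability, or None if the context is unseen."""
        seen = self.counts.get(context)
        if not seen:
            return None  # no statistics for this context yet
        return Fraction(seen[symbol], sum(seen.values()))


model = FixedOrderModel(order=2)
model.train("abcabcabd")
print(model.probability("ab", "c"))  # estimate for c following the context ab
```

A context that has never occurred yields None here, which corresponds to the uncertain estimate mentioned above for rarely appearing contexts.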
The entropy coding unit 402 produces a code according to the occurrence frequency provided by the occurrence frequency modeling unit 400. Typical coding techniques implemented in the entropy coding unit 402 for producing a code dynamically according to the number of occurrences obtained by the occurrence frequency modeling unit 400 include arithmetic coding, dynamic Huffman coding, and self-organization coding. The arithmetic coding is thought to offer the highest encoding efficiency: since a code is produced through computation based on the occurrence probability of each character, a code can be assigned even at a rate of one bit or less per character.
FIGS. 3A to 3C illustrate a procedure of multi-value arithmetic coding. A character string of input characters having a length of a plurality of bits, for example, a plurality of bytes, is associated with one point on the number line [0, 1] and expressed with one code. For brevity's sake, a character string composed of four characters a, b, c, and d will be discussed. First, as shown in FIG. 3A, the occurrence frequencies of the characters are calculated. An occurrence frequency is the probability calculated by dividing the number of occurrences of each character by the total number of occurrences. For example, the occurrence frequency of character a is 0.15, that of character b is 0.25, that of character c is 0.35, and that of character d is 0.25. Next, using the occurrence frequencies shown in FIG. 3A, the characters are rearranged in descending order of frequency, and cumulative occurrence frequencies are calculated as shown in FIG. 3B. A cumulative occurrence frequency is the sum of the occurrence frequencies of the characters ranking lower than the character concerned. Specifically, the cumulative occurrence frequency of character c, which has the highest occurrence frequency, is the sum of the occurrence frequencies of characters b, d, and a, that is, 0.65. Likewise, the cumulative occurrence frequencies of the other characters b, d, and a are 0.40, 0.15, and 0.0 respectively. In this state, when, for example, character c is input, a new interval 404 within the encoding interval [0, 1] defined on the number line is obtained, as shown in FIG. 3C, on the basis of the occurrence frequency freq[c] of input character c, which is 0.35, and its cumulative occurrence frequency cum freq[c], which is 0.65.
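The ranking and cumulative occurrence frequencies described above can be reproduced with a few lines of code. This is a minimal sketch: the function name cumulative_table is hypothetical, exact fractions are used to avoid rounding, and ties (characters b and d) are assumed to keep their original order, as in the text.

```python
from fractions import Fraction


def cumulative_table(freq):
    """Rank characters by descending frequency and sum the lower-ranked frequencies."""
    # sorted() is stable, so characters with equal frequency keep their input order.
    ranked = sorted(freq, key=lambda ch: -freq[ch])
    cum = {}
    running = Fraction(0)
    # Walk from the lowest rank upward, accumulating the frequencies seen so far.
    for ch in reversed(ranked):
        cum[ch] = running
        running += freq[ch]
    return ranked, cum


freq = {
    "a": Fraction(15, 100),
    "b": Fraction(25, 100),
    "c": Fraction(35, 100),
    "d": Fraction(25, 100),
}
ranked, cum = cumulative_table(freq)
for ch in ranked:
    print(ch, float(freq[ch]), float(cum[ch]))
```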
More particularly, the high level H1 of the encoding interval [0, 1] defined on the number line is 1, the low level L1 thereof is 0, and the interval width W1 thereof is 1. The high level H2, low level L2, and interval width W2 of the new interval 404 are calculated from these values on the basis of the occurrence frequency freq[c] of input character c, which is 0.35, and the cumulative occurrence frequency cum freq[c] thereof, which is 0.65. That is to say, the low level L2 of the new interval 404 is calculated using the low level L1 of the previous interval and the interval width W1 thereof as follows:
$$L_2 = L_1 + W_1 \times \mathrm{cum\ freq}[c] = 0.0 + 1.0 \times 0.65 = 0.65$$
The width W2 of the new interval 404 is calculated as follows:
$$W_2 = W_1 \times \mathrm{freq}[c] = 1.0 \times 0.35 = 0.35$$
The high level (upper extreme) H2 of the new interval 404 is calculated as follows:
$$H_2 = L_2 + W_2 = 0.65 + 0.35 = 1.00$$
Since character c is input, the number of occurrences of character c is incremented by one, and the total number of occurrences is incremented by one. Accordingly, the occurrence frequencies of characters a, b, c, and d and the cumulative occurrence frequencies thereof are updated. For brevity's sake, however, the occurrence frequencies and cumulative occurrence frequencies shown in FIGS. 3A and 3B are assumed to remain unchanged. When character a is input, the previous interval 404 is regarded as a new interval [0, 1]. The low level L3, interval width W3, and high level H3 of the new interval 406 within the interval 404 are calculated on the basis of the occurrence frequency freq[a] of input character a that is 0.15 and the cumulative occurrence frequency cum freq[a] thereof that is 0.15.
$$L_3 = L_2 + W_2 \times \mathrm{cum\ freq}[a] = 0.65 + 0.35 \times 0.15 = 0.7025$$
$$W_3 = W_2 \times \mathrm{freq}[a] = 0.35 \times 0.15 = 0.0525$$
$$H_3 = L_3 + W_3 = 0.7025 + 0.0525 = 0.7550$$
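The interval update used for each input character can be written as a single small function. This is a sketch only: the name narrow and its argument order are hypothetical, and exact fractions stand in for whatever fixed-point arithmetic a real coder would use. The usage line reproduces the first step of the example, the input of character c into the initial interval [0, 1].

```python
from fractions import Fraction


def narrow(low, width, freq, cum_freq):
    """Return (low, width, high) of the sub-interval selected by one character."""
    new_low = low + width * cum_freq   # shift the low level by the cumulative frequency
    new_width = width * freq           # shrink the width by the character's frequency
    return new_low, new_width, new_low + new_width


# Character c: freq[c] = 0.35, cum freq[c] = 0.65, starting from [0, 1].
low, width, high = narrow(Fraction(0), Fraction(1), Fraction(35, 100), Fraction(65, 100))
print(float(low), float(width), float(high))
```

Applying the same function repeatedly, each time feeding the returned low level and width back in, yields the nested intervals that the procedure describes.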
When character d is input, the previous interval 406 is regarded as a new interval [0, 1]. The low level L4, interval width W4, and high level H4 of a new interval 408 within the interval 406 are calculated on the basis of the occurrence frequency freq[d] of input character d that is 0.25 and the cumulative occurrence frequency cum freq[d] thereof that is 0.4.
$$L_4 = L_3 + W_3 \times \mathrm{cum\ freq}[d], \qquad W_4 = W_3 \times \mathrm{freq}[d], \qquad H_4 = L_4 + W_4$$
If the input character d is the last character, any values defining the interval 408, for example, any values determined by the high level and low level of the interval 408, are output as an arithmetic code. To be more specific, the previous interval 406 is normalized to [0, 1] and divided into four subdivision intervals according to the thresholds 1/4, 1/2, and 3/4. The subdivision interval within the normalized previous interval 406 to which the high level H4 and low level L4 of the last interval 408 belong is detected and used to produce a code. A code is produced under the following conditions in relation to the previous interval 406:
1. if the high level H4 or low level L4 is equal to or larger than 1/2, bit 1 is produced, while if it falls below 1/2, bit 0 is produced; and
2. if the high level H4 or low level L4 is equal to or larger than 1/4 and is equal to or smaller than 3/4, bit 1 is produced, while if it falls outside this range, bit 0 is produced.
In the case of the last interval 408, since the high level H4 equals 0.7445, bit 1 is produced under the above Condition 1. Under Condition 2, bit 1 is also produced. Since the low level L4 equals 0.7235, bit 1 is produced under Condition 1. Under Condition 2, bit 1 is also produced. The arithmetic code of the character string cad is therefore 1111. In practice, the occurrence frequency and cumulative occurrence frequency of a character are not dealt with directly. That is to say, when a character is input and encoded, the number of occurrences of the character, the cumulative number of occurrences thereof, and the total number of occurrences are calculated. When an occurrence frequency and cumulative occurrence frequency are needed, the number of occurrences is divided by the total number of occurrences to give the occurrence frequency, and the cumulative number of occurrences is divided by the total number of occurrences to give the cumulative occurrence frequency. From this viewpoint, the occurrence frequency is the number of occurrences normalized relative to the total number of occurrences, and the cumulative occurrence frequency is the cumulative number of occurrences normalized relative to the total number of occurrences. According to this kind of multi-value arithmetic coding, a character string having a higher occurrence frequency provides a wider last interval and can be expressed with a shorter code. The amount of data is thus compressed. This method offers high compression efficiency because no restrictions are imposed on the minimum unit of bit representation of a code, and the minimum unit can be set to one bit or less.
FIG. 4 is a block diagram of a known data compression system adopting arithmetic coding. The data compression system comprises an occurrence frequency rank rearranging unit 410, counter 412, frequency data storage unit 414, dictionary 416, and arithmetic coding unit 418. In this example, the number of occurrences freq[ ] is used instead of an occurrence frequency, and a cumulative number of occurrences cum freq[ ] is used instead of a cumulative occurrence frequency. The dictionary 416 may be incorporated in the frequency data storage unit 414. The counter 412 counts up the number of occurrences of an input character, freq[ ], calculates the cumulative number of occurrences of the character, cum freq[ ], and the total number of occurrences, cum freq[0], and stores them in the frequency data storage unit 414. At every input of a character, the frequency rank rearranging unit 410 rearranges all characters present in the frequency data storage unit 414 in descending order of number of occurrences freq[ ], and stores the numbers of occurrences freq[ ] and cumulative numbers of occurrences cum freq[ ] in relation to register numbers indicating ranks. At the same time, symbols in the dictionary 416 are rearranged in one-to-one correspondence with the register numbers indicating ranks in descending order of number of occurrences stored in the frequency data storage unit 414. The frequency rank rearranging unit 410 retrieves the register number indicating the rank of an input character k by referencing the dictionary 416 and sends it to the arithmetic coding unit 418. In response, the arithmetic coding unit 418 references the frequency data storage unit 414 according to the register number and obtains the number of occurrences of the input character k, freq[k], the cumulative number of occurrences thereof, cum freq[k], and the total number of occurrences cum freq[0].
The number of occurrences freq[k] and cumulative number of occurrences cum freq[k] are divided by the total number of occurrences cum freq[0], whereby an occurrence frequency and cumulative occurrence frequency are calculated. Based on the calculated occurrence frequency and cumulative occurrence frequency, a new interval is computed.
The operations of the system shown in FIG. 4 will be described. The frequency rank rearranging unit 410 references the dictionary 416 according to an input character k so as to retrieve a register number indicating the rank of the character in terms of number of occurrences, and outputs the register number to the arithmetic coding unit 418. The arithmetic coding unit 418 references the frequency data storage unit 414 according to the register number (rank) sent from the frequency rank rearranging unit 410 so as to obtain the number of occurrences, freq[k], of the input character k, the cumulative number of occurrences thereof, cum freq[k], and the total number of occurrences, cum freq[0]. An occurrence frequency and cumulative occurrence frequency are calculated by dividing the number of occurrences freq[k] and cumulative number of occurrences cum freq[k] by the total number of occurrences cum freq[0]. Based on the previous interval width Wk-1, the high level Hk, low level Lk, and width Wk of a new interval are calculated. If the input character is the last character, any values defining the new interval are output as a code. Meanwhile, the counter 412 increments by one the number of occurrences, freq[k], of the input character k, the cumulative number of occurrences thereof, cum freq[k], and the total number of occurrences cum freq[0], and updates the values associated with the register numbers in the frequency data storage unit 414. The frequency rank rearranging unit 410 then rearranges the contents of the frequency data storage unit 414 and dictionary 416 in descending order of updated number of occurrences freq[ ]. The flowchart of FIG. 5 describes multi-value arithmetic coding based on one-fold history estimation, that is, estimation of the subordinate relationship of an occurring character to a zero-order context, in which the history relative to other characters, including the immediately preceding character, is not taken into account.
The initial values defining an encoding interval are as follows: the high level H0 is 1, the low level L0 is 0, and the interval width W0 is 1.0. i denotes the ranks (register numbers) of characters in the dictionary, in which the characters are arranged in descending order of number of occurrences, and assumes 1, 2, 3, ..., A. freq[i] denotes the number of occurrences of the character having the i-th rank (register number) in the dictionary. cum freq[i] denotes the cumulative number of occurrences of the character having the i-th rank in the dictionary. Thus, cum freq[1] denotes the cumulative number of occurrences of the character ranking first, and cum freq[A] denotes the cumulative number of occurrences of the character having the lowest rank A. Furthermore, the encoding interval is normalized to [1, 0], where 1 is associated with cum freq[1] and 0 is associated with cum freq[A]. At step S1, the initial values below are set.
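1. All single characters are allocated to items i of the dictionary D.
D(i)=i
where i is 1, 2, 3, ..., A, and A is the number of alphabets or symbols and assumes 256.
2. The i-th ranks (register numbers) are assigned to the characters.
I(i)=i
where i is 1, 2, ..., A.
3. The numbers of occurrences of all the characters are initialized.
freq(i)=1
where i is 1, 2, ..., A.
4. The cumulative numbers of occurrences of all the characters are initialized.
cum freq(i)=A-i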
After the foregoing initialization is completed, the leading character k of an input character string of source data is input at step S2. Control is then passed to step S3, and rank j of the input character k, which is a register number, is retrieved from the dictionary. The rank j is given as j=I(k). A list table that lists cumulative numbers of occurrences in one-to-one correspondence with code intervals is referenced on the basis of rank j, whereby the cumulative number of occurrences cum freq[j] is retrieved. This operation is expressed as i=D(j). Arithmetic coding is carried out on the basis of rank j retrieved from the dictionary. The arithmetic coding based on rank j proceeds as follows: the cumulative number of occurrences of the character having the rank j, cum freq[j], is divided by the total number of occurrences cum freq[0] in order to obtain a cumulative occurrence frequency; a new interval is defined on the basis of the previous interval width and low level; and if the input character is the last character of a given number of bytes regarded as a unit of encoding, any values defining the new interval are output as a code. At step S4, if characters having the same number of occurrences as the input character k of rank j rank higher, the input character and the characters ranking higher are rearranged together with their numbers of occurrences and cumulative numbers of occurrences. First, among characters ranking lower than the character of rank j, a character of rank r that is immediately lower than the character k of rank j and has a number of occurrences different from freq[k] of the character k is searched for. In short, a character of rank r satisfying freq[r]!=freq[j] is searched for. Note that != means "not equal to." Furthermore, assume that the number of occurrences associated with the interval expressing the r-th character in the dictionary ranks I(r)=s in the list table, and the cumulative number of occurrences associated with the interval ranks D(r)=t in the list table.
The j-th and r-th characters in the dictionary are switched, the j-th and s-th numbers of occurrences in the list table listing the numbers of occurrences vs. intervals are switched, and the j-th and t-th cumulative numbers of occurrences in the list table listing the cumulative numbers of occurrences vs. intervals are switched. In other words, switching of the numbers of occurrences I(j)=s, switching of the cumulative numbers of occurrences D(j)=t, switching of dictionary ranks I(r)=j, and switching of dictionary characters D(r)=D(j) are carried out. Assuming that characters of, for example, ranks j-1, j-2, and j-3 higher than rank j of the input character k, which have the same number of occurrences as the input character k, are present, the highest rank r=j-3 is obtained as the switching destination, and the characters of ranks r and j are switched together with their numbers of occurrences and cumulative numbers of occurrences. At step S5, the number of occurrences freq[k] of the switched input character k is incremented by one. The cumulative numbers of occurrences cum freq[ ] associated with ranks r+1 and higher are incremented by one. Needless to say, the total number of occurrences cum freq[0] is also incremented by one. Control is then returned to step S2, and the next character is input. The foregoing processing is repeated until no source data remains.
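The rank lookup, switching, and counting of steps S3 to S5 can be sketched as follows. This is an illustrative simplification and not the flowchart of FIG. 5 itself: the class AdaptiveModel and its attribute names are hypothetical, ranks are 0-based rather than 1-based, and the cumulative number of occurrences is recomputed on demand as the sum of the counts of lower-ranked characters (an assumption based on the definition given earlier) instead of being stored and incremented.

```python
class AdaptiveModel:
    """Keeps symbols ranked by descending count, switching ranks as counts change."""

    def __init__(self, alphabet_size: int = 256) -> None:
        # Initial state: D(i) = i, I(i) = i, freq(i) = 1 (0-based here).
        self.sym_at_rank = list(range(alphabet_size))  # plays the role of D
        self.rank_of_sym = list(range(alphabet_size))  # plays the role of I
        self.freq = [1] * alphabet_size                # counts indexed by rank

    def cum_freq(self, rank: int) -> int:
        """Sum of the counts of all symbols ranking lower than `rank`."""
        return sum(self.freq[rank + 1:])

    def total(self) -> int:
        return sum(self.freq)

    def update(self, sym: int) -> int:
        """Count one occurrence of `sym`; return the rank it occupies afterwards."""
        j = self.rank_of_sym[sym]
        r = j
        # Find the highest rank whose count equals that of the input symbol.
        while r > 0 and self.freq[r - 1] == self.freq[j]:
            r -= 1
        if r != j:
            # Switch the two symbols; their counts are equal, so freq needs no swap.
            other = self.sym_at_rank[r]
            self.sym_at_rank[r], self.sym_at_rank[j] = sym, other
            self.rank_of_sym[sym], self.rank_of_sym[other] = r, j
        self.freq[r] += 1  # the counts stay in descending order by rank
        return r


model = AdaptiveModel(alphabet_size=4)
first = model.update(2)   # symbol 2 moves to the top rank
second = model.update(3)  # symbol 3 moves just below it
print(model.sym_at_rank, model.freq, model.cum_freq(0), model.total())
```

Switching with the highest-ranked symbol of equal count before incrementing is what keeps the table sorted without a full re-sort at every input character.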
The flowchart of FIG. 6 describes multi-value arithmetic coding based on two-fold history estimation, that is, estimation of the subordinate relationship of an occurring character to a one-order context, in which the history relative to the immediately preceding character is taken into account (refer to the reference literature (6) written by D. M. Abrahamson). In the two-fold history estimation, a combination of two consecutive characters is taken into account, a dictionary is created for each character, and the occurrence frequencies of the characters immediately succeeding each character are registered in the dictionary. For example, in the dictionary for character a, characters succeeding character a, for example, b of ab and c of ac, are registered. The numbers of occurrences of the characters succeeding character a are obtained and registered. At step S1 of initialization, values D(p, i) are allocated to the items of the dictionary D in one-to-one correspondence with characters i succeeding character p. Here, p and i assume 1, 2, ..., A, where A denotes the number of alphabets. The numbers assigned to the characters shall be I(p, i), where p and i likewise assume 1, 2, ..., A. The other operations are identical to those in the one-fold history estimation described in FIG. 5.
Aside from arithmetic coding, entropy coding in which the tree structure shown in FIG. 2B is used for dynamic encoding includes dynamic Huffman coding and splay coding, which is a kind of self-organization coding. For the details of the dynamic Huffman coding, refer to "Dynamic Huffman Coding" (Journal of Algorithms, Vol. 6, p.163-180, 1985) written by D. E. Knuth, "Design and Analysis of Dynamic Huffman Codes" (Journal of ACM, Vol. 34, No. 4, p.825-845, 1987) written by J. S. Vitter, and Chapter 5 in "Guide to Document Data Compression Algorithms" (CQ Publishing Co., Ltd. (1994), Patent No. 94-11634) written by Tomohiko Uematsu (6). For the details of the splay coding, refer to "Application of Splay Tree to Data Compression" (Commun. of ACM, Vol. 31, No. 8, p.996-1007, 1987) (Patent No. 94-01147) written by D. W. Jones (7).
However, the known probability statistical coding poses a problem in that it is quite time-consuming. This is because, as shown in FIG. 1, the entropy coding unit 402 carries out dynamic encoding only after the occurrence frequency modeling unit 400 has calculated an occurrence frequency. When the occurrence frequency modeling unit 400 handles a blend context, the subordinate relationship to a context, such as a conditional probability, is obtained in order from a large-order context to a small-order context. The processing time is therefore very long, and the known probability statistical coding cannot be utilized in practice.
Moreover, the known coding is fundamentally intended mainly to handle English text, for example, ASCII 8-bit codes, and to perform byte-by-byte encoding in the byte-stream direction.
The known coding techniques are not very efficient in compressing data whose word structure consists of a plurality of bytes, such as two-byte Unicode adopted as an international language code, a Japanese code, full-color image data composed of three bytes of red, green, and blue data, and a 4-byte or 8-byte program code.
In short, according to the known coding techniques, since data whose word structure consists of a plurality of bytes is processed byte by byte, the processing is time-consuming. For example, if byte-by-byte encoding is simply extended to a word length so that the unit of encoding grows from one byte to two bytes, satisfactorily efficient compression of two-byte units requires an amount of data, processing time, and storage capacity 256 times as large as those needed for efficient compression of one-byte units. The known coding techniques therefore cannot be used in practice.