1. Field of the Invention
The present invention relates to a coding apparatus and a decoding apparatus which can be optimally applied in compressing and reconstructing various data such as CAD data, document data, etc.
2. Description of the Related Art
Recently, an increasing volume of various types of data such as character codes, image data, etc. has been processed in computers. When such a large volume of data is stored or transmitted to a distant destination, it is common to compress the data by removing its redundant portion, thereby reducing the required storage capacity and improving the transmission speed.
There are two common data compressing systems. They are a dictionary type coding system based on the similarity in data sequences; and a probability statistic type coding system based on the frequency of occurrences of data strings.
Typical examples of the dictionary type coding system are the LZ77 system and the LZ78 system.
In the LZ77 system, a buffer of a predetermined size is provided, the previously input data in the buffer is searched for the position at which the longest match with the current data begins, and the matching position and the matching length are used as codes.
FIG. 1 shows the coding method in the conventional LZ77 system.
In FIG. 1, assume that ‘a b a b c d e f a b c d e f g h . . .’ is input as data to be compressed, and each character of the data to be compressed is assigned an input number indicating an occurrence position.
First, if ‘a’ having the input number 1 is input, then the character ‘a’ is coded as is because it has no preceding characters. Then, when a character ‘b’ having the input number 2 is input, it is compared with the previously input characters. Since no previously input character matches the character ‘b’, the character ‘b’ is coded as is. Furthermore, when a character string ‘a b’ having the input numbers 3 and 4 is input, it is compared with the previously input character strings. As a result, since the character string matches a character string ‘a b’ having the input numbers 1 and 2, the character string ‘a b’ having the input numbers 3 and 4 is coded using the matching position and matching length. In this example, since the matching position is the position of the character ‘a’ having the input number 1, and the matching length is 2, ‘(1, 2)’ is coded as the code of the character string ‘a b’ having the input numbers 3 and 4.
Next, when a character ‘c’ having the input number 5 is input, it does not match any of the previously input characters. Therefore, the character ‘c’ is coded as is. When a character ‘d’ having the input number 6 is input, it does not match any of the previously input characters. Therefore, the character ‘d’ is coded as is. When a character ‘e’ having the input number 7 is input, it does not match any of the previously input characters. Therefore, the character ‘e’ is coded as is. When a character ‘f’ having the input number 8 is input, it does not match any of the previously input characters. Therefore, the character ‘f’ is coded as is.
Then, when a character string ‘a b c d e f’ having the input numbers 9 through 14 is input, it matches a character string ‘a b c d e f’ having the input numbers 3 through 8. Therefore, the character string ‘a b c d e f’ having the input numbers 9 through 14 is coded using the matching position and the matching length. In this example, since the matching position is the position of the character ‘a’ having the input number 3, and the matching length is 6, ‘(3, 6)’ is coded as the code of the character string ‘a b c d e f’ having the input numbers 9 through 14.
When a character ‘g’ having the input number 15 is input, it does not match any of the previously input characters. Therefore, the character ‘g’ is coded as is. When a character ‘h’ having the input number 16 is input, it does not match any of the previously input characters. Therefore, the character ‘h’ is coded as is.
On the other hand, in the LZ78 system, a previously input character string is entered in a dictionary, and an entered input number is coded.
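The LZ77 walk-through of FIG. 1 can be reproduced with a minimal Python sketch. It is an illustration only, not the conventional implementation: the function name lz77_encode, the unbounded search buffer, the preference for the earliest longest match, and the min_len parameter (matches shorter than two characters are emitted as literals) are assumptions made for this example. Matching positions are reported as 1-based input numbers, as in the description above.

```python
def lz77_encode(data, min_len=2):
    """Toy LZ77 coder: emits literals or (matching position, matching length).

    Assumptions for illustration: the whole history is searched (no buffer
    limit), the earliest longest match wins, and positions are 1-based.
    """
    out = []
    i = 0
    while i < len(data):
        best_pos, best_len = 0, 0
        # Try every earlier start position and measure how far it matches.
        for start in range(i):
            length = 0
            while (i + length < len(data)
                   and data[start + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len >= min_len:
            out.append((best_pos + 1, best_len))  # 1-based input number
            i += best_len
        else:
            out.append(data[i])  # no usable match: code the character as is
            i += 1
    return out


codes = lz77_encode("ababcdefabcdefgh")
print(codes)
```

For the input of FIG. 1 this yields the literals ‘a’ and ‘b’, the pair (1, 2), the literals ‘c’ through ‘f’, the pair (3, 6), and the literals ‘g’ and ‘h’, in agreement with the description above.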
The LZ77 system has higher compression performance than the LZ78 system for data containing a repetition of a long character string. On the other hand, the LZ78 system has higher compression performance than the LZ77 system for data containing a repetition of a comparatively short character string. The LZ77 system and the LZ78 system are described in, for example, the document “The Introduction to the Document Data Compression Algorithm” by Tomohiko Uematsu published by CQ Publishing Company.
Typical examples of the probability statistic type coding system are the arithmetic coding system and the Huffman coding system. Both the arithmetic coding system and the Huffman coding system obtain a compression effect by allotting a short code length to a character having a high occurrence probability according to the statistical occurrence frequency of each character.
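The principle of allotting shorter codes to more frequent characters can be illustrated with a small Python sketch that computes Huffman code lengths. The function name huffman_code_lengths and the use of a binary heap are assumptions made for this example; the sketch returns only the code length per character, not the code words themselves.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {character: Huffman code length in bits} for the given text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct character still needs one bit per occurrence.
        return {sym: 1 for sym in freq}
    # Heap entries: (count, tie-breaker, {symbol: depth so far}).
    heap = [(count, idx, {sym: 0})
            for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


print(huffman_code_lengths("aaaabbc"))
```

In the text "aaaabbc", the most frequent character ‘a’ receives a 1-bit code, while ‘b’ and ‘c’ each receive 2-bit codes.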
The arithmetic coding system is described in, for example, the document “Arithmetic coding revisited” by Alister Moffat et al., 1995, IEEE Data Compression Conference, p202-211. The Huffman coding system is described in, for example, the document “The Introduction to the Document Data Compression Algorithm” by Tomohiko Uematsu published by CQ Publishing Company.
To obtain a higher compression effect, a variable length coding method has been suggested based on the conditional occurrence probability P(X_t | X_{t−1}), which takes into account not the occurrence probability P(X_t) of a single character but the dependence (hereinafter referred to as a context) between an input character and the character immediately preceding it. This method is described in, for example, the document “Unbounded Length Contexts for PPM” by John G. Cleary et al., 1995, IEEE Data Compression Conference, p52-61.
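The difference between the single-character probability and the conditional probability given the preceding character can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the method of the cited document: the function names order1_model and ideal_bits are hypothetical, the context is fixed at one character, and probabilities are estimated from simple counts over the whole text.

```python
import math
from collections import Counter, defaultdict


def order1_model(text):
    """Estimate P(current | previous) from adjacent character pairs."""
    counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(followers.values())
               for cur, n in followers.items()}
        for prev, followers in counts.items()
    }


def ideal_bits(probability):
    """Ideal code length in bits for an event with the given probability."""
    return -math.log2(probability)


model = order1_model("abababac")
print(model["a"])                    # what follows 'a'
print(ideal_bits(model["a"]["c"]))   # cost of 'c' in the context of 'a'
```

In "abababac", the character ‘a’ always follows ‘b’, so in that context it costs zero bits in the ideal case, whereas without a context ‘a’ occurs in only half of the positions.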
The probability statistic type coding system as well as the LZ78 system has higher compression performance for data containing a repetition of a comparatively short character string. Normally, the LZ78 system has a higher processing speed than the probability statistic type coding system. On the other hand, the probability statistic type coding system has a higher compression rate than the LZ78 system.
However, the LZ78 system and the probability statistic type coding system achieve a high compression rate for data containing a repetition of a comparatively short character string, but cannot achieve a sufficient compression rate for data containing a repetition of a long character string.
On the other hand, the LZ77 system achieves a high compression rate for data containing a repetition of a long character string, but cannot achieve a sufficient compression rate for data containing a repetition of a comparatively short character string.
Therefore, the conventional compression systems have difficulty in obtaining a high compression rate for data containing repetitions of both long character strings and comparatively short character strings.
The present invention aims at providing a data coding apparatus capable of efficiently compressing both long and short character strings.
To solve the above described problem, the present invention includes a symbol string detection unit for detecting a second symbol string matching a first symbol string having a predetermined length from an input symbol string; a matching length detection unit for detecting a matching length between a third symbol string following the first symbol string and a fourth symbol string following the second symbol string; and a coding unit for coding the input symbol string based on the symbol string detected by the symbol string detection unit and the matching length detected by the matching length detection unit.
Thus, for input data having a repetition of long symbol strings, a part of a matching symbol string can be coded based on the matching length. Accordingly, the input data having a repetition of long symbol strings can be efficiently compressed. In addition, since the remaining portion of the matching symbol string is used as a code for detecting the matching position, the matching position can be detected without newly inserting a code for that purpose. As a result, even when input data having a repetition of short symbol strings is coded using a matching length, deterioration of the compression rate caused by a large number of new codes inserted for detection of a matching position can be prevented.
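The detection of the second symbol string and of the matching length can be sketched in Python as follows. This is a hedged illustration rather than the claimed apparatus: the function name match_after_context, the choice of the most recent earlier occurrence as the second symbol string, and the 0-based positions are assumptions made for this example. The first symbol string is taken to be the ctx_len symbols immediately before position pos.

```python
def match_after_context(data, pos, ctx_len):
    """Find an earlier occurrence of the ctx_len symbols preceding pos and
    measure how far the symbols following both occurrences agree.

    Returns (start of the earlier continuation, matching length), using
    0-based positions, or None when the context did not occur before.
    """
    if pos < ctx_len:
        return None
    context = data[pos - ctx_len:pos]           # first symbol string
    # Most recent earlier occurrence (assumption); it must start before
    # the context itself, hence the end bound pos - 1.
    found = data.rfind(context, 0, pos - 1)     # second symbol string
    if found == -1:
        return None
    prev = found + ctx_len                      # start of fourth string
    length = 0
    while (pos + length < len(data)
           and data[prev + length] == data[pos + length]):
        length += 1
    return prev, length


# Context 'ab' before position 10 also occurs at positions 2..3, and the
# four symbols 'cdef' follow both occurrences.
print(match_after_context("ababcdefabcdefgh", 10, 2))
```

Because the decoder has already reconstructed the symbols before pos, it can repeat the same context search and therefore needs only the matching length, not an explicit matching position.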
Furthermore, according to an aspect of the present invention, when a first symbol string matching a second symbol string having a predetermined length occurs, a third symbol string following the first symbol string is coded based on the matching length between the third symbol string and a fourth symbol string following the second symbol string. The portion not coded based on the matching length is coded using the code of a symbol immediately succeeding a symbol string which is a context.
Thus, for input data having a repetition of long symbol strings, a matching symbol string can be coded based on the matching length. Input data having a repetition of short symbol strings can be coded by allotting a shorter code length to a symbol string having a higher occurrence probability. As a result, a high compression rate can be attained for both data having a repetition of long symbol strings and data having a repetition of short symbol strings.
Furthermore, according to another aspect of the present invention, when a first symbol string matching a second symbol string having a predetermined length occurs, a third symbol string following the first symbol string is coded based on the matching length between the third symbol string and a fourth symbol string following the second symbol string. The portion not coded based on the matching length is coded by retrieving a coded word corresponding to the current symbol string from a dictionary in which previously occurring symbol strings are entered in association with coded words.
Thus, for input data having a repetition of long symbol strings, a matching symbol string can be coded based on the matching length, and input data having a repetition of short symbol strings can be coded by the LZ78 system. As a result, a high compression rate can be attained for both data having a repetition of long symbol strings and data having a repetition of short symbol strings.
According to a further aspect of the present invention, when a first symbol string matching a second symbol string having a predetermined length occurs, a third symbol string following the first symbol string is coded based on the matching length between the third symbol string and a fourth symbol string following the second symbol string. The data coded based on the matching length is further coded using the code of a symbol immediately succeeding a symbol string which is a context.
Thus, for input data having a repetition of long symbol strings, a matching symbol string can be coded based on the matching length. Accordingly, the input data having a repetition of long symbol strings can be efficiently compressed. In addition, when a short symbol string repeatedly occurs in compressed data coded based on a matching length, the compressed data coded based on the matching length can be furthermore compressed by allotting a short code length to a symbol string having a high occurrence probability, thereby attaining a high compression rate.
According to a further aspect of the present invention, when a first symbol string matching a second symbol string having a predetermined length occurs, a third symbol string following the first symbol string is coded based on the matching length between the third symbol string and a fourth symbol string following the second symbol string. The data coded based on the matching length is further coded by retrieving a coded word corresponding to the current symbol string from a dictionary in which previously occurring symbol strings are entered in association with coded words.
Thus, for input data having a repetition of long symbol strings, a matching symbol string can be coded based on the matching length. Accordingly, the input data having a repetition of long symbol strings can be efficiently compressed. In addition, the compressed data coded based on the matching length can be further compressed by the LZ78 system. Therefore, a high compression rate can be attained for both data having a repetition of long symbol strings and data having a repetition of short symbol strings.
According to a further aspect of the present invention, the occurrence position of a symbol string which previously occurred is stored in association with a predetermined code, and it is checked whether or not a code corresponding to a symbol string immediately before a symbol string coded based on a matching length is stored, thereby detecting the occurrence position of a previous symbol string to be compared when the symbol string is coded based on the matching length.
Thus, when the occurrence position of the previous symbol string to be compared based on the matching length is checked, it is not necessary to trace back through the previous symbol strings one by one until a matching symbol string is detected, and the process can therefore be performed at a higher speed.
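The table lookup described above can be sketched in Python with a dictionary that maps each context to the position at which it last ended. This is an assumption-laden illustration: the function name previous_positions is hypothetical, the context string itself serves as the table key in place of the predetermined code, and only the most recent occurrence is kept.

```python
def previous_positions(data, ctx_len):
    """For every position pos >= ctx_len, report where the symbols that
    followed the latest earlier occurrence of the same context begin.

    Returns {pos: previous position or None}, with 0-based positions.
    """
    table = {}     # context -> position just after its latest occurrence
    result = {}
    for pos in range(ctx_len, len(data)):
        context = data[pos - ctx_len:pos]
        # A single table lookup replaces a backward scan of the history.
        result[pos] = table.get(context)
        table[context] = pos
    return result


positions = previous_positions("ababcdefabcdefgh", 2)
print(positions[10])   # earlier continuation of context 'ab'
```

Each lookup and update takes roughly constant time on average, so the cost of locating the comparison position does not grow with the amount of data already processed.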
Furthermore, according to a further aspect of the present invention, when a matching length is shorter than a predetermined value, the symbol string is not coded based on the matching length.
Thus, when a matching length is short, the deterioration of a compression rate caused by adding a code indicating a matching length can be successfully avoided, thereby improving the compression rate in a coding process.
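The effect of the threshold can be illustrated with a toy Python encoder. It is a sketch under stated assumptions, not the coding apparatus itself: the function name encode, the token format ("match", length) and ("lit", symbol), the default values of ctx_len and min_match, and the backward search for the most recent occurrence of the context are all chosen for illustration, and the tokens are not converted into an actual bit stream.

```python
def encode(data, ctx_len=2, min_match=3):
    """Toy encoder: after a repeated context, emit a match length only when
    it reaches min_match; otherwise emit the next symbol as a literal."""
    tokens = []
    i = 0
    while i < len(data):
        length = 0
        if i >= ctx_len:
            # Most recent earlier occurrence of the preceding context.
            found = data.rfind(data[i - ctx_len:i], 0, i - 1)
            if found != -1:
                prev = found + ctx_len
                while (i + length < len(data)
                       and data[prev + length] == data[i + length]):
                    length += 1
        if length >= min_match:
            tokens.append(("match", length))   # code by matching length
            i += length
        else:
            tokens.append(("lit", data[i]))    # too short: code the symbol
            i += 1
    return tokens


print(encode("ababcdefabcdefgh", ctx_len=2, min_match=3))
print(encode("ababcdefabcdefgh", ctx_len=2, min_match=5))
```

With min_match set to 3, the four symbols following the second ‘a b’ context collapse into one match token; with min_match set to 5, the same run falls below the threshold and every symbol is emitted as a literal.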