The present invention relates to a data compressing apparatus and a data decompressing apparatus for compressing and decompressing source data in which a plurality of kinds of character codes are mixed in one character code space and, more particularly, to a data compressing apparatus and a data decompressing apparatus which efficiently perform compression and decompression in accordance with the kind of character code, for example, for Unicode, in which the codes of a plurality of languages are mixed, or for the JIS code or shift JIS code of the Japanese code space.
In recent years, various kinds of data such as character codes, vector information, and images have come to be handled by computers, and the amount of data to be handled is increasing rapidly. When a large amount of data is handled, eliminating the redundant portion of the data and compressing the data amount reduces the required memory capacity and allows the data to be transmitted at a higher speed. Universal encoding has been proposed as a method which can compress various kinds of data. Although the invention is not limited to the compression of character codes and can also be applied to data in various fields, the terminology of information theory is used hereinbelow: one word unit of data is called a character, and an arbitrary sequence of connected words is called a character string.
As a representative method of universal encoding, there is the Ziv-Lempel code (for details, refer to, for example, Munakata, "Data compressing method of Ziv-Lempel", The Information Processing, Vol. 26, No. 1, 1985). Two algorithms have been proposed for the Ziv-Lempel code: a slide dictionary method and a dynamic dictionary method. As an improvement of the slide dictionary method, there is the LZSS code (refer to T. C. Bell, "Better OPM/L Text Compression", IEEE Trans. on Commun., Vol. COM-34, No. 12, Dec. 1986). As an improvement of the dynamic dictionary method, there is the LZW (Lempel-Ziv-Welch) code (refer to T. A. Welch, "A Technique for High-Performance Data Compression", Computer, June 1984). Among those codes, the LZW code is used for file compression and the like in memory devices because it can be processed at a high speed and its algorithm is simple.
FIG. 1 shows a tree structure of a dictionary in the LZW code. FIG. 2 shows an encoding of character strings in the LZW code. The LZW encoding uses a rewritable dictionary. The input character codes are divided into mutually different character strings, each character string is given a number in order of its appearance and is registered into the dictionary, and the character string currently being inputted is encoded as the number of the longest matching character string already registered in the dictionary. The matched character string, extended by the one character which did not match, is then registered as a new entry.
The encoding will now be described in detail with reference to FIGS. 3 and 4. To simplify the explanation, an encoding of data comprising a combination of the three characters "a", "b", and "c" will be explained as an example. First, the input data in FIG. 3 is read from the left to the right. When the first character "a" is inputted, since there is no matching character string other than "a" in the dictionary in FIG. 4, the reference numeral 1 of "a" is outputted as a code word. The reference numeral 4 is assigned to the character string "ab", which is obtained by appending the next character "b", and the resultant character string is registered into the dictionary. In the actual registration, the character string is registered in the form "1b", namely, the reference numeral of the prefix followed by the appended character. Subsequently, the second character "b" is positioned at the head of the next character string. Since there is no matching character string other than "b" in the dictionary, the reference numeral 2 is outputted as a code word, and the expanded character string "ba" is given the reference numeral 5 and is actually registered into the dictionary in the form "2a". The third character "a" is positioned at the head of the next character string, and the above processes are continued in a similar manner.
The flowchart of FIG. 5 shows the algorithm of the LZW encoding. First, in step S1, a character string consisting of one character is registered in advance as an initial value for every character, and the encoding is then started. In step S2, the first input character K is converted to its reference numeral ω, which is used as the prefix string for retrieving the dictionary. In step S3, the next character K of the input data is read. In step S4, the dictionary is searched to see whether the character string (ωK), obtained by appending the character K read in step S3 to the prefix string ω, exists in the present dictionary. If YES in step S4, the character string (ωK) is replaced in step S5 by its reference numeral, which becomes the new ω. A check is made in step S6 to see whether the input data has been finished. If not, the processing routine returns to step S3, and the retrieval of the maximum coincidence length is continued until the character string (ωK) is no longer found in the dictionary. When the character string (ωK) does not exist in the dictionary in step S4, step S7 follows and the current reference numeral ω (initially that of the character K obtained in step S2) is outputted as a code word. A new reference numeral is assigned to the character string (ωK), and the resultant character string is registered into the dictionary. Further, the character K read in step S3 is converted to its reference numeral and set as the new ω, the dictionary address N is incremented, and the end-of-input check of step S6 is executed. After that, the processing routine returns to step S3 and the next character K is read.
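The encoding loop described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of FIG. 5: the function name `lzw_encode`, the variable names, and the input string "ababcbabab" are assumptions chosen so that the output reproduces the code words 1, 2, 4, ... and the sixth code 8 mentioned in the text; the actual input data of FIG. 3 is not reproduced here. New entries are stored in the (ω, K) form, for example (1, "b") for "ab".

```python
def lzw_encode(text, alphabet):
    """Return the list of LZW reference numerals for `text` (illustrative sketch)."""
    # Initial registration: one entry per single character, numbered from 1.
    single = {ch: ref for ref, ch in enumerate(alphabet, start=1)}
    # Extended entries are keyed as (omega, K), e.g. (1, "b") -> 4 for "ab".
    table = {}
    next_ref = len(single) + 1
    codes = []
    if not text:
        return codes

    omega = single[text[0]]            # first character becomes the prefix
    for k in text[1:]:                 # read the next character K
        if (omega, k) in table:        # is (omega K) already registered?
            omega = table[(omega, k)]  # yes: extend the current match
        else:
            codes.append(omega)        # no: output the longest match
            table[(omega, k)] = next_ref   # register (omega K) under a new numeral
            next_ref += 1
            omega = single[k]          # K starts the next character string
    codes.append(omega)                # flush the final match at end of input
    return codes


if __name__ == "__main__":
    print(lzw_encode("ababcbabab", "abc"))  # [1, 2, 4, 3, 5, 8]
```

With this assumed input, the entry numbered 8 is registered as (5, "b"), that is, "bab", immediately before code 8 is outputted, which is the situation that gives rise to the exceptional process on the decoding side.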
A decoding process of the LZW code will now be described with reference to FIG. 6. The decoding process executes the operation opposite to the encoding. To simplify the explanation, in a manner similar to the encoding process in FIG. 3, the decompression of data comprising a combination of the three characters "a", "b", and "c" will be explained as an example. The first input code is 1. Since the characters "a", "b", and "c" have already been registered under the reference numerals 1, 2, and 3 in the dictionary as shown in FIG. 4, the first code is replaced, with reference to the dictionary, by the character "a" whose reference numeral coincides with the code 1, and that character is outputted. The next code 2 is likewise replaced by the character "b" and is outputted. At this time, the new reference numeral 4 is assigned to "1b", obtained by combining the code 1 processed at the preceding time with the first character "b" of the character string decoded at this time, and the resultant character string is registered into the dictionary. The third code 4 is converted from "1b" to "ab" by the retrieval of the dictionary, so that the character string "ab" is outputted. At the same time, the new reference numeral 5 is assigned to the character string "2a" (= "ba"), obtained by combining the code 2 processed at the preceding time with the first character "a" of the character string decoded at this time, and the resultant character string is registered into the dictionary. The processes are repeated in a similar manner. In the decoding of FIG. 6, an exceptional process is needed when the sixth input code 8 is decoded: the code 8 is not yet defined in the dictionary at the time of decoding and cannot be decoded directly.
In this case, the character string "5b" is formed by appending the first character "b" of the character string "ba" decoded at the preceding time to the code 5 processed at the preceding time, and it is expanded to "2ab" and then to "bab" and is outputted. After the character string is outputted, the reference numeral 8 is assigned to the character string "5b", obtained by appending the first character "b" of the character string decoded at this time to the code 5 of the preceding time, and the resultant character string is registered into the dictionary. The exceptional process is performed through steps S4 and S9 of the decoding process in FIG. 7, which will be explained hereinbelow. Finally, in step S7, the character string is outputted and the new character string is given a reference numeral and registered into the dictionary.
The flowchart of FIG. 7 shows a decoding algorithm of the LZW code. First, in step S1, in a manner similar to the encoding, the character strings each consisting of one character are registered in advance as initial values into the dictionary for all characters, and the decoding is then started. In step S2, the first code (reference numeral) is read and the present input code "CODE" is set to "OLDcode". Since the first code corresponds to one of the single-character reference numerals already registered in the dictionary, the character "code (K)" which coincides with the input code "CODE" is found and the character K is outputted. The outputted character K is also set to "char" for use in the subsequent exceptional process. Step S3 follows, and the next code "CODE" is read and is set as "NEWcode". Step S4 follows, and a check is made to see whether the code "CODE" inputted in step S3 has been defined (registered) in the dictionary. Since the inputted code word has generally been registered in the dictionary by the processes up to the preceding time, step S5 follows and the character string "code (ωK)" corresponding to the code "CODE" is read out from the dictionary. In step S6, the character K is temporarily pushed onto a stack, the reference numeral "code (ω)" is set as the new "CODE", and the processing routine returns to step S5. The procedures in steps S5 and S6 are repeated recursively until the reference numeral ω denotes a single character. Finally, step S7 follows, and the characters stacked in step S6 are popped in LIFO (Last-In First-Out) order and are outputted. At the same time, in step S7, a new reference numeral is assigned to the character string (ω, K), obtained by combining the code ω used at the preceding time with the first character K of the character string decoded at this time, and the resultant character string is registered into the dictionary.
When the check in step S4 finds a code which is not registered (this occurs when the encoder refers to the reference numeral which it registered immediately before), "OLDcode" is assigned to "CODE" and "code(OLDcode, char)" is assigned to "NEWcode" in step S9. After that, the processing routine advances to step S5.
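The decoding procedure, including the exceptional process for a code which is not yet registered, can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the flowchart of FIG. 7 itself: the function name `lzw_decode` is hypothetical, and for brevity the dictionary stores each fully expanded character string rather than (ω, K) pairs, so the stack-based expansion of steps S5 to S7 is replaced by a direct lookup. The code sequence 1, 2, 4, 3, 5, 8 is an assumed input consistent with the codes mentioned in the text.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the text from LZW reference numerals (illustrative sketch)."""
    # Initial registration: single characters under reference numerals 1, 2, 3, ...
    strings = {ref: ch for ref, ch in enumerate(alphabet, start=1)}
    next_ref = len(strings) + 1
    if not codes:
        return ""

    old = codes[0]                      # first code is always a single character
    out = [strings[old]]
    for code in codes[1:]:
        if code in strings:             # normal case: the code is already registered
            entry = strings[code]
        elif code == next_ref:          # exceptional case: the encoder used the entry
            # it had just registered, so the string is the previous string
            # followed by the previous string's own first character.
            entry = strings[old] + strings[old][0]
        else:
            raise ValueError(f"invalid code {code}")
        out.append(entry)
        # Register (previous string + first character of the current string).
        strings[next_ref] = strings[old] + entry[0]
        next_ref += 1
        old = code
    return "".join(out)


if __name__ == "__main__":
    print(lzw_decode([1, 2, 4, 3, 5, 8], "abc"))  # ababcbabab
```

In this run the sixth code 8 arrives before entry 8 exists; the exceptional branch forms "ba" + "b" = "bab", which is both outputted and registered under the reference numeral 8, matching the "5b" handling described above.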
However, in such conventional data compressing and decompressing processes, although one-byte and two-byte constructions are mixed in actual character codes, all data is regarded as characters of the same byte construction and is processed as such, so that there is a problem that an effective compression cannot be expected. FIG. 8A shows a conventional data compressing process in which data is compressed on a byte unit basis by a single-byte compressing unit 400. FIG. 8B shows a conventional decompressing process of compressed data, in which the data is likewise decompressed on a byte unit basis by a single-byte decompressing unit 402. Taking Japanese as a representative example, in the various codes which express Japanese, namely, character codes such as the JIS code and the shift JIS code, a character or a character string is expressed either by a plurality of bytes or in a form in which single-byte and multi-byte characters are mixed. In the compressing process, on the other hand, as shown in FIG. 8A, all data is processed as single-byte characters and character strings by the single-byte compressing unit 400, so that a character expressed by a single byte and the lower byte of a character expressed by a plurality of bytes are regarded as the same character. Consequently, when characters consisting of a plurality of bytes are compressed on a byte unit basis, meaningless data strings are registered into the dictionary and encoded, and an effective compression cannot be expected.
FIG. 9 shows the LZW encoding of the JIS Kanji code. Since data is fetched into the dictionary without distinguishing the upper byte from the lower byte, meaningless character strings, such as the combination of the lower byte of one character and the upper byte of the adjacent character, are also registered, and a compressing effect cannot be expected. FIG. 10 shows the LZW encoding of the shift JIS Kanji code. Here as well, since data is fetched into the dictionary without distinguishing the upper byte from the lower byte, meaningless character strings are registered and a compressing effect cannot be expected.
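The byte-boundary problem can be illustrated with a short Python sketch. It is not part of the described apparatus: the function name `straddling_pairs` and the sample text are assumptions, and Python's standard `shift_jis` codec stands in for the shift JIS Kanji code. The sketch lists every adjacent byte pair which a byte-unit dictionary could register and marks those which combine the lower byte of one character with the upper byte of the next.

```python
def straddling_pairs(text, encoding="shift_jis"):
    """List adjacent byte pairs and flag those spanning two characters (sketch)."""
    data = text.encode(encoding)

    # Byte offsets at which one character ends and the next begins.
    boundaries = set()
    pos = 0
    for ch in text:
        pos += len(ch.encode(encoding))
        boundaries.add(pos)

    pairs = []
    for i in range(len(data) - 1):
        # The pair data[i:i+2] straddles two characters when a character
        # boundary falls between its first and second byte.
        straddles = (i + 1) in boundaries
        pairs.append((data[i:i + 2].hex(), straddles))
    return pairs


if __name__ == "__main__":
    # Three two-byte Kanji characters give six bytes and five adjacent pairs.
    for pair, straddles in straddling_pairs("\u65e5\u672c\u8a9e"):
        print(pair, "straddles two characters" if straddles else "one character")
```

For three two-byte characters, two of the five adjacent byte pairs straddle a character boundary. A byte-unit dictionary treats such pairs exactly like pairs which form a real character, which is the source of the meaningless entries described above.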
A similar problem exists for character codes other than Japanese. For example, in Unicode, which has been proposed through international standardization as a character code which treats various languages in an integrated manner, one character is constructed by two bytes (or 4 bytes), so that a similar problem occurs in the conventional compressing process in which data is compressed on a byte unit basis. In particular, even when the same kind of character is used, the way in which characters are connected differs from language to language. Hitherto, however, character strings have been registered without considering such differences between languages, so that a compressing effect cannot be expected.