1. Field of the Invention
The present invention relates to a data compression/decompression technology based on a code table in which codes of one or more bit sizes having a specific meaning, such as document data, CAD (computer aided design) data, program codes, etc., are described.
2. Description of the Related Art
In recent years, as computers have come to handle a variety of data, such as document data and CAD data, the amount of data to be handled has increased. When such a large amount of data is handled, the required storage capacity can be reduced and high-speed transmission to a distant destination can be realized by removing redundancy from the data and compressing it.
For example, one compression method targeting document data converts an input character string to a shorter word code using a dictionary of words and corresponding word codes. According to this method, words and corresponding word codes must be prepared in advance. However, since the number of words is generally large and special words, such as proper nouns, are also included, word codes cannot be assigned in advance to all the words of the input data. Under these circumstances, broadly speaking, the following two methods have been proposed to handle words to which word codes cannot be assigned in advance.
According to the first method, output codes are assigned to all characters and idle codes are assigned to words. For example, in a Japanese code such as JIS (Japanese Industrial Standards) code, only some of the available two-byte codes are used for characters such as kana and kanji, so the remaining idle codes can be assigned to words.
FIG. 1A shows character code areas in the code space of such a two-byte code. This code space corresponds to a two-dimensional space whose first coordinate represents numbers 0x00 to 0xFF in the hexadecimal notation indicated by the higher-order byte of a two-byte code and whose second coordinate represents numbers 0x00 to 0xFF indicated by the lower-order byte. In this example, an area in which the higher-order byte and lower-order byte both are 0x21 to 0x7E is used for character codes, and idle codes in other areas are used as word codes for words.
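The division of the code space in this example can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the bounds 0x21 to 0x7E are taken from the example above rather than from any particular code standard.

```python
# Bounds of the character area in the FIG. 1A example (both bytes 0x21 to 0x7E).
CHAR_LO, CHAR_HI = 0x21, 0x7E


def is_character_code(code: int) -> bool:
    """Return True if both bytes of a two-byte code lie in the character area."""
    high, low = (code >> 8) & 0xFF, code & 0xFF
    return CHAR_LO <= high <= CHAR_HI and CHAR_LO <= low <= CHAR_HI


def count_idle_codes() -> int:
    """Count the two-byte codes outside the character area (usable as word codes)."""
    return sum(1 for code in range(0x10000) if not is_character_code(code))


if __name__ == "__main__":
    print(is_character_code(0x2121))  # inside the character area
    print(is_character_code(0x8260))  # outside, so available as a word code
    print(count_idle_codes())         # 65536 - 94 * 94 = 56700
```

Under these bounds the character area holds 94 x 94 = 8836 codes, so 56700 idle codes remain for words. The number of idle codes shrinks as the character area grows.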
According to the second method, a switching code is inserted between unconverted codes and converted word codes in the compression result, so that codes copied unchanged from the input data and word codes can be distinguished from each other. According to this method, the values of unconverted original codes can overlap with those of word codes, and whether the next code is a word code or an original code can be judged by detecting the switching code inserted in the compression result.
FIG. 1B shows a case in which the code space of the above-described two-byte code is used for a word code. In this example, all codes except “0xFFFF” are used as word codes, and “0xFFFF” is used as a switching code. This switching code is inserted in the compression result, for example, as shown in FIG. 1C.
Out of the codes of an input character string xe2x80x9cxe2x80x9d shown in FIG. 1C, xe2x80x9c0x88b38f6bxe2x80x9d corresponding to xe2x80x9cxe2x80x9d is converted to a word code xe2x80x9c0x8260xe2x80x9d, xe2x80x9c0x82b782e9xe2x80x9d corresponding to xe2x80x9cxe2x80x9d is converted to a word code xe2x80x9c0x0011xe2x80x9d, and xe2x80x9c0x8366815b835exe2x80x9d corresponding to xe2x80x9cxe2x80x9d is converted to a word code xe2x80x9c0x8261xe2x80x9d. Then, xe2x80x9c0x826282608263xe2x80x9d corresponding to xe2x80x9cCADxe2x80x9d is left unconverted, and a switching code xe2x80x9c0xFFFFxe2x80x9d is inserted after and before the code.
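The switching-code method of this example can be sketched as follows. The sketch is illustrative only: the function name, the greedy longest-match lookup, and the convention that the output starts in word-code mode are assumptions, and the order of the input codes is inferred from the positions of “0xFFFF” described for FIG. 1C. It also assumes that “0xFFFF” never occurs as an original code.

```python
SWITCH = 0xFFFF  # switching code from the FIG. 1B example

# Word dictionary from the FIG. 1C example: tuple of two-byte codes -> word code.
WORDS = {
    (0x88B3, 0x8F6B): 0x8260,
    (0x82B7, 0x82E9): 0x0011,
    (0x8366, 0x815B, 0x835E): 0x8261,
}
MAX_WORD_LEN = max(len(key) for key in WORDS)


def compress_with_switch(codes):
    """Replace registered words with word codes; bracket raw runs with SWITCH."""
    out = []
    raw_mode = False  # assumed initial state: word codes follow
    i = 0
    while i < len(codes):
        # Greedy longest match against the dictionary (an assumed strategy).
        match = None
        for n in range(min(MAX_WORD_LEN, len(codes) - i), 0, -1):
            key = tuple(codes[i:i + n])
            if key in WORDS:
                match = key
                break
        # Emit a switching code whenever the kind of code changes.
        if (match is None) != raw_mode:
            out.append(SWITCH)
            raw_mode = not raw_mode
        if match is None:
            out.append(codes[i])  # original code left unconverted
            i += 1
        else:
            out.append(WORDS[match])
            i += len(match)
    return out


if __name__ == "__main__":
    text = [0x88B3, 0x8F6B, 0x82B7, 0x82E9,
            0x8262, 0x8260, 0x8263,
            0x8366, 0x815B, 0x835E]
    print([hex(c) for c in compress_with_switch(text)])
```

For this input the sketch yields 0x8260, 0x0011, 0xFFFF, 0x8262, 0x8260, 0x8263, 0xFFFF, 0x8261. Note that the value 0x8260 appears twice with different meanings, once as a word code and once as an original code.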
However, the conventional data compression method described above has the following problems.
According to the method in which all characters are registered in advance, if the number of characters to be registered is large, the number of words that can be registered is restricted, and only a few words can be replaced with word codes. Therefore, a high compression rate cannot be obtained. For example, if Unicode covering all major characters in the world is used, a substantial part of the code space shown in FIG. 1A is assigned to characters, and only a few idle codes can be used for words. The same problem occurs when a user registers an external character.
On the other hand, according to the method in which a switching code is inserted, if a switching code appears when compression data are decompressed, codes following the switching code are regarded as codes of the other kind. For example, if “0xFFFF” appears following a word code “0x0011” in the compression data shown in FIG. 1C, the subsequent codes are recognized as uncompressed original codes. In this case, if “0xFFFF” appears following a code “0x8263”, the subsequent codes are recognized as word codes again.
Since the meaning of the codes before and after a switching code thus depends on the position of the switching code, compression data must always be decompressed from the beginning and cannot be decompressed starting from an intermediate position.
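This dependence on position can be illustrated with a short sketch of the corresponding decompression. The function name and the `raw_mode` parameter are hypothetical, and the dictionary and compression data are those of the FIG. 1C example.

```python
SWITCH = 0xFFFF  # switching code from the FIG. 1B example

# Reverse dictionary for the FIG. 1C example: word code -> original codes.
CODE_TO_WORD = {
    0x8260: (0x88B3, 0x8F6B),
    0x0011: (0x82B7, 0x82E9),
    0x8261: (0x8366, 0x815B, 0x835E),
}


def decompress_with_switch(stream, raw_mode=False):
    """Decode a stream; raw_mode is the state the decoder believes it is in."""
    out = []
    for code in stream:
        if code == SWITCH:
            raw_mode = not raw_mode  # the meaning of later codes flips
        elif raw_mode:
            out.append(code)  # original code, copied as is
        else:
            out.extend(CODE_TO_WORD[code])  # word code, expanded
    return out


compressed = [0x8260, 0x0011, 0xFFFF, 0x8262, 0x8260, 0x8263, 0xFFFF, 0x8261]

# Decoding from the beginning tracks the state correctly.
full = decompress_with_switch(compressed)

# The code at index 4 is an original code, but a decoder that starts there
# cannot know this without having seen the earlier switching code.
as_word = decompress_with_switch(compressed[4:5])
as_raw = decompress_with_switch(compressed[4:5], raw_mode=True)
```

Here `as_word` is the two-code string for the registered word, while `as_raw` is the single original code 0x8260. Only the second is correct, and nothing in the code at index 4 itself tells the decoder which interpretation to choose.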
An objective of the present invention is to provide a data compression/decompression apparatus, and a method thereof, for compressing data represented by predetermined codes at a high compression rate and decompressing the compression data from an arbitrary position of the compression data.
In the first aspect of the present invention, the data compression apparatus comprises a code input unit, a dictionary unit, a registration code output unit and a coding unit, and compresses data including codes of one or more sizes.
The code input unit inputs data in units of codes, and the dictionary unit stores a code string consisting of one or more codes and a registration code corresponding to the code string. If the input code string is stored in the dictionary unit, the registration code output unit outputs a registration code corresponding to the input code string. If the input code string is not stored in the dictionary unit, the coding unit generates a new code by adding an additional code to an input code in the input code string and outputs the new code.
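The behaviour of the units in the first aspect can be sketched as follows. This is one possible reading, not the claimed implementation: the additional code is assumed here to be a single flag bit placed above a 16-bit input code, output units are modelled as Python integers, and the dictionary contents and the greedy longest-match lookup are illustrative assumptions.

```python
ADDITIONAL = 1 << 16  # assumed additional code: a flag bit above a 16-bit input code

# Hypothetical dictionary unit: code string -> registration code.
DICTIONARY = {
    (0x88B3, 0x8F6B): 0x8260,
    (0x82B7, 0x82E9): 0x0011,
    (0x8366, 0x815B, 0x835E): 0x8261,
}
MAX_LEN = max(len(key) for key in DICTIONARY)


def compress(codes):
    """Output a registration code for stored code strings, else a flagged input code."""
    out = []
    i = 0
    while i < len(codes):
        for n in range(min(MAX_LEN, len(codes) - i), 0, -1):
            key = tuple(codes[i:i + n])
            if key in DICTIONARY:
                out.append(DICTIONARY[key])  # registration code output unit
                i += n
                break
        else:
            # Coding unit: the code string is not stored, so a new code is
            # generated by adding the additional code to the input code.
            out.append(ADDITIONAL | codes[i])
            i += 1
    return out


if __name__ == "__main__":
    text = [0x88B3, 0x8F6B, 0x82B7, 0x82E9,
            0x8262, 0x8260, 0x8263,
            0x8366, 0x815B, 0x835E]
    print([hex(u) for u in compress(text)])
```

Under these assumptions every output unit identifies its own kind, so no switching code is needed and the values of registration codes may overlap with those of input codes.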
In the second aspect of the present invention, the data compression apparatus comprises a code input unit, a dictionary unit, a registration code output unit and a coding unit, and compresses data including codes of one or more sizes.
The code input unit inputs data in units of codes, and the dictionary unit stores a code string consisting of one or more codes and a registration code corresponding to the code string. If the input code string is stored in the dictionary unit, the registration code output unit outputs a registration code corresponding to the input code string. If the input code string is not stored in the dictionary unit, the coding unit generates a new code by dividing an input code in the input code string and outputs the new code.
In the third aspect of the present invention, the data decompression apparatus comprises a dictionary unit, a unit input unit, an identification unit, a removal unit and a code string decompression unit, and decompresses compression data obtained by compressing original data including codes of one or more sizes to the original data.
The unit input unit inputs data in a specific unit, and the dictionary unit stores a code string consisting of one or more codes and a registration code corresponding to the code string. The identification unit judges whether a part of the input data is a predetermined additional code. If a part of the input data is the predetermined additional code, the removal unit generates new data by removing the additional code from the input data and outputs the generated data. If no part of the input data is the additional code, the code string decompression unit regards the input data as a registration code and outputs a code string corresponding to the input data.
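A minimal sketch of this decompression follows, under the assumption that the additional code is a single flag bit placed above a 16-bit code and that input units are modelled as Python integers. The names, the dictionary contents, and the `start` parameter are hypothetical.

```python
ADDITIONAL = 1 << 16  # assumed predetermined additional code (flag bit)

# Hypothetical dictionary unit: registration code -> code string.
REGISTERED = {
    0x8260: (0x88B3, 0x8F6B),
    0x0011: (0x82B7, 0x82E9),
    0x8261: (0x8366, 0x815B, 0x835E),
}


def decompress_unit(unit):
    """Decode one input unit without reference to any earlier unit."""
    if unit & ADDITIONAL:           # identification unit
        return [unit & 0xFFFF]      # removal unit: strip the additional code
    return list(REGISTERED[unit])   # code string decompression unit


def decompress(units, start=0):
    """Decode from an arbitrary unit position."""
    out = []
    for unit in units[start:]:
        out.extend(decompress_unit(unit))
    return out


units = [0x8260, 0x0011, 0x18262, 0x18260, 0x18263, 0x8261]
whole = decompress(units)
tail = decompress(units, start=3)  # begins in the middle of the unconverted codes
```

Because each unit is interpreted on its own, `tail` is exactly the end of `whole`; the decoder holds no state that depends on earlier units.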
In the fourth aspect of the present invention, a retrieval apparatus comprises an input unit, a compression unit, a retrieval unit and an output unit.
The input unit inputs a retrieval key, and the compression unit compresses the inputted retrieval key. The retrieval unit retrieves the compressed retrieval key in the compression data, and the output unit outputs a retrieval result.
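The retrieval flow can be sketched as follows. The sketch assumes a compression scheme in which each unit is self-identifying (here, an assumed flag bit added to unregistered 16-bit codes) and a simple linear search; all names and dictionary contents are hypothetical. It finds only occurrences whose compressed form matches the compressed key unit for unit, so a key that begins or ends inside a registered code string would not be found by this simple version.

```python
ADDITIONAL = 1 << 16  # assumed additional code (flag bit above a 16-bit code)

# Hypothetical dictionary: code string -> registration code.
DICTIONARY = {
    (0x88B3, 0x8F6B): 0x8260,
    (0x82B7, 0x82E9): 0x0011,
    (0x8366, 0x815B, 0x835E): 0x8261,
}
MAX_LEN = max(len(key) for key in DICTIONARY)


def compress(codes):
    """Compression unit: registration code if stored, else flagged input code."""
    out, i = [], 0
    while i < len(codes):
        for n in range(min(MAX_LEN, len(codes) - i), 0, -1):
            key = tuple(codes[i:i + n])
            if key in DICTIONARY:
                out.append(DICTIONARY[key])
                i += n
                break
        else:
            out.append(ADDITIONAL | codes[i])
            i += 1
    return out


def retrieve(key_codes, compressed):
    """Retrieval unit: compress the key, then search the compression data for it."""
    needle = compress(key_codes)
    n = len(needle)
    if n == 0:
        return []
    return [i for i in range(len(compressed) - n + 1)
            if compressed[i:i + n] == needle]


if __name__ == "__main__":
    data = compress([0x88B3, 0x8F6B, 0x82B7, 0x82E9,
                     0x8262, 0x8260, 0x8263,
                     0x8366, 0x815B, 0x835E])
    print(retrieve([0x8366, 0x815B, 0x835E], data))  # unit positions of matches
```

The search runs directly on the compression data, so the data need not be decompressed before retrieval.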