(1) Field of the Invention
The present invention relates to an apparatus and method for compressing electronic texts to reduce the data size thereof.
(2) Description of the Related Art
The introduction of electronic texts has advanced the so-called "paperless system" in both the private and public sectors. Electronic texts, also known as text files, are texts represented by an array of character codes. Electronic texts are advantageous in that they are easily copied or delivered, and occupy only a small storage area.
When an electronic text is transmitted to a distant correspondent via a communication network, or delivered by means of a recording medium, such as a floppy disk or CD-ROM, the data size of the electronic text is reduced by compressing the character codes, both to reduce the communication cost and to increase the number of texts that can be stored on one disk.
The electronic texts are generally compressed by lossless coding methods, such as LZW coding and Huffman coding. In the following, Huffman coding will be explained as an example.
In Huffman coding, the probabilities with which words appear in the texts are first determined. Next, bit strings are assigned to the words, an example of which is shown in FIG. 1: words with larger probabilities are assigned shorter strings of bits, whereas words with smaller probabilities are assigned longer strings of bits. The words in the electronic text are then converted into the strings of bits while referring to the compression table shown in FIG. 1. Assume that a sentence "XXX Electric will release a new product." consisting of six words, "XXX Electric", "will", "release", "a", "new", and "product", is converted into a string of bits. If a character code is 8 bits long, then, for example, the word "will", which is 32 bits long (8×4=32), is converted into the 4-bit string "1000" in Huffman coding, reducing the code size to one eighth (4/32=1/8).
Huffman coding is most effective when the probabilities are determined as accurately as possible and the short strings of bits are assigned to the words with large probabilities.
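The word-level Huffman coding described above can be illustrated with a minimal sketch. It is not the compression table of FIG. 1: the function names `huffman_table`, `encode`, and `decode` and the sample word list are assumptions introduced only for illustration, and the resulting bit strings depend on the sample frequencies.

```python
import heapq
from collections import Counter


def huffman_table(words):
    """Build a word -> bit-string table from the word frequencies."""
    counts = Counter(words)
    if len(counts) == 1:
        # A single distinct word still needs a one-bit code.
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {word: partial code}).
    heap = [(n, i, {w: ""}) for i, (w, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n0, _, t0 = heapq.heappop(heap)
        n1, _, t1 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in t0.items()}
        merged.update({w: "1" + c for w, c in t1.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]


def encode(words, table):
    """Concatenate the bit strings assigned to each word."""
    return "".join(table[w] for w in words)


def decode(bits, table):
    """Recover the words; works because Huffman codes are prefix-free."""
    inverse = {code: word for word, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical sample in which "will" is the most frequent word.
sample = ["will"] * 5 + ["release"] * 2 + ["a"] * 2 + ["new", "product"]
table = huffman_table(sample)
bits = encode(sample, table)
```

In this sketch the most frequent word receives the shortest bit string, and decoding the bit string restores the original word sequence exactly, which is what makes the coding lossless.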
However, in practice, a writer tends to avoid repeating words or phrases and instead uses synonyms to vary the vocabulary; for example, the writer may use "an apology" for the first occurrence and "an excuse" for the second. This reduces the probability of each word, and the resulting Huffman coding is not satisfactorily efficient. In addition, the fact that verbs and auxiliary verbs have different forms makes it difficult to increase compression efficiency. For example, the irregular verb "be" has several forms, including "be", "is", "are", "was", "were", and "been". If the verb "be" invariably appeared as "is", the word would be compressed efficiently; however, the different forms divide the probability of the verb "be" among them, and thus make Huffman coding less effective.
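The effect of spreading one verb over several forms can be made concrete by comparing the entropy of two word lists, that is, the lower bound on the average number of bits per word that any such coding can reach. The following is a minimal sketch; the function name `entropy_bits` and both word lists are hypothetical and chosen only to keep the arithmetic simple.

```python
import math
from collections import Counter


def entropy_bits(words):
    """Shannon entropy of the word distribution, in bits per word."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


# Hypothetical text in which the verb always appears as "is".
uniform = ["is", "is", "is", "is", "the", "the", "the", "the"]
# Same length, but the verb appears in four different forms.
varied = ["is", "are", "was", "were", "the", "the", "the", "the"]

h_uniform = entropy_bits(uniform)  # two words at probability 1/2 each
h_varied = entropy_bits(varied)    # four forms at 1/8 each, "the" at 1/2
```

With these lists the bound rises from 1 bit per word to 2 bits per word, although the number of words is unchanged.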
There are other coding methods known as "lossy coding". An example of lossy coding is disclosed in Japanese Laid-open Patent Application No. 4-156663. In this lossy coding, each sentence in an electronic text is partitioned into words by morphological analysis with reference to a dictionary. Then, semantic information on each word is retrieved from the dictionary, and the importance of each word is determined based on that information. Words whose determined importance is below a predetermined level are deleted to reduce the amount of data. However, because the reduction is limited to the data amount of the deleted words, the compression efficiency is not satisfactory either. Besides, the data restored by compression and subsequent expansion are not true to the original data because some words are omitted.
Huffman coding may be combined with the lossy coding to enhance the compression efficiency: a text is coded after less important words are deleted. However, as in the case of lossy coding alone, the restored text omits some words.
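The deletion of less important words can be sketched as follows. This is only an illustration under stated assumptions and not the method of the cited application: the morphological analysis is replaced by whitespace splitting, and the function name `lossy_compress`, the importance scores, and the threshold are all hypothetical.

```python
def lossy_compress(sentence, importance, threshold):
    """Drop every word whose importance score is below the threshold.

    Assumption: words are separated by whitespace, standing in for the
    dictionary-based partitioning. Words missing from the hypothetical
    importance table are treated as important and kept.
    """
    words = sentence.split()
    kept = [w for w in words if importance.get(w.lower(), 1.0) >= threshold]
    return " ".join(kept)


# Hypothetical importance scores between 0 and 1.
importance = {"a": 0.1, "will": 0.2, "new": 0.6}

original = "XXX Electric will release a new product"
reduced = lossy_compress(original, importance, threshold=0.5)
```

Here the words "will" and "a" are deleted, so the saving equals the size of those two words only, and the original sentence cannot be recovered from the reduced one.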