1. Field of the Invention
The present invention relates to an information retrieval apparatus and method that also perform compression and encryption.
2. Description of the Related Art
Content for a dictionary or the like is conventionally configured as a single file described in 16-bit code character data conforming to Japanese Industrial Standard JIS X4081, and is compressed by a compression technique such as that of International Patent Application No. WO 99-21092.
The compression technique recited in the pamphlet of International Patent Application No. WO 99-21092 enables content to be retrieved and displayed while it remains in compressed form. The content is divided into 2 KB blocks, each block is compressed using a conventional Huffman tree, and the content is decompressed block by block when retrieved. The technique is based on Huffman compression, which offers the fastest decompression compared to other compression schemes. To reduce the size of the Huffman tree, 16-bit code characters of low occurrence frequency are each substituted with two 8-bit code characters and then compressed. The compression technique can compress not only characters but also binary data such as address pointers.
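By way of illustration only, the block-wise scheme described above may be sketched as follows. This is a minimal sketch and not the implementation of WO 99-21092: the function names, the `min_count` threshold for a "low occurrence frequency", the use of UTF-16-BE as a stand-in for 16-bit JIS codes, and the representation of compressed bits as a text string are all assumptions made for readability.

```python
import heapq
from collections import Counter
from itertools import count

BLOCK_SIZE = 2048  # 2 KB blocks, as described in the text


def split_blocks(data, size=BLOCK_SIZE):
    """Divide the content into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def to_symbols(block, common):
    """Keep frequent 16-bit codes whole; split rare ones into two 8-bit symbols."""
    symbols = []
    for i in range(0, len(block), 2):
        pair = block[i:i + 2]
        if len(pair) == 2 and pair in common:
            symbols.append(pair)
        else:
            symbols.extend(bytes([b]) for b in pair)
    return symbols


def build_codes(freq):
    """Build a Huffman code table (symbol -> bit string) from symbol counts."""
    tie = count()  # tie-breaker so the heap never compares symbols directly
    heap = [(n, next(tie), sym) for sym, n in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tie), (a, b)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a bytes symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def compress(data, min_count=2):
    """Return (code table, list of per-block bit strings)."""
    pairs = Counter(data[i:i + 2] for i in range(0, len(data) - 1, 2))
    common = {p for p, n in pairs.items() if n >= min_count}  # assumed threshold
    blocks = [to_symbols(b, common) for b in split_blocks(data)]
    codes = build_codes(Counter(s for blk in blocks for s in blk))
    return codes, ["".join(codes[s] for s in blk) for blk in blocks]


def decompress_block(codes, bits):
    """Decode one block independently of the others."""
    decode = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in decode:
            out.append(decode[current])
            current = ""
    return b"".join(out)
```

Because every block is encoded with the same code table and starts at a fresh bit string, a single block can be decoded with `decompress_block` without touching its neighbors, which is the property that allows retrieval and display from the compressed form.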
In recent years, however, content in Hypertext Markup Language (HTML) format has rapidly increased with the spread of the Internet. Content in HTML format employs Shift JIS code and the like, mixes character data in 8-bit, 16-bit, and 32-bit codes, and is configured not as a single file but as a plurality of files. Index search technology has also been established in recent years to improve retrieval efficiency.
However, even with this index search technology, there is a growing need to store and use large-volume retrieved content on a portable terminal, such as a mobile phone or a personal digital assistant (PDA), without decompression. There is also a growing need to encrypt retrieved content to prevent falsification and/or to protect data copyrights, as well as a growing need for faster full text retrieval.
With full text retrieval, a retrieval process that compares input keywords with the character strings of each text file is executed. The process involves opening and closing individual text files, a time-consuming task that slows the process. To overcome this obstacle, a system has been proposed that aims to speed up full text retrieval by adding full text retrieval indices. Another system has been proposed that aims to speed up full text retrieval by linking files to reduce the number of files. An engine or the like that employs automaton theory and is capable of fast comparisons is used for comparing keywords and character strings.
Conventional techniques of Japanese Patent Nos. 2817103 and 2835335 disclose methods that achieve faster full text retrieval by extracting and then utilizing the occurrence frequency and addresses of each character in the retrieval target data.
A conventional technique of Japanese Patent Application Laid-Open Publication No. 4-274557 discloses a full text retrieval method that extracts characters from a text of a document to be registered for each character type and that creates a character occurrence bitmap of one-bit information indicating characters occurring in the text. During retrieval, the full text retrieval method refers to the character occurrence bitmap, selects only documents that include all of the characters in a character string targeted for retrieval, and from the selected documents, extracts only documents that include the retrieval character string in their texts.
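The two-stage retrieval described above (narrowing by character occurrence bitmap, then verifying the character string) may be illustrated by the following minimal sketch. It is not taken from Publication No. 4-274557: the function names are hypothetical, one bit is assigned per Unicode code point rather than per character type, and a Python integer is assumed as the bitmap container.

```python
def build_bitmap(text):
    """Set one bit for each distinct character that occurs in the text."""
    bits = 0
    for ch in set(text):
        bits |= 1 << ord(ch)
    return bits


def candidates(bitmaps, query):
    """Stage 1: select documents whose bitmap contains every query character."""
    need = build_bitmap(query)
    return [name for name, bm in bitmaps.items() if bm & need == need]


def search(docs, bitmaps, query):
    """Stage 2: keep only candidates that actually contain the query string."""
    return [name for name in candidates(bitmaps, query) if query in docs[name]]
```

The bitmap test only proves that all characters of the query occur somewhere in a document, so it can pass documents that do not contain the string itself. The second stage is therefore still required, but it runs on far fewer documents.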
However, with the increased prevalence of HTML format and the increasing number of files, such as with content for dictionaries, a problem of decreased compression efficiency of the content has arisen. For example, the conventional technique recited in International Patent Application No. WO 99-21092 does not support the compression of files that include mixed character data of 8-bit, 16-bit, and 32-bit codes, such as Shift JIS code, resulting in a problem of reduced compression efficiency for content that is targeted for retrieval and includes Shift JIS code.
Since each file of content is formed of blocks having a fixed, even number of bytes when compressed, there is a problem in that a character string may be split at a block boundary, which reduces the retrieval precision of the compressed content. There is also a problem in that, because the conventional technique sums the occurrence frequency only for 16-bit character data, it cannot sum the occurrence frequency of each character if 8-bit character data is included, and thus the compression rate is drastically reduced.
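The block-boundary problem can be illustrated with a small sketch. The helper names and the 4-byte block size below are hypothetical and chosen only to keep the example short; uncompressed bytes are used in place of compressed blocks.

```python
def split_fixed(data, size):
    """Divide data into fixed-size blocks without regard to string boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def found_in_any_block(blocks, keyword):
    """Search each block independently, as block-wise retrieval would."""
    return any(keyword in block for block in blocks)
```

With `data = b"abcdefgh"` and a block size of 4, the keyword `b"defg"` occurs in the data but straddles the two blocks `b"abcd"` and `b"efgh"`, so a search that examines each block in isolation does not find it.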
There is also a problem in that the compression rate is reduced if a file includes a large volume of phonogram data such as alphanumeric characters and/or Japanese kana/katakana characters (hereinafter, “kana characters” and “katakana characters”).
The need for encryption to prevent falsification and to protect content data copyrights has been rising. However, encryption gives rise to a problem of reduced retrieval speed due to the time required for full text retrieval of each file of the content targeted for retrieval and for the decrypting process for display.
Although a conventional technique has aimed to increase full text retrieval speed by adding full text retrieval indices, there is a problem in that the total volume remains large even when the content is highly compressed, because the volume of the full text retrieval indices is itself large. With the method that aims to increase speed by linking files, a problem arises in that the file size becomes enormous, necessitating increased memory for retrieval.
Thus, there is a problem in that it is difficult to store the content in a portable terminal, such as a mobile phone or a PDA, or in a standalone personal computer.
There is also a problem in that full text retrieval speed is reduced if each file of the content targeted for retrieval includes phonogram data such as alphanumeric characters and/or kana/katakana characters.
When a character string is verified against uncompressed data, the comparison is made one byte or one character at a time. For compressed data, however, the boundaries between characters are difficult to identify, so the comparison must be repeated while shifting one bit at a time. Bit-by-bit operation on a compressed file is costly for a computer, and thus there is a problem in that verification speed is reduced, as in the conventional technique of Japanese Patent Application Laid-Open Publication No. 4-274557.
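The cost of bit-by-bit verification may be seen in the following minimal sketch, which tests a compressed bit pattern at every bit offset of the data. The function name and the representation of the pattern as an integer plus a bit length are assumptions for illustration. The sketch also does not check whether a hit begins at a genuine code boundary, which is precisely the difficulty noted above.

```python
def find_bit_pattern(data, pattern, pattern_bits):
    """Return every bit offset at which the pattern occurs in the data."""
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    mask = (1 << pattern_bits) - 1
    hits = []
    # One shift-and-compare per bit position, not per byte or character.
    for offset in range(total_bits - pattern_bits + 1):
        shift = total_bits - pattern_bits - offset
        if (value >> shift) & mask == pattern:
            hits.append(offset)
    return hits
```

For n bytes of data the loop runs roughly 8n times, whereas a byte-wise comparison of uncompressed data needs about n steps, and matches may begin or end in the middle of a byte.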
For data with a consistent population, such as data added to a large number of files and electronic forms (for example, monthly aggregation tables), the names and addresses of customers and the names of items that appear do not change, and the parameters of Huffman compression are substantially similar. Since electronic forms involve large processing volumes, faster compression and encryption are desired.