1. Field of the Invention
This invention relates to an information registration, storage, and retrieval system and more particularly to a document information compression and retrieval system appropriate for application to text data such as Japanese and English document texts and program languages.
2. Description of Related Art
In recent years, data base services providing document information, patent information, and the like have become widespread, and the information processing fields in which text data is handled have grown in scale and become increasingly general. This trend brings an explosive increase in the document information handled by general-purpose small office automation devices as well as by large-scale computer systems. Whether the aim is to register more document information in a limited storage capacity or to register, retrieve, and read documents at high speed on low-speed data media, registering text data on storage media in compressed form is an effective means for information processing.
Hitherto, text data has been described by assigning one code to each character. With this conventional technique, however, even if the same word (character data string) is input many times in text data such as Japanese and English document texts or program languages, every occurrence of the word is broken down into the character codes making up the word (character data string) and registered on storage media as such. The text data is therefore redundant and requires a large storage capacity.
A conventional system for solving this problem is described in Japanese Patent Laid-Open No. 62-140136. In that system, if it is known in advance that the same word (character data string) will be input many times, one compressed code is assigned to the word (character data string), and the word is converted to that code before being stored on storage media, thereby reducing the necessary storage capacity.
The prior art described in Japanese Patent Laid-Open No. 62-140136 makes it possible to register document text data on storage media in compressed form and to reduce efficiently the capacity required to store the text data. However, this prior art is effective only when the contents of the document to be input are known in advance, and only for text data in which the same predetermined words (character data strings) are input many times. Therefore, if unknown text data is input, the system compresses the text data only where a word (character data string) to which a compressed code happens to have been assigned occurs. Further, even if newly input unknown text data contains words (character data strings) that occur repeatedly, the system cannot provide effective compression.
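The behavior of such a predetermined-word scheme can be illustrated with a minimal sketch. It is not the implementation of the cited publication; the word list, the use of Unicode private use characters as compressed codes, and the function names are all assumptions made for illustration only.

```python
# Hypothetical predetermined word list, fixed before any text is input.
PREDEFINED_WORDS = ["information", "retrieval", "document"]

# Assumption: each word gets one code from the Unicode private use area,
# so a code cannot collide with ordinary characters in this sketch.
CODE_OF = {word: chr(0xE000 + i) for i, word in enumerate(PREDEFINED_WORDS)}
WORD_OF = {code: word for word, code in CODE_OF.items()}


def compress(text):
    """Replace every predetermined word with its single compressed code."""
    # Longer words are replaced first so a short word never splits a long one.
    for word in sorted(CODE_OF, key=len, reverse=True):
        text = text.replace(word, CODE_OF[word])
    return text


def decompress(text):
    """Expand each compressed code back into the word it stands for."""
    return "".join(WORD_OF.get(ch, ch) for ch in text)


# Text built from the predetermined words shrinks.
known = "document retrieval and information retrieval"
# Text whose repeated word is not in the list is left unchanged.
unknown = "compression of compression tables for compression"

known_packed = compress(known)
unknown_packed = compress(unknown)
```

In this sketch the word "compression" occurs three times in the second text, yet nothing is saved, because no compressed code was assigned to it beforehand.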
As an information retrieval method for data bases, attention in the art has focused on the full text search system, which enables retrieval by direct reference to the text of a document, such as document information or patent information, instead of conventional retrieval systems using keywords and sort codes.
The full text search system, as the name implies, handles the document texts themselves as retrieval information. It thereby thoroughly eliminates the drawbacks inherent in retrieval using an index such as keywords or sort codes, namely the enormous labor involved in index registration and the retrieval errors or oversights that arise when the person who registers the index differs from the person who retrieves a document.
However, the full text search system introduces some problems that do not arise in index retrieval systems. The greatest of these is retrieval time. Because the full text search system searches the document texts themselves, it has not been practical for retrieving the data base service information and the like handled so far. For example, a full text search of 20000 documents each having a size of 20 KB requires a search of 400 MB of data. If the data is read at an average speed of 1 MB/s and collated at the same speed, about seven minutes are required to complete the retrieval.
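The estimate above can be checked with a short calculation. The sketch assumes that 1 MB is taken as 1000 KB and that reading and collation proceed concurrently at the same rate, so the elapsed time is governed by a single pass at 1 MB/s; the variable names are illustrative.

```python
# Figures taken from the example in the text.
num_documents = 20_000
document_size_kb = 20
throughput_mb_per_s = 1.0  # read and collation rate, assumed to overlap

# Assumption: 1 MB = 1000 KB.
total_mb = num_documents * document_size_kb / 1000

seconds = total_mb / throughput_mb_per_s
minutes = seconds / 60

print(f"{total_mb:.0f} MB -> {seconds:.0f} s -> {minutes:.1f} min")
```

Under these assumptions the search takes 400 seconds, which is slightly under seven minutes.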
A conventional system for solving this problem is described in Japanese Patent Laid-Open No. 03-174652. In that system, document text data is divided and registered on a plurality of magnetic disks, and the text data is fetched from the magnetic disks in parallel to speed up reading. Further, a table of the characters occurring in the text is created, together with a data file, called a compressed text, in which function words such as conjunctions and postpositional particles (postpositional words functioning as auxiliaries to main words), and words occurring repeatedly are eliminated. A presearch is made in two stages before the full text search, thereby achieving a retrieval speed adequate for practical information retrieval.
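The general idea of a two-stage presearch followed by a full text search can be sketched as follows. This is only one possible reading of the arrangement described above, not the method of the cited publication: the function word list, the record layout, the restriction to single-word queries, and all names are assumptions, and English text is used in place of Japanese.

```python
# Hypothetical function word list (stands in for conjunctions, particles, etc.).
FUNCTION_WORDS = {"the", "a", "of", "and", "in", "is", "to"}


def build_character_table(text):
    """Table of the characters occurring in the text."""
    return set(text.lower())


def build_condensed_text(text):
    """Text with function words and repeated occurrences of words removed."""
    seen, kept = set(), []
    for word in text.lower().split():
        if word not in FUNCTION_WORDS and word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


def register(texts):
    """Store each text together with its two presearch structures."""
    return [
        {
            "text": t,
            "chars": build_character_table(t),
            "condensed": build_condensed_text(t),
        }
        for t in texts
    ]


def search(records, query):
    """Single-word query: two presearch stages, then a full text search."""
    query = query.lower()
    # Stage 1: every character of the query must occur in the document.
    stage1 = [r for r in records if set(query) <= r["chars"]]
    # Stage 2: the query must occur in the condensed text.
    stage2 = [r for r in stage1 if query in r["condensed"]]
    # Final stage: full text search over the surviving documents only.
    return [r["text"] for r in stage2 if query in r["text"].lower()]


texts = [
    "The retrieval of the document is slow",
    "A table of codes and a table of words",
    "Compression of text in the store",
]
records = register(texts)
hits_table = search(records, "table")
hits_slow = search(records, "slow")
```

In this sketch the second text passes the character check for the query "slow" but is rejected at the second stage, so the final full text search runs on a single document.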
According to this prior art, retrieval processing of an enormous quantity of document text data can be completed within a practical time, thereby providing a very useful technique for implementation of a full text search system.
However, the prior art described in Japanese Patent Laid-Open No. 03-174652 relies on a two-stage presearch to improve the text data retrieval speed. To make the presearch at retrieval time, a compressed text and a character component table must be created from the text data in advance and saved together with the text data on document data save means such as a magnetic disk. This increases the stored document data by the size of the compressed text and the character component table in addition to the text data itself. Further, since the presearch is not fundamentally a text search, a text search is still necessary to produce the final retrieval result; in the worst case, when the presearch cannot narrow down the candidate documents, retrieval processing is performed again on all texts. The necessary processing time then becomes the sum of the presearch time and the retrieval time for all texts, which means that the retrieval time increases rather than decreases.
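The worst case described above can be expressed as a simple cost model. The sketch assumes that the cost of the final text search scales linearly with the fraction of documents surviving the presearch; the 60-second presearch time is a hypothetical figure, and the 400 seconds corresponds to the 400 MB, 1 MB/s example given earlier.

```python
def total_retrieval_seconds(presearch_s, full_search_s, surviving_fraction):
    """Presearch time plus a full text search over the surviving documents.

    Assumption: full text search cost is proportional to the fraction of
    documents that the presearch fails to exclude.
    """
    return presearch_s + full_search_s * surviving_fraction


FULL_SEARCH_S = 400.0  # full text search of all texts
PRESEARCH_S = 60.0     # hypothetical presearch cost

# Presearch narrows the candidates to a quarter of the documents.
narrowed = total_retrieval_seconds(PRESEARCH_S, FULL_SEARCH_S, 0.25)

# Worst case: the presearch excludes nothing.
worst_case = total_retrieval_seconds(PRESEARCH_S, FULL_SEARCH_S, 1.0)
```

With these illustrative numbers the worst case costs more than a plain full text search of all texts, because the presearch time is added without any offsetting reduction.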