The present invention relates to a document storage and retrieval system which files documents as images, and is particularly concerned with a document storage and retrieval system capable of full text searching.
Typical information retrieval systems have hitherto retrieved data chiefly by keyword and classification code. Bibliographic information and patent information have been processed into data bases by means of such systems. These data bases mainly cover bibliographic information, including abstracts, and therefore meet only a part of the true need of information retrieval. That is, even when a conceivably relevant document or patent is found, the text itself must still be sought among many bookshelves.
Meanwhile, optical disks capable of storing mass data have now become available, so that the text can be loaded into the data base to provide a so-called original document information service, thus meeting a social need. Paperless documentation at the Patent Office is planned accordingly. In these systems, volumes of documents are stored on optical disks in the form of image data, and a conventional information retrieval technique based mainly on keyword search is applied.
However, the conventional information retrieval technique can narrow the candidates only to the order of tens to hundreds of documents, and hence a further technique for reducing the relevant documents to about 1/10 of that number is desired. One method is to call the original document (text) stored as image data onto a terminal to be read visually by the searcher. The method is reliable in principle; however, candidate documents numbering up to several hundred are too many to read out in the form of image data, and reading them visually one by one is not efficient in practice.
On the other hand, the conventional method based on keywords and classification codes has an intrinsic problem: it must be updated continually, because the classification system itself changes as time passes. For example, the volumes of documents already classified cannot practically be reclassified when the classification system is modified later. Documents and patents recording the progress of science and technology are novel in content, and hence of value, because they present new concepts which often are not included in the existing classification system. For this reason, it is impossible to define beforehand the keywords and the classification system that represent such concepts, which is an essential problem for the information retrieval system.
For the reasons mentioned above, it is desirable to provide a method which retrieves contents by referring directly to the text of a document. With a method that refers to the text, retrieval can be performed using vocabulary for a concept which was not deemed important when the document was registered in the data base but is recognized as new at the time of retrieval. Alternatively, an important document can be found directly, without passing through the "filter" of an indexer (a specialist who assigns index terms) at the time of registration.
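The difference between the two approaches can be illustrated with a minimal sketch. It is not part of the described system: the `Document` record, the sample documents, and the functions `keyword_search` and `full_text_search` are hypothetical names introduced only for illustration, and the text is assumed to be already available as character codes.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Document:
    """Hypothetical record: text plus keywords assigned at registration."""
    doc_id: str
    text: str
    keywords: Set[str] = field(default_factory=set)


def keyword_search(docs: List[Document], term: str) -> List[str]:
    """Return ids of documents whose registered keywords include the term."""
    t = term.lower()
    return [d.doc_id for d in docs if t in {k.lower() for k in d.keywords}]


def full_text_search(docs: List[Document], term: str) -> List[str]:
    """Return ids of documents whose text contains the term."""
    t = term.lower()
    return [d.doc_id for d in docs if t in d.text.lower()]


# Illustrative data only; "optical disk" was not chosen as a keyword
# when document D1 was registered.
DOCS = [
    Document("D1",
             "Documents are stored on an optical disk as image data.",
             {"document filing", "image"}),
    Document("D2",
             "A keyword index is built for bibliographic data.",
             {"keyword", "index"}),
]

by_keyword = keyword_search(DOCS, "optical disk")
by_text = full_text_search(DOCS, "optical disk")
```

In this sketch the keyword search misses D1 because the term was not indexed at registration, while the search over the text finds it.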
To satisfy such a requirement, character patterns must be extracted from the document stored as image data and the text converted into character codes, for which a character recognition technique may be applied. However, for the documents to be filed, printed documents for example, character recognition is not perfect because of the diversity of print quality and fonts. In a conventional optical character reader, imperfect recognitions such as errors and rejections are checked and corrected by operators (for example, "Introduction to Character Recognition" by Hashimoto, Ohm-Sha, 1982, pp. 153-154). Accordingly, even if the recognition precision is extremely high, visually checking the result of recognizing the text is not realistic where the amount of documents is very large, and hence a document filing system with images as the main constituents which is available for text retrieval has not been realized until now.