1. Field of the Invention
The present invention relates to an information searching apparatus and method, an information searching program, and a storage medium storing the information searching program, and in particular to an information searching apparatus and method and an information searching program that search for a desired document among documents including multimedia information such as characters and images, and a storage medium storing the information searching program.
2. Description of the Related Art
Conventionally, there is known an information searching apparatus that uses a method called full-text search for searching for a desired document among a plurality of documents including multimedia information. In this apparatus, a desired search keyword or phrase, for example, is inputted as search information and documents including words or phrases that match the inputted search keyword or phrase are obtained from a stored group of documents.
To enable searches for information based on the contents of documents including document images, an apparatus constructed to perform character recognition on character image portions included in the document images and perform the full-text search based on character information obtained as a result of the character recognition has also been proposed.
However, a document including character codes obtained as the result of character recognition (hereinafter referred to as a “character recognition processed document”) may include misrecognized characters. Consequently, in the case where a full-text search is performed using the same method as for a text that has not been subjected to character recognition, there can be erroneous search hits, where there is a match for characters that differ from those in the original document, and an increase in the number of missed search hits.
For this reason, before a full-text search is carried out on a character recognition processed document, it is customary for a user to go through the document to be searched, look for misrecognized portions, and correct the misrecognized portions one by one.
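The two failure modes can be illustrated with a minimal sketch in Python. The function name `full_text_search`, the document contents, and the particular misrecognitions (the letter "l" read as the digit "1", and "m" read as "rn") are hypothetical and serve only as an illustration.

```python
def full_text_search(documents, query):
    """Return the ids of the documents whose text contains the query string."""
    return sorted(doc_id for doc_id, text in documents.items() if query in text)

# Hypothetical original texts and the corresponding character recognition results.
originals = {
    "doc1": "the monorail opened in spring",
    "doc2": "a modem is attached",
}
recognized = {
    "doc1": "the monorai1 opened in spring",  # "l" misrecognized as "1"
    "doc2": "a modern is attached",           # "m" misrecognized as "rn"
}

# Missed search hit: doc1 no longer contains the query string.
missed = full_text_search(recognized, "monorail")

# Erroneous search hit: doc2 now contains a word absent from the original.
erroneous = full_text_search(recognized, "modern")
```

In this sketch the same search routine is applied to both sets of texts, so any difference in the results is caused solely by the misrecognized characters.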
To dispense with such visual corrections, a method has been disclosed that selects a plurality of candidate characters, using a plurality of characters that are candidates for character recognition together with assumed values indicating the probability of each, and thereby reduces the number of missed search hits even for a character recognition processed document including erroneously recognized characters (Japanese Patent No. 2586372). That is, by carrying out a search that includes a plurality of character recognition candidate characters, it is possible to reduce the number of missed search hits.
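The general idea of searching over a plurality of candidate characters can be sketched as follows. This is only a loose illustration and is not the method of the cited patent; the data layout (a list of scored candidates per character position), the function names, and the `min_score` threshold are all assumptions made for this example.

```python
def matches_at(candidates, query, start, min_score):
    """True if each query character is an accepted candidate at its position."""
    return all(
        any(c == ch and score >= min_score for c, score in candidates[start + i])
        for i, ch in enumerate(query)
    )

def candidate_search(candidates, query, min_score=0.3):
    """Return the start positions at which the query matches the candidates."""
    last = len(candidates) - len(query)
    return [s for s in range(last + 1) if matches_at(candidates, query, s, min_score)]

# Hypothetical recognition result for "monorail": the candidates at each
# position are listed with an assumed probability value, best candidate first.
recognized = [
    [("m", 0.9)], [("o", 0.9)], [("n", 0.8), ("h", 0.2)], [("o", 0.9)],
    [("r", 0.9)], [("a", 0.9)], [("i", 0.9)], [("1", 0.6), ("l", 0.4)],
]

# The top candidates alone spell "monorai1", so a plain search would miss.
top_candidates = "".join(position[0][0] for position in recognized)

# Including the lower-ranked candidate "l" recovers the hit at position 0.
hits = candidate_search(recognized, "monorail")
```

Lowering `min_score` admits more candidates per position, which recovers more missed hits but also lets more unintended strings match.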
However, there is the risk of a decrease in search accuracy. For example, in a case where a character string that should be recognized as “” (“monorail”) has been misrecognized as “” as shown in FIG. 5, if a search is carried out for the character string “”, the misrecognized character string “” matches and is therefore given as an erroneous search hit.
Also, in the case of a character recognition processed document comprised of only character codes obtained by character recognition, favorable results cannot be expected even if the above method is used, since the method requires information on other candidate characters obtained during the character recognition process. The problems of erroneous search hits and an increased number of missed search hits therefore remain.
On the other hand, an information searching apparatus using a word index has also been proposed. Such an apparatus carries out morpheme analysis that looks not just at index information in character units but also collates or compares the characters with words that actually exist, and registers the extracted words as index information for document searching purposes. Compared to the information searching apparatus that searches in character units, this information searching apparatus that carries out a word search can avoid matches that extend over boundaries between words and the like, making it possible to improve the search accuracy. However, since in actuality it is not possible to record every word in a word dictionary, information searches carried out using such a word index are not able to search for words not present in the dictionary, and there can be missed search hits.
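The trade-off between a search in character units and a search using a word index can be sketched as follows. The greedy longest-match segmentation stands in for morpheme analysis and is a simplification; the dictionary, the unsegmented document texts, and all function names are hypothetical.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation; text not covered by the dictionary is skipped."""
    words, i = [], 0
    max_len = max(map(len, dictionary))
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                words.append(text[i:i + n])
                i += n
                break
        else:
            i += 1  # no dictionary word starts at this position
    return words

def build_word_index(documents, dictionary):
    """Map each extracted word to the set of documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in segment(text, dictionary):
            index.setdefault(word, set()).add(doc_id)
    return index

def word_search(index, word):
    return sorted(index.get(word, ()))

def char_search(documents, query):
    return sorted(d for d, text in documents.items() if query in text)

# Hypothetical dictionary and unsegmented texts (no spaces between words).
dictionary = {"black", "cat", "open", "door", "top"}
documents = {"doc1": "blackcatopendoor", "doc2": "monorailopen"}
index = build_word_index(documents, dictionary)

# "top" spans the boundary between "cat" and "open": the search in character
# units hits doc1, while the word index correctly gives no hit.
boundary_char = char_search(documents, "top")
boundary_word = word_search(index, "top")

# "monorail" is not in the dictionary, so it is never indexed and the
# word search misses doc2 even though the string is present.
unknown_char = char_search(documents, "monorail")
unknown_word = word_search(index, "monorail")
```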