1. Field of the Invention
The present invention relates to a search apparatus which uses an index table to search for a keyword in a character recognition result obtained by recognizing characters in an original document, a program for causing the search apparatus to execute search processing, and a recording medium on which the program is stored.
2. Description of the Related Art
Recently, as the use of the Internet has spread widely, search technology for finding necessary information in the huge amount of information existing on a network has attracted attention as an important technology. In particular, many systems for searching for a specific keyword in text data have already been provided. Such search systems are required to search a huge number of text documents accurately and at high speed.
A technology of searching for a specific keyword in text data using an index table, in order to perform a high-speed search operation, is known. The index table associates each index character string, which includes a prescribed number of characters (for example, two characters), with the position in the text data at which that index character string occurs.
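The structure of such an index table can be illustrated with a minimal sketch. The function name `build_index`, the use of a Python dictionary, and the Latin sample text are assumptions made for illustration only; character positions are counted from 1.

```python
from collections import defaultdict


def build_index(text, n=2):
    """Map each n-character index string to the 1-based positions where it starts.

    Hypothetical helper for illustration; n=2 corresponds to index
    character strings of two characters.
    """
    index = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append(i + 1)  # store 1-based character position
    return dict(index)


# Illustrative sample text (not taken from the original document).
sample_index = build_index("abcabd")
for index_string, positions in sorted(sample_index.items()):
    print(index_string, positions)
```

A lookup in the resulting table returns every position at which a given two-character string occurs, without scanning the text itself.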
When searching for a keyword in an assembly of character codes (a character recognition result) obtained by recognizing characters in an original document (e.g., a document in the form of paper), it is necessary to take errors in the recognition of the characters (erroneous recognition) into account. The reason is that, when an error occurs in the recognition of a character, the character represented by the character code can differ from the character actually written in the original document. “Erroneous recognition” refers to a case in which a character written in the original document is not correctly converted into a character code. Erroneous recognition is caused, for example, by a character printed on a sheet of paper being faint or inclined, or by the sheet of paper being stained.
For example, where an original document includes a character string “” at a certain position thereof and the character “” in the character string is erroneously recognized as “”, the character recognition result includes a character string “” at the position corresponding to the position of the character string “”. As a result, an index table which is prepared from the character recognition result has an index character string “” and the position thereof registered therein. Accordingly, a search operation for a keyword “” using this index table does not detect the keyword at that position in the character recognition result. Thus, a “search omission” occurs, i.e., a state in which the keyword cannot be detected at a certain position even though the keyword exists at that position in the original document.
According to one known technology for solving the problem of “search omission”, a plurality of candidate characters are prepared as a character recognition result for one character in an original document, and a plurality of character strings having a possibility of existing in the original document based on the plurality of candidate characters are registered in an index table. A search operation for a keyword is performed using this index table. Such a technology is disclosed in, for example, Japanese Laid-Open Publication No. 9-16619 entitled “Method and Device for Processing Information”.
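The registration of multiple candidate strings can be sketched as follows. This is only an illustration under stated assumptions, not the method of the cited publication: the function name `build_candidate_index`, the representation of the recognition result as a list of candidate lists, and the Latin sample characters are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_candidate_index(candidates):
    """Register every two-character string that adjacent candidates can form.

    `candidates[i]` is the list of candidate characters for the character at
    1-based position i + 1 (hypothetical representation of a recognition result).
    """
    index = defaultdict(set)
    for i in range(len(candidates) - 1):
        for first, second in product(candidates[i], candidates[i + 1]):
            index[first + second].add(i + 1)
    return {key: sorted(positions) for key, positions in index.items()}


# Illustrative input: the 2nd and 3rd characters each have two candidates.
recognized = [["c"], ["l", "1"], ["o", "0"], ["g"]]
candidate_index = build_candidate_index(recognized)
for index_string, positions in sorted(candidate_index.items()):
    print(index_string, positions)
```

Because both the correct and the erroneous candidates are registered at the same position, a keyword containing the correct character can still be found at that position.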
FIG. 11 shows an example of an index table 1901 which is prepared according to the above conventional method. The index table 1901 has a plurality of character strings having a possibility of existing in an original document registered as index character strings. In the example shown in FIG. 11, the index table 1901 is obtained as a result of the recognition of characters in an original document which includes a character string “”. In the index table 1901, an index character string “” and an index character string “” are both registered as existing at a character position “1” (row 1911 and row 1912) in the character recognition result.
Using the index table 1901 shown in FIG. 11, the keyword “” can be detected. Hereinafter, processing for searching for the keyword “” using the index table 1901 according to the conventional method will be described.
First, character strings each consisting of two adjacent characters included in the keyword are generated. From the keyword “”, five character strings “”, “”, “”, “” and “” are generated.
Then, these character strings are retrieved from the index table 1901. The character strings “”, “”, “”, “” and “” are respectively shown as existing at character positions “1”, “2”, “3”, “4” and “5” (rows 1912, 1919, 1915, 1914 and 1913).
From the positional relationship among these character positions, it is determined that the keyword “” is included in the character recognition result.
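The three steps above (generating two-character strings from the keyword, retrieving them from the index table, and checking the positional relationship) can be sketched as follows. The function name `search`, the dictionary form of the index table, and the Latin sample entries are assumptions for illustration; the table is written by hand so that the sketch is self-contained.

```python
def search(index, keyword, n=2):
    """Return the 1-based positions at which `keyword` is detected.

    `index` maps n-character index strings to lists of 1-based positions
    (hypothetical representation of the index table).
    """
    # Step 1: generate the strings of n adjacent characters in the keyword.
    grams = [keyword[i:i + n] for i in range(len(keyword) - n + 1)]
    if not grams:
        return []
    # Step 2: retrieve the positions of the first string from the index table.
    starts = set(index.get(grams[0], []))
    # Step 3: keep a start position only if every later string is registered
    # at the consecutive position that the keyword requires.
    for offset, gram in enumerate(grams[1:], start=1):
        positions = set(index.get(gram, []))
        starts = {p for p in starts if p + offset in positions}
    return sorted(starts)


# Hand-written illustrative table with two candidates for the second character.
table = {"ca": [1], "co": [1], "at": [2], "ot": [2]}
print(search(table, "cat"))
print(search(table, "cut"))
```

In this sketch a keyword is reported only when all of its two-character strings are registered at consecutive positions, which mirrors the positional check described above.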
By using an index table in which a plurality of character strings that may exist in the original document are registered as index character strings, such a conventional method solves the problem of search omission.
However, the above-described conventional method has a problem in that search noise is increased. “Search noise” refers to a keyword being detected despite the keyword not being included in the original document. For example, when the index table 1901 shown in FIG. 11 is used to search for the keywords “” and “”, these keywords are detected at a character position “3”. In order to determine whether the search result is correct or not, the user needs to compare the search result with the original document.
As the number of candidate characters which are prepared as a character recognition result for one character is increased in order to prevent the problem of search omission, such search noise occurs more often. As a result, the burden placed on the user to determine whether the search result is correct or not is increased.
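The growth of search noise with the number of candidate characters can be illustrated with a small sketch. It assumes an index table of two-character strings built from adjacent candidates, in which case a keyword as long as the recognized span is detected whenever each of its characters is among the candidates at the corresponding position. The function names and the Latin sample characters are hypothetical.

```python
from itertools import product


def detectable_strings(candidates):
    """All strings detectable over the whole span of candidate lists."""
    return {"".join(chars) for chars in product(*candidates)}


def noise_count(candidates, original):
    """Number of detectable strings that are not the original string."""
    return len(detectable_strings(candidates) - {original})


# One candidate per character: only the original string is detectable.
print(noise_count([["c"], ["a"], ["t"]], "cat"))
# Two candidates for two of the three characters.
print(noise_count([["c"], ["a", "o"], ["t", "f"]], "cat"))
```

Under this assumption the number of detectable strings is the product of the candidate counts per position, so every added candidate multiplies the number of strings that can be falsely detected.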