1. Field of the Invention
The present invention relates to a document search device for searching for a keyword based on a recognition result obtained by character recognition of a document image and a recording medium having a document search program stored thereon.
2. Description of the Related Art
In general, in order to accumulate a document in the form of paper in an electronic document data base, the document in the form of paper is read as image data, and character recognition of the data is performed to convert the data into a collection of electronic character codes (character recognition result). Thus, the document is accumulated in the document data base as the collection of character codes. In order to search for a keyword from the document data base, it is determined whether the keyword is included in the character recognition result. In the case of generally used character recognition, some of the characters written in the original document (document in the form of paper) may not be correctly converted into character codes. When such an error occurs in the character recognition, the characters represented by the character codes may be different from the characters in the original document. In this case, when a search for a keyword is performed in the collection of character codes accumulated in the document data base, a search omission may occur. The phrase “search omission” is defined to indicate that a character string is not detected as a result of the search for a keyword even though the original document includes a character string which corresponds to the keyword.
A known technology for preventing the search omission is described in, for example, Japanese Laid-Open Publication No. 7-152774.
In accordance with the technology described in Japanese Laid-Open Publication No. 7-152774, an expanded character string is developed at the time of search, using a similar character list for a character or characters, among the characters included in the keyword, which are easily mistaken for other character(s). The similar character list includes a plurality of characters which can be mistaken for the above-mentioned character(s). These character(s) are easily mistaken since there are other characters having similar shapes thereto.
The conventional technology described in Japanese Laid-Open Publication No. 7-152774 will be described with reference to FIGS. 24A and 24B.
FIG. 24A shows a case in which characters “(‘hon’)” and “□(‘koh’)” included in an original document are respectively converted into characters “(‘ki’)” and “(‘ku’)” having similar shapes thereto by an error in character recognition. The character recognition result is a collection of character codes, but in FIG. 24A, the character codes are shown by the characters corresponding to the character codes for easier understanding. Although the original document includes keyword “(‘nihon’)”, a search omission occurs when keyword “(‘nihon’)” is searched for using the character recognition result.
FIG. 24B shows an example of a similar character list. Row 99-1 shows that the character “(‘hon’)” is easily mistaken for characters “(‘ki’)”, “(‘dai’)”, “(‘futo’)” and “(‘sai’)”. Row 99-2 shows that the character “(‘koh’)” is easily mistaken for characters “□” (square symbol), “(‘kai’)”, “(‘en’)” and “(‘nado’)”.
In accordance with the conventional technology described in Japanese Laid-Open Publication No. 7-152774, keyword “(‘nihon’)” is searched for in the following manner. Using the similar character list shown in FIG. 24B, developed character strings “(‘nichiki’)”, “(‘nichidai’)”, “(‘nichifuto’)” and “(‘nichisai’)” are created. When keyword “(‘nihon’)” is searched for using the character recognition result, the developed character strings “(‘nichiki’)”, “(‘nichidai’)”, “(‘nichifuto’)” and “(‘nichisai’)” are also used as the keyword. Thus, “(‘nichiki’)”, which has been mistakenly converted from “(‘nihon’)” by character recognition, can be found.
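The development of a keyword through a similar character list can be sketched as follows. This is a minimal illustration and not the implementation of the cited publication: the list entries are Latin characters chosen for illustration (they are not the entries of FIG. 24B), the names `SIMILAR`, `develop_strings` and `search` are hypothetical, and substitution is assumed to be allowed at every character position of the keyword.

```python
from itertools import product

# Hypothetical similar character list: each key maps to the characters it is
# assumed to be easily mistaken for. Illustrative only, not FIG. 24B.
SIMILAR = {
    "l": ["1", "I"],
    "O": ["0"],
}


def develop_strings(keyword, similar=SIMILAR):
    """Return the keyword followed by every developed character string.

    Each character is either kept or replaced by one of its similar
    characters; all combinations are generated (an assumption of this sketch).
    """
    choices = [[ch] + similar.get(ch, []) for ch in keyword]
    return ["".join(combo) for combo in product(*choices)]


def search(keyword, recognition_result, similar=SIMILAR):
    """Return the developed strings that occur in the recognition result."""
    return [s for s in develop_strings(keyword, similar)
            if s in recognition_result]
```

In this sketch, a keyword whose character was misrecognized as a listed similar character is still found, whereas a misrecognition that is absent from the list is not.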
With the technology disclosed in Japanese Laid-Open Publication No. 7-152774, when a character included in the document is mistaken for a character which is not included in the similar character list, a search omission cannot be avoided. For example, it is assumed that keyword “(‘jinkoh’)” is searched for using the character recognition result shown in FIG. 24A. Character “(‘ku’)”, which is mistakenly converted from character “(‘koh’)”, is not included in the similar character list for character “(‘koh’)” shown in row 99-2 of FIG. 24B. Therefore, developed character string “(‘jinku’)” is not searched for, and thus a search omission occurs.
In order to reduce the undesirable possibility of such a search omission, the number of characters included in the similar character list can be increased. However, this increases the number of developed character strings and thus raises the costs (i.e., time and calculation amount) for the search.
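The growth in search cost can be made concrete with a short sketch. It assumes, for illustration only, that every character of the keyword may be replaced independently, so that the number of developed character strings is the product of (1 + list length) over the characters of the keyword. The function name `developed_string_count` and the sample lists are hypothetical.

```python
from math import prod


def developed_string_count(keyword, similar):
    """Count the strings to search for: the keyword itself plus all
    combinations of similar-character substitutions (sketch assumption:
    every position may be substituted independently)."""
    return prod(1 + len(similar.get(ch, [])) for ch in keyword)


# Hypothetical lists for a four-character keyword.
small_list = {ch: ["?"] * 4 for ch in "abcd"}   # 4 similar characters each
large_list = {ch: ["?"] * 9 for ch in "abcd"}   # 9 similar characters each
```

Under this assumption, enlarging each list from 4 to 9 entries raises the count for a four-character keyword from 625 to 10,000.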
According to one aspect of the invention, a document search device for searching for a keyword in a recognition result obtained by character recognition performed on a document image is provided. The keyword includes at least one first character, and a character code is assigned to each of the at least one first character. The recognition result includes at least one second character, and a character code and a partial area of the document image are assigned to each of the at least one second character. The document search device includes a first matching portion specification section for determining whether or not the recognition result includes at least one first matching portion which matches the keyword based on a comparison of the character code assigned to the at least one first character with the character code assigned to the at least one second character, and for specifying the at least one first matching portion when the recognition result includes the at least one first matching portion; a first portion specification section for determining whether or not a remaining part of the recognition result other than the at least one first matching portion includes at least one first portion which fulfills a prescribed first condition, and for specifying the at least one first portion when the remaining part includes the at least one first portion; and a second matching portion specification section for determining whether or not the at least one first portion includes at least one second matching portion which matches the keyword based on a comparison of a feature amount of the partial area of the document image assigned to the at least one second character included in the at least one first portion with a feature amount of an image of at least one first character included in the keyword, and for specifying the at least one second matching portion when the at least one first portion includes the at least one second matching portion.
The prescribed first condition includes a condition that the at least one first portion is in the vicinity of a specific second character having a width smaller than a prescribed value.
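The first two sections of this aspect can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed device: the type `RecChar`, the functions `first_matching_portions` and `first_portions`, and the parameters `min_width` and `radius` are hypothetical, and "in the vicinity" is interpreted here as "within `radius` recognized characters". The image-feature comparison of the second matching portion specification section is not shown.

```python
from dataclasses import dataclass


@dataclass
class RecChar:
    code: str    # character code assigned by character recognition
    width: int   # width of the partial area of the document image


def first_matching_portions(keyword, chars):
    """Start indices at which the character codes match the whole keyword."""
    n = len(keyword)
    return [i for i in range(len(chars) - n + 1)
            if all(chars[i + k].code == keyword[k] for k in range(n))]


def first_portions(keyword, chars, min_width, radius=1):
    """Indices outside the exact matches that lie within `radius` characters
    of a recognized character narrower than `min_width`."""
    matched = set()
    for i in first_matching_portions(keyword, chars):
        matched.update(range(i, i + len(keyword)))
    near = set()
    for i, c in enumerate(chars):
        if c.width < min_width:
            lo = max(0, i - radius)
            hi = min(len(chars), i + radius + 1)
            near.update(range(lo, hi))
    return sorted(near - matched)
```

One plausible reading of the width condition is that an unusually narrow recognized character may be a fragment of a wider character that was split during recognition, so only its neighbourhood needs the more expensive image-based comparison.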
In one embodiment of the invention, the second matching portion specification section includes a first determination section for determining whether or not the character code of a specific second character included in the at least one first portion matches the character code of a specific first character included in the keyword; a non-matching character specification section for, when the character code of the specific second character included in the at least one first portion does not match the character code of the specific first character included in the keyword, specifying one second character or two or more continuous second characters which include at least the specific second character included in the at least one first portion and have a width closest to a width of the specific first character as a non-matching character; and a second determination section for, when a distance between a feature amount of an image of the specific first character and a feature amount of an image of an area including one partial area or two or more partial areas assigned to the one second character or two or more continuous second characters included in the non-matching character is smaller than a prescribed value, determining that the specific first character matches the non-matching character.
In one embodiment of the invention, the document search device further includes a calculation section for calculating a prescribed determination reference value from the at least one first matching portion, and a detection section for detecting a second matching portion which fulfills a prescribed second condition among the at least one second matching portion based on the prescribed determination reference value.
In one embodiment of the invention, the calculation section calculates the prescribed determination reference value based on the feature amount of the document image of the at least one area assigned to the at least one second character included in the at least one first matching portion, and the prescribed second condition includes a condition that a distance between the feature amount of the document image of the at least one partial area assigned to the at least one second character included in the at least one second matching portion and the prescribed determination reference value is smaller than a prescribed value.
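The calculation section and the detection section can be sketched as follows. The sketch assumes that the determination reference value is the component-wise mean of the feature amounts taken from the first matching portions and that the distance is Euclidean; neither choice is stated in the text above. The names `reference_value` and `detect_second_matches` are hypothetical.

```python
from math import dist


def reference_value(first_match_features):
    """Component-wise mean of the feature vectors of the first matching
    portions (assumed form of the determination reference value)."""
    n = len(first_match_features)
    return [sum(column) / n for column in zip(*first_match_features)]


def detect_second_matches(second_match_features, reference, max_distance):
    """Indices of second matching portions whose feature vector lies closer
    to the reference than `max_distance` (Euclidean distance assumed)."""
    return [i for i, feature in enumerate(second_match_features)
            if dist(feature, reference) < max_distance]
```

The design idea is that portions already matched by character code supply a document-specific reference, so a second matching portion is retained only if it resembles the keyword as it actually appears in that document image.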
According to another aspect of the invention, a document search device for searching for a keyword in a recognition result obtained by character recognition performed on a document image is provided. The keyword includes at least one first character, and a character code is assigned to each of the at least one first character. The recognition result includes at least one second character, and a character code and a partial area of the document image are assigned to each of the at least one second character. The document search device includes a first matching portion specification section for determining whether or not the recognition result includes at least one first matching portion which matches the keyword based on a comparison of the character code assigned to the at least one first character with the character code assigned to the at least one second character, and for specifying the at least one first matching portion when the recognition result includes the at least one first matching portion; a first portion specification section for determining whether or not a remaining part of the recognition result other than the at least one first matching portion includes at least one first portion which fulfills a prescribed first condition, and for specifying the at least one first portion when the remaining part includes the at least one first portion; and a second matching portion specification section for determining whether or not the at least one first portion includes at least one second matching portion which matches the keyword based on a comparison of a feature amount of the partial area of the document image assigned to the at least one second character included in the at least one first portion with a feature amount of an image of at least one first character included in the keyword, and for specifying the at least one second matching portion when the at least one first portion includes the at least one second matching portion. 
A reliability degree of character recognition is further assigned to each of the at least one second character, and the prescribed first condition includes a condition that the at least one first portion is in the vicinity of a specific second character having the reliability degree lower than a prescribed threshold value.
In one embodiment of the invention, the document search device further includes a section for determining an image quality of the document image, and a section for determining the prescribed threshold value based on the image quality of the document image.
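The reliability-based first condition and the quality-dependent threshold can be sketched as follows. The rule in `threshold_for_quality` is purely an assumption made for illustration (a poorer image yields a higher threshold, so that more characters are re-examined); the text above does not specify how the threshold depends on the image quality. All names, the quality scale of 0 to 1, and the default values are hypothetical.

```python
def threshold_for_quality(quality, low=0.5, high=0.8):
    """Hypothetical rule: map an image quality in [0, 1] to a reliability
    threshold, from `high` for the poorest image to `low` for the best."""
    q = min(1.0, max(0.0, quality))
    return high - (high - low) * q


def low_reliability_vicinity(reliabilities, threshold, radius=1):
    """Indices within `radius` characters of any recognized character whose
    reliability degree is lower than `threshold`."""
    n = len(reliabilities)
    near = set()
    for i, r in enumerate(reliabilities):
        if r < threshold:
            near.update(range(max(0, i - radius), min(n, i + radius + 1)))
    return sorted(near)
```

With the same reliability degrees, a lower image quality in this sketch marks a larger part of the recognition result for the image-based comparison.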
In one embodiment of the invention, the second matching portion specification section includes a first determination section for determining whether or not the character code of a specific second character included in the at least one first portion matches the character code of a specific first character included in the keyword; a non-matching character specification section for, when the character code of the specific second character included in the at least one first portion does not match the character code of the specific first character included in the keyword, specifying one second character or two or more continuous second characters which include at least the specific second character included in the at least one first portion and have a width closest to a width of the specific first character as a non-matching character; and a second determination section for, when a distance between a feature amount of an image of the specific first character and a feature amount of an image of an area including one partial area or two or more partial areas assigned to the one second character or two or more continuous second characters included in the non-matching character is smaller than a prescribed value, determining that the specific first character matches the non-matching character.
In one embodiment of the invention, the document search device further includes a calculation section for calculating a prescribed determination reference value from the at least one first matching portion, and a detection section for detecting a second matching portion which fulfills a prescribed second condition among the at least one second matching portion based on the prescribed determination reference value.
In one embodiment of the invention, the calculation section calculates the prescribed determination reference value based on the feature amount of the document image of the at least one area assigned to the at least one second character included in the at least one first matching portion, and the prescribed second condition includes a condition that a distance between the feature amount of the document image of the at least one partial area assigned to the at least one second character included in the at least one second matching portion and the prescribed determination reference value is smaller than a prescribed value.
According to still another aspect of the invention, a document search device for searching for a keyword in a recognition result obtained by character recognition performed on a document image includes a first determination section for determining whether or not the recognition result includes a partially matching portion with which a part of the keyword matches but the entirety of the keyword does not match, in accordance with a first reference; a first non-matching portion specification section for, when the recognition result includes the partially matching portion, specifying a first non-matching portion of the keyword which does not match the recognition result; a second non-matching portion specification section for specifying a second non-matching portion having a width closest to a width of the first non-matching portion in the partially matching portion; and a second determination section for determining whether or not the first non-matching portion matches the second non-matching portion, in accordance with a second reference which is different from the first reference.
According to still another aspect of the invention, a document search device for searching for a keyword in a recognition result obtained by character recognition performed on a document image is provided. The keyword includes at least one first character, and a character code is assigned to each of the at least one first character. The recognition result includes at least one second character, and a character code and an area of the document image are assigned to each of the at least one second character. The document search device includes a first determination section for determining whether or not at least a part of the keyword matches at least a part of the recognition result based on a comparison of the character code assigned to the at least one first character with the character code assigned to the at least one second character; a first non-matching character specification section for, when a part of the keyword matches the at least a part of the recognition result, specifying a first character among the at least one first character included in the keyword as a first non-matching character; a second non-matching character specification section for specifying one second character or two or more continuous second characters, having a width closest to a width of the first non-matching character, among the at least one second character included in the recognition result as a second non-matching character; and a second determination section for determining whether or not the first non-matching character matches the second non-matching character based on a comparison of a feature amount of an image of the first non-matching character with a feature amount of an image of an area including one partial area or two or more partial areas assigned to the one second character or two or more continuous second characters included in the second non-matching character.
In one embodiment of the invention, the second non-matching character specification section specifies the second non-matching character by making the number of at least one second character variable and repeating a comparison of the width of the first non-matching character and the width of the at least one second character.
In one embodiment of the invention, the second non-matching character specification section calculates a tolerable range of width of the second non-matching character in accordance with the width of the first non-matching character and specifies the second non-matching character under the condition that the second non-matching character has a width within the tolerable range of width.
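The specification of the second non-matching character by a variable number of second characters and a tolerable range of width can be sketched as follows. The sketch assumes that the width of a run of continuous characters is the sum of the individual widths (gaps between partial areas are ignored) and that the tolerable range is a fixed fraction of the width of the first non-matching character. The name `closest_width_run` and the parameters `max_chars` and `tolerance` are hypothetical.

```python
def closest_width_run(widths, start, target_width, max_chars=3, tolerance=0.3):
    """Return how many continuous recognized characters, beginning at
    `start`, give the total width closest to `target_width`.

    Only totals inside the tolerable range
    [target_width * (1 - tolerance), target_width * (1 + tolerance)]
    are accepted; None is returned if no run qualifies.
    """
    lo = target_width * (1 - tolerance)
    hi = target_width * (1 + tolerance)
    best = None          # (difference from target, number of characters)
    total = 0
    for count in range(1, max_chars + 1):
        if start + count > len(widths):
            break
        total += widths[start + count - 1]
        if lo <= total <= hi:
            diff = abs(total - target_width)
            if best is None or diff < best[0]:
                best = (diff, count)
    return None if best is None else best[1]
```

For recognized widths of 6 and 8 and a keyword character of width 16, the sketch selects the two characters together, which corresponds to one wide character having been split into two narrow ones.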
According to still another aspect of the invention, a recording medium having a program for executing a document search of a keyword in a recognition result obtained as a result of character recognition of a document image is provided. The program includes the steps of determining whether or not the recognition result includes a partially matching portion with which a part of the keyword matches but the entirety of the keyword does not match, in accordance with a first reference; specifying a first non-matching portion of the keyword which does not match the recognition result when the recognition result includes the partially matching portion; specifying a second non-matching portion, in the partially matching portion, which has a width closest to the width of the first non-matching portion; and determining whether the first non-matching portion matches the second non-matching portion, in accordance with a second reference which is different from the first reference.
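The four steps of the program can be sketched end to end as follows. This is an illustrative sketch under several assumptions, not the claimed program: the first reference is taken to be equality of character codes, the second reference is taken to be a Euclidean distance between feature amounts, the width of a run is the sum of its character widths, and the feature amount of an image area is supplied by a caller-provided function `area_feature`. All names and parameters are hypothetical. Exact matches are also returned, since they pass the first reference at every character.

```python
from math import dist


def search_with_fallback(keyword, kw_widths, kw_features, codes, widths,
                         area_feature, max_distance, max_run=3):
    """Return start indices in the recognition result where the keyword is
    found by character codes, with an image-feature fallback for characters
    whose codes do not match.

    area_feature(start, count) must return the feature vector of the image
    area covering `count` recognized characters beginning at `start`.
    """
    hits = []
    for start in range(len(codes)):
        j = start                 # current position in the recognition result
        code_matches = 0
        found = True
        for k, ch in enumerate(keyword):
            if j < len(codes) and codes[j] == ch:     # first reference
                j += 1
                code_matches += 1
                continue
            # Specify the run of recognized characters whose total width is
            # closest to the width of the non-matching keyword character.
            best, total = None, 0
            for count in range(1, max_run + 1):
                if j + count > len(codes):
                    break
                total += widths[j + count - 1]
                diff = abs(total - kw_widths[k])
                if best is None or diff < best[0]:
                    best = (diff, count)
            # Second reference: compare feature amounts of the images.
            if best is None or dist(kw_features[k],
                                    area_feature(j, best[1])) >= max_distance:
                found = False
                break
            j += best[1]
        if found and code_matches > 0:    # at least part matched by code
            hits.append(start)
    return hits
```

Because the feature comparison is performed only where a character code fails to match, the additional cost is confined to the non-matching portions rather than multiplying the number of strings searched.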
Thus, the invention described herein makes possible the advantages of providing a document search device for reducing search omissions caused by an error in character recognition without raising the costs (i.e., time and calculation amount) for the search; and a recording medium having a document search program stored thereon.