1. Technical Field
The present invention relates to searching for a character string in text data.
2. Related Art
In recent years, Optical Character Recognition (OCR) systems have come into widespread use. Such systems are used for reading a document, for generating electronic data (image data) from the read document, and for generating text data from the image data. The generated text data may then be searched for particular character strings.
The success of character recognition depends on the quality of the document on which character recognition is to be performed, and also on the environment in which the character recognition is performed. FIG. 9A shows an example of a document that includes soiling A. FIG. 9B shows a result of character recognition for the document shown in FIG. 9A. FIG. 9C is a drawing illustrating the character recognition malfunction that is shown in FIG. 9B.
In FIG. 9A, the document has document blocks 903 and 904. Each of the document blocks has four lines of character strings. In this case, a problem arises in that, during an OCR operation, document blocks may not be recognized properly due to the presence of the soiling A on the document. In the example in FIG. 9B, the document block 904 is incorrectly recognized as two separate document blocks. The first line of the document block 904 is incorrectly recognized as two separate lines, line 1 and line 5. Similarly, the second line of the document block 904 is incorrectly recognized as two separate lines, line 2 and line 6. Furthermore, the third line of the document block 904 is also incorrectly recognized as two separate lines, line 3 and line 7.
In a case such as that discussed above where an error in character recognition occurs, character strings may be recognized to occur in an order indicated by arrow b in FIG. 9C, instead of being correctly recognized as occurring, for example, in the order indicated by arrow a. When such a recognition error occurs, OCR is likely to generate incorrect text information and as a consequence, keywords in the generated text information may not be searchable.
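The effect of the reading order on searchability can be illustrated with a minimal sketch. The line contents and the split column below are invented for illustration and are not taken from FIG. 9A. The sketch assumes only that each line of a block is cut into a left and a right fragment, and that all left fragments are emitted before all right fragments (corresponding to arrow b) instead of each line being read in full (corresponding to arrow a).

```python
# Hypothetical three-line block; the text is illustrative only.
lines = [
    "the keyword search",
    "is performed on the",
    "generated text data",
]
SPLIT = 8  # hypothetical column at which the soiling cuts each line


def correct_order(lines):
    """Arrow a: each line is read in full, top to bottom."""
    return " ".join(lines)


def split_order(lines, split):
    """Arrow b: all left fragments are emitted first, then all right fragments."""
    left = [line[:split].strip() for line in lines]
    right = [line[split:].strip() for line in lines]
    return " ".join(left + right)


def searchable(text, keyword):
    """Plain substring search over the generated text."""
    return keyword in text


print(searchable(correct_order(lines), "keyword"))       # True
print(searchable(split_order(lines, SPLIT), "keyword"))  # False
```

In the split order, "keyword" is broken into "keyw" and "ord", which end up far apart in the generated text, so a plain substring search no longer finds it.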
JP-A-2001-337993 discloses a technology for improving accuracy of a keyword search. According to JP-A-2001-337993, one or more characters included in a keyword are searched for. Then, location information on the characters found by the search is extracted. The location of the keyword is estimated on the basis of the extracted location information, and a keyword search is then performed using pattern matching.
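The general idea of a location-based keyword search can be sketched as follows. This is not the method of JP-A-2001-337993, whose details are not given here, but a simplified illustration under stated assumptions: the OCR result is taken to be a list of hypothetical (character, x, y) tuples on a fixed-pitch grid, and an exact character comparison at the estimated positions stands in for the pattern matching step. The names `layout`, `find_keyword`, and `pitch` are hypothetical.

```python
from collections import defaultdict


def layout(lines):
    """Build hypothetical OCR output: one (char, x, y) tuple per character."""
    return [(ch, x, y) for y, line in enumerate(lines) for x, ch in enumerate(line)]


def find_keyword(chars, keyword, pitch=1):
    """Return the (x, y) positions at which the keyword is estimated to start.

    chars   -- iterable of (character, x, y) tuples, in any order
    pitch   -- assumed horizontal distance between adjacent characters
    """
    if not keyword:
        return []
    # Step 1: record the location of every recognized character.
    index = defaultdict(set)
    for ch, x, y in chars:
        index[ch].add((x, y))
    hits = []
    # Step 2: each occurrence of the first keyword character is a candidate.
    for x0, y0 in sorted(index[keyword[0]]):
        # Step 3: estimate where the remaining characters should lie and
        # compare them with what was recognized at those locations
        # (exact comparison, used here in place of pattern matching).
        if all((x0 + i * pitch, y0) in index[ch] for i, ch in enumerate(keyword)):
            hits.append((x0, y0))
    return hits


chars = layout(["the keyword search"])
print(find_keyword(chars[::-1], "keyword"))  # [(4, 0)]
```

Because the sketch relies on character locations and not on the order in which the characters were emitted, the reversed input in the usage line gives the same result as the original order.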