The present invention generally relates to techniques for proofreading digitized text data, and more specifically, it relates to a system, a method, and a program for supporting proofreading of text data generated through optical character recognition.
In order to store paper documents, disclose paper documents on the Internet, and so on, digitization of existing paper documents is generally performed. Although there may be a case where paper documents are digitized simply as images, printed text of the documents is often converted into digital text by using optical character recognition (hereinafter, may be referred to as “OCR”) for later editing, retrieval, and so on. However, character recognition is not always performed completely correctly, and thus recognition errors may occur. Accordingly, in order to minimize the number of recognition errors, an operator generally uses a computer to check the correctness of the text that the computer has automatically recognized.
As a background art of this field, Japanese Patent Application Publication No. 2003-099709 discloses an optical character recognition apparatus. When an incorrectly recognized character is corrected on a recognition result correction screen, the optical character recognition apparatus refers to a similar-character-shape set file and reflects the correction on the similar character shapes which include the corrected character and are contained in an intermediate file, thereby performing a collective correction process.
As another background art, Japanese Patent Application Publication No. 07-057042 discloses a character reading apparatus that allows an operator to efficiently convert unreadable characters. In the character reading apparatus, unreadable characters that cannot be converted into character data undergo clustering and are classified into image groups. Each image group includes a representative image and other images, and candidate characters for the image group are extracted. On the basis of the representative image group and the corresponding candidate characters, the character images belonging to the representative image group and the candidate characters are displayed on the same display screen. If one of the candidate characters is selected as the correct character, or the correct character is newly entered on the display screen, the individual character images of the representative image group are collectively converted into character data of the correct character.
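The collective conversion described above can be illustrated by a minimal sketch. The sketch is not taken from the cited publication; the `ImageGroup` structure, the `convert_group` function, and the image identifiers are hypothetical names introduced only for illustration, and the clustering itself is assumed to have been performed already.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ImageGroup:
    """One cluster of unreadable character images (hypothetical structure)."""
    representative: str      # identifier of the representative image
    members: List[str]       # identifiers of the other images in the group
    candidates: List[str]    # candidate characters shown to the operator


def convert_group(group: ImageGroup, chosen: str) -> Dict[str, str]:
    """Assign the operator's chosen character to every image in the group."""
    image_ids = [group.representative, *group.members]
    return {image_id: chosen for image_id in image_ids}


# The operator picks the second candidate for the whole group at once.
group = ImageGroup("img_007", ["img_019", "img_042"], ["0", "O", "Q"])
result = convert_group(group, group.candidates[1])
```

In this sketch a single operator decision is applied to the representative image and to every other image of the same group, which is the source of the efficiency gain described above.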
As another background art, Japanese Patent Application Publication No. 09-179934 discloses a character recognition apparatus that reduces the workload of an operator. The character recognition apparatus optically scans characters and figures on a printed document, reads the characters and the figures as image data, extracts a character region and a figure region of the read image data, performs a character recognition process on the character region, and displays the recognition result to prompt an operator to check whether the displayed recognition result is correct. If an error is found in the recognition result, the character recognition apparatus accepts an instruction of the operator, corrects/edits the error in the recognition result, and outputs the final result that is obtained by correcting/editing all errors contained in the recognition result. In such a character recognition apparatus, when a recognition result is displayed in which a character arrangement error resulting from an error in the extraction process exists in each of a plurality of consecutive lines, collective correction/editing is performed on a character string group constituted by the incorrect character string existing in each line by specifying the character string group.
As another background art, Japanese Patent Application Publication No. 06-290297 discloses a character recognition apparatus that accumulates feature information along with position information in an image feature accumulation unit, extracts information accumulated in the image feature accumulation unit by using history information regarding correction made by an operator, performs matching on the extracted reference feature and the following feature, and performs automatic correction on the basis of the result.
As another background art, Japanese Patent Application Publication No. 05-314303 discloses an incorrectly read character correction method that reduces the work of an operator in correcting incorrectly read characters and shortens the correction time. In the method, a character or character string corrected by an operator through a key operation or the like is stored as a correction history. A character that has been corrected before is detected in the OCR result. The incorrectly read part is highlighted and displayed to the operator, and the operator is allowed to select a correction candidate character.
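The use of a correction history described above can be illustrated by a minimal sketch. The sketch is not the method of the cited publication; the function name `flag_known_errors`, the dictionary layout of the history, and the sample strings are assumptions introduced only for illustration.

```python
def flag_known_errors(ocr_text, history):
    """Return (position, wrong, suggestion) for each previously corrected
    string that appears in the OCR result.

    `history` maps a string the operator corrected before to the string
    it was corrected to (a hypothetical layout of the correction history).
    """
    hits = []
    for wrong, right in history.items():
        start = ocr_text.find(wrong)
        while start != -1:
            hits.append((start, wrong, right))
            start = ocr_text.find(wrong, start + 1)
    return sorted(hits)


# Past corrections made by the operator (illustrative values).
history = {"rn": "m", "1l": "ll"}
hits = flag_known_errors("the surnrner of he1lo", history)
```

Each returned position marks a part that would be highlighted for the operator. The sketch only proposes the suggestion; the decision to accept it remains with the operator, because a string corrected before may also occur legitimately elsewhere in the text.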
As another background art, Japanese Patent Application Publication No. 03-240183 discloses a system for automatically correcting a recognized character in a character recognition result, which can automate the work of checking and correcting an incorrectly recognized character. The system includes: character recognition means for executing recognition on a character-by-character basis in a process of checking and correcting a character recognition result; means for displaying a character recognition result; means for supplying a check result and a correction instruction; a table storing all character recognition results; re-arrangement determining means for determining, when a recognition error in which the correct character is ranked in the second place or lower is corrected so that the correct character is ranked in the first place, whether or not a similar recognition error has occurred at another character position; and control means for controlling operations of the foregoing means. The re-arrangement means modifies the rank of the candidate character estimated as the correct character to the first place at each character position for which it is determined that a similar recognition error has occurred.
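The re-arrangement of candidate ranks described above can be illustrated by a minimal sketch. The cited publication is not quoted here as to how a similar recognition error is determined; the sketch assumes, purely for illustration, that an error is similar when another position has the same first-place candidate and contains the correct character in the second place or lower. The function name `promote_similar` and the table layout are likewise hypothetical.

```python
def promote_similar(table, corrected_pos, correct_char):
    """Promote `correct_char` to first place wherever a similar error occurs.

    `table` holds, for each character position, the candidate characters
    in rank order. Similarity criterion (an assumption of this sketch):
    same first-place candidate as the corrected position, with the
    correct character ranked second or lower.
    """
    wrong_char = table[corrected_pos][0]
    updated = []
    for candidates in table:
        if candidates[0] == wrong_char and correct_char in candidates[1:]:
            rest = [c for c in candidates if c != correct_char]
            candidates = [correct_char] + rest
        updated.append(candidates)
    return updated


# The operator corrects position 0 from "l" to "1".
table = [["l", "1", "I"], ["a", "o"], ["l", "I", "1"], ["l", "t"]]
new_table = promote_similar(table, 0, "1")
```

Positions 0 and 2 are re-ranked because both list "l" first and contain "1" as a lower candidate, whereas position 3 is left unchanged because "1" is not among its candidates.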
As another background art, Japanese Patent Application Publication No. 2003-223608 discloses a recognized character string correction method that enables collective correction of a recognition result and attempts to improve the accuracy of character correction. In the method, a given character is selected from text data of a recognition result on the basis of an operation instruction of an operator, and the selected character is replaced with a replacement character on the basis of the operation instruction of the operator. Then, a processing-target character is shifted to a character that follows the selected character in the text data, and whether or not the processing-target character is the same as or similar to the selected character is determined. If it is determined that the processing-target character is the same as or similar to the selected character, the character is set as a character subjected to automatic correction and is temporarily replaced with the replacement character. The grammatical construction around the replacement character is analyzed, and if it is determined that the grammatical construction is correct, the replacement with the replacement character is confirmed to be valid.
As another background art, Japanese Patent Application Publication No. 2005-309608 discloses a character recognition apparatus that collectively displays character images having similar character shapes on a confirmation screen on which character images of the same category are arranged. In an output mechanism of the character recognition apparatus, image data of characters subjected to a character recognition process is classified by the character (category) recognized in the character recognition process, feature values related to the shapes of the characters included in the image data in each classified category are determined, the image data is further classified into a plurality of clusters on the basis of the feature values, and a confirmation screen for displaying the image data for each cluster is created and displayed.
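The two-stage classification described above, first by recognized category and then by shape feature, can be illustrated by a minimal sketch. The sketch is not the apparatus of the cited publication; it assumes a single scalar feature value per image and a simple gap-based grouping, and the function name `group_for_confirmation`, the `gap` parameter, and the sample values are hypothetical.

```python
from collections import defaultdict


def group_for_confirmation(samples, gap=0.2):
    """Group character images by category, then by feature value.

    `samples` is a list of (image_id, category, feature) tuples. Within a
    category, images are sorted by feature and a new cluster is started
    whenever two neighbouring features differ by more than `gap` (a
    simplification assumed for this sketch).
    """
    by_category = defaultdict(list)
    for image_id, category, feature in samples:
        by_category[category].append((feature, image_id))

    screen = {}
    for category, items in by_category.items():
        items.sort()
        clusters = [[items[0][1]]]
        for (prev_feature, _), (feature, image_id) in zip(items, items[1:]):
            if feature - prev_feature > gap:
                clusters.append([])
            clusters[-1].append(image_id)
        screen[category] = clusters
    return screen


samples = [("a1", "A", 0.10), ("a2", "A", 0.15),
           ("a3", "A", 0.80), ("b1", "B", 0.50)]
screen = group_for_confirmation(samples)
```

In this sketch the image "a3" lands in its own cluster within category "A", so an image whose shape differs from the others recognized as the same character stands out on the confirmation screen.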
As another background art, Sueki Matsumura et al. have proposed an input method in which the most efficient check-and-correction mode is automatically selected from among a plurality of modes in accordance with the recognition accuracy, such as the character recognition rate. A check-and-correction mode that shows the maximum input efficiency is associated with each recognition accuracy level in advance by measurement. As the input efficiency, for example, an input rate that represents the number of characters that can be input per unit time is assumed. Next, with reference to the obtained correspondence, the check-and-correction mode that provides the maximum input rate is automatically selected for the recognition accuracy level estimated on the basis of the character recognition result of a document. The input work is performed in accordance with the presented check-and-correction mode.
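The mode selection described above can be illustrated by a minimal sketch. The mode names, the accuracy thresholds, and the input rates below are invented placeholders, not values reported by Matsumura et al.; they only show how a correspondence measured in advance could be consulted.

```python
# Input rate (characters per minute) measured in advance for each mode at
# each recognition accuracy level. All names and numbers are hypothetical.
INPUT_RATE = {
    "low": {"retype_all": 60, "check_each": 45, "check_rejects_only": 30},
    "medium": {"retype_all": 60, "check_each": 90, "check_rejects_only": 70},
    "high": {"retype_all": 60, "check_each": 110, "check_rejects_only": 150},
}


def accuracy_level(recognition_rate):
    """Map an estimated character recognition rate to a level (assumed thresholds)."""
    if recognition_rate < 0.90:
        return "low"
    if recognition_rate < 0.98:
        return "medium"
    return "high"


def select_mode(recognition_rate):
    """Return the mode with the maximum input rate for the estimated level."""
    rates = INPUT_RATE[accuracy_level(recognition_rate)]
    return max(rates, key=rates.get)


mode_low = select_mode(0.85)
mode_medium = select_mode(0.95)
mode_high = select_mode(0.995)
```

With these placeholder values, retyping is selected when recognition is poor, and checking only the rejected characters is selected when recognition is nearly perfect.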
As another background art, Sucharu Miyahara et al. disclose a multi-stage collective correction method using pattern matching, as an efficient check-and-correction method for Japanese OCR. In this method, rejected characters of the same category are collected and collectively corrected in a first-stage rejection process. In a subsequent-stage error correction process, incorrect characters are detected by using the result of the first-stage rejection process and are collectively corrected again.