The present invention relates to optical character recognition and, more particularly, to a method and apparatus for detecting errors in the output of optical character recognition.
1. Background Art
Acquisition of text and graphics from paper documentation is a significant issue in many industries. For example, a publishing company may print hundreds or thousands of academic papers over the course of a year. Often the publishing company works from paper documents, which must be input into the computer equipment of the publishing company. One conventional approach is to hire keyboardists to read the paper documents and type them into the computer system. However, keying in documents is a time-consuming and costly procedure.
Optical character recognition (OCR) is a technology that promises to be beneficial for the publishing industry and others, because the input processing rate of an OCR device far exceeds that of a keyboardist. Thus, employees of the publishing company typically work from scanned documents, which are converted into a computer-readable text format, such as ASCII, by an OCR device.
However, even the high recognition rates that are possible with modern OCR devices (which often exceed 95%) are not sufficient for such industries as the publishing industry, which demands a high degree of accuracy. Accordingly, publishing companies hire proofreaders to review the OCR output by hand.
Proofreading OCR output by hand, however, is time-consuming and difficult. A person must comb through both the original paper document and a printout or screen display of the OCR output and compare them word by word. Even with high recognition rates, persons proofreading the OCR output are apt to become complacent and miss errors in the text.
Another conventional option is to spell check the resultant computer-readable text. However, not all recognition errors result in misspelled words. In addition, an input word may be so garbled that the proofreader must refer back to the paper text during the spell checking operation. Once the proofreader has looked at the paper text and determined the correct word, the correct word must be keyed into the OCR output text. This approach has been found to be time-consuming and somewhat error-prone.
2. Disclosure of the Invention
There exists a need for facilitating human proofreading of OCR output. Specifically, there is a need for reducing the time consumed while proofreading the OCR output.
These and other needs are met by the present invention, in which characters in a document image from an original paper document are recognized (e.g., through OCR) to produce a document text. Regions in the document image that correspond to words in the document text are determined, and a recognition confidence parameter is determined for each region. The regions in the document image are displayed in a manner indicative of the respective recognition confidence parameter.
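The association between words, image regions, and recognition confidence parameters can be illustrated with a minimal Python sketch. The `WordRegion` type, its field names, the 0.0 to 1.0 confidence scale, the single 0.90 cutoff, and the sample values are all assumptions made for illustration; the disclosure does not prescribe any particular data layout or scale.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WordRegion:
    """Hypothetical record tying a recognized word to its image region."""
    word: str          # word from the document text
    left: int          # bounding box in document-image pixels
    top: int
    right: int
    bottom: int
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


def regions_to_flag(regions: List[WordRegion], threshold: float = 0.90) -> List[WordRegion]:
    """Return the regions whose confidence falls below the assumed threshold."""
    return [r for r in regions if r.confidence < threshold]


# Illustrative data only; a real system would obtain these from the OCR step.
regions = [
    WordRegion("optical", 10, 10, 80, 30, 0.98),
    WordRegion("charactcr", 90, 10, 180, 30, 0.62),
]
flagged = regions_to_flag(regions)
```

In this sketch, the flagged regions are the ones a display routine might render distinctively so that a proofreader's attention is drawn to them in the document image.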
Preferably, the user can select a position in the document image. A selected word is determined according to the region of the document image that includes the selected position and is displayed, for example, in a pop-up menu. In addition, the recognition confidence parameters may be compared to more than one threshold, and each region displayed in a color that corresponds to the thresholds that have been exceeded.
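The position lookup and the multi-threshold coloring can be sketched as follows. This is one possible reading, not the claimed implementation: the names `WordRegion`, `word_at`, and `color_for`, the two threshold values, and the three colors are hypothetical, and "exceeded" is taken to mean strictly greater than a threshold.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WordRegion:
    """Hypothetical record: a recognized word, its bounding box, its confidence."""
    word: str
    left: int
    top: int
    right: int
    bottom: int
    confidence: float


def word_at(regions: Sequence[WordRegion], x: int, y: int) -> Optional[WordRegion]:
    """Return the first region containing the selected position, or None."""
    for r in regions:
        if r.left <= x <= r.right and r.top <= y <= r.bottom:
            return r
    return None


# Assumed values: ascending thresholds and one color per count of thresholds exceeded.
THRESHOLDS = (0.60, 0.90)
COLORS = ("red", "yellow", "black")


def color_for(confidence: float) -> str:
    """Pick a color by how many thresholds the confidence strictly exceeds."""
    return COLORS[bisect_left(THRESHOLDS, confidence)]


regions = [
    WordRegion("optical", 10, 10, 80, 30, 0.98),
    WordRegion("charactcr", 90, 10, 180, 30, 0.62),
]
selected = word_at(regions, 100, 20)  # a position inside the second region
```

A user interface built on this sketch would show `selected.word` in a pop-up menu and draw each region using `color_for(region.confidence)`.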
Additional objects, advantages, and novel features of the present invention will be set forth in part in the detailed description which follows, and in part will become apparent upon examination or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.