This disclosure relates generally to the field of optical character recognition. More particularly, the present disclosure relates to methods for reducing misidentification of characters during optical character recognition.
The process of obtaining an electronic file of a text message from a physical document bearing the printed text message begins by scanning the document with a device such as an optical scanner or a facsimile machine. Such a device produces an electronic image of the original document. The output image is then supplied to a computer or other processing device, which runs an optical character recognition (“OCR”) algorithm on the scanned image.
The OCR software then processes the image of the scanned document to differentiate between images and text and determine what letters are represented in the light and dark areas. Older OCR systems matched these images against stored bitmaps based on specific fonts. The hit-or-miss results of such pattern-recognition systems helped establish OCR's reputation for inaccuracy. More modern OCR engines may utilize a variety of techniques to analyze the image and to correlate text characters to the image.
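The bitmap matching used by older OCR systems can be illustrated with a minimal sketch. The 3x3 glyphs, the `TEMPLATES` table, and the function names below are hypothetical and chosen only for illustration; actual systems used larger, font-specific bitmaps.

```python
# Hypothetical 3x3 glyph bitmaps (1 = dark pixel, 0 = light pixel).
# Real template-matching systems stored larger bitmaps for specific fonts.
TEMPLATES = {
    "o": ((1, 1, 1),
          (1, 0, 1),
          (1, 1, 1)),
    "c": ((1, 1, 1),
          (1, 0, 0),
          (1, 1, 1)),
}


def match_score(glyph, template):
    """Count the pixels at which the scanned glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template character with the highest pixel agreement."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))


# A scanned "c" that picked up one stray dark pixel on its open side.
noisy_c = ((1, 1, 1),
           (1, 0, 1),
           (1, 1, 1))
print(recognize(noisy_c))
```

In this toy case a single added pixel makes the scanned "c" identical to the stored "o", which suggests why such pattern matching produced hit-or-miss results.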
For example, neural network technology may be used to analyze the stroke edge, that is, the line of discontinuity between the text characters and the background. Allowing for irregularities of printed ink on paper, each of several algorithms averages the light and dark along the side of a stroke, matches it to known characters, and makes a best guess as to which character it is. The OCR software then averages or polls the results from all the algorithms to obtain a single reading. Alternatively, the OCR software may use grammar recognition, spell-check, or wavelet conversion to recognize various characters.
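The polling step described above might be sketched as a simple majority vote over the best guesses of the individual algorithms. The function name `poll_readings` and the sample guesses are assumptions made for illustration; the disclosure does not specify how the polling is performed.

```python
from collections import Counter


def poll_readings(readings):
    """Return the character proposed by the largest number of algorithms.

    `readings` holds one best-guess character per recognition algorithm.
    When counts are equal, the guess encountered first is returned.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    return Counter(readings).most_common(1)[0][0]


# Hypothetical best guesses from three recognition algorithms
# for the same character position.
guesses = ["c", "o", "c"]
print(poll_readings(guesses))
```

A weighted average of per-character confidence scores would be an equally plausible reading of "averages"; the majority vote is shown only because it is the simplest form of polling.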
However, conventional OCR algorithms continue to fail on simple distinctions, such as between “oar” and “car” or between “wet” and “vet,” because information is added or removed during copying, printing, or scanning. Even with current systems, optical character recognition cannot efficiently resolve a discrepancy between two grammatically appropriate, correctly spelled words.
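The limitation can be illustrated with a minimal sketch of a dictionary-based spell-check filter. The word list and the function name `spell_check_filter` are hypothetical; the sketch only shows that such a filter removes misspelled candidates but cannot choose between two valid words.

```python
# Hypothetical, deliberately tiny word list standing in for a full dictionary.
DICTIONARY = {"car", "oar", "vet", "wet"}


def spell_check_filter(candidates):
    """Keep only the candidate readings that are correctly spelled words."""
    return [word for word in candidates if word in DICTIONARY]


# A misspelled candidate is rejected, leaving a single reading.
print(spell_check_filter(["cax", "car"]))

# Both candidates are valid words, so the filter cannot pick one.
print(spell_check_filter(["oar", "car"]))
```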