Optical character recognition (“OCR”) is a computational technique for transforming images of text into text that can be recognized by a computing system. OCR engines and algorithms can be used to perform useful functions, such as enabling the electronic search of the content of an image of a document. OCR engines generally evaluate each portion of an image that could represent a letter, a number, or some other character. For example, an OCR engine may be set to evaluate each line, circle, dot, or speck in an image to determine whether the shape represents a letter, a number, or another character. In particular, an OCR engine may evaluate each line to determine if the line is a number 1, a letter L, part of a number 7, part of a number 4, or the like. As another example, the OCR engine may evaluate each dot or speck in an image to determine if the speck is a period, part of an ellipsis, or part of the lower case letter “i”. Although these are simple examples of OCR engine functions, they illustrate that additional artifacts in an image can cause an OCR engine to operate significantly slower, because the engine must evaluate each artifact as though it could be a legitimate character that should be converted to text.
Artifacts in an image can make it difficult for an OCR engine to transform portions of an image into text. Artifacts in an image (e.g., visual artifacts) are generally unintended, undesired, and/or non-beneficial anomalies that appear in an image or in a representation of an image. Artifacts such as lines, dots, smears, and other image-related distortions can trigger analysis events within an OCR engine, and can cause the OCR engine to take orders of magnitude longer to process an image than it would take to process the same image without the artifacts. Moreover, artifacts in an image can cause the OCR engine to misinterpret or incorrectly translate portions of the image into text, such that the resulting text fails to represent the text that is in the image. Incorrectly translating images into text can undermine the utility of an OCR engine and can reduce or destroy customer trust in software systems that employ OCR engines.
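The effect of artifacts on the amount of work an OCR engine performs can be illustrated with a minimal sketch. The sketch is not a description of any particular OCR engine. It assumes, for illustration only, that an image is represented as a grid of characters in which “#” marks an ink pixel, and that each group of connected ink pixels is one candidate shape the engine would have to evaluate. The function name count_candidate_regions and the sample images are hypothetical.

```python
from collections import deque


def count_candidate_regions(image):
    """Count 4-connected groups of ink pixels ('#') in a list of equal-length strings.

    Assumption for illustration: each connected group is one shape that an
    OCR engine would have to evaluate as a possible character.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = set()
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#" or (r, c) in seen:
                continue
            # Found an unvisited ink pixel: flood-fill the shape it belongs to.
            regions += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and image[ny][nx] == "#"
                        and (ny, nx) not in seen
                    ):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return regions


# A clean image containing a single letter-like shape.
clean = [
    "#....",
    "#....",
    "#....",
    "#....",
    "####.",
]

# The same image with two isolated specks (hypothetical scanning artifacts).
noisy = [
    "#...#",
    "#....",
    "#.#..",
    "#....",
    "####.",
]

print(count_candidate_regions(clean))  # 1
print(count_candidate_regions(noisy))  # 3
```

In this sketch, two specks triple the number of shapes to be evaluated, even though the image contains only one real character. Each speck would also have to be checked against possibilities such as a period or the dot of a lower case “i”.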
What is needed is a method and system for identifying and addressing imaging artifacts to enable a software system to provide financial services based on an image of a financial form, according to various embodiments.