The present exemplary embodiments disclosed herein relate generally to image processing. They find particular application in conjunction with localizing data fields of forms, and will be described with particular reference thereto. However, it is to be appreciated that the present exemplary embodiments are also amenable to other like applications.
Forms are a type of document that provides pre-defined data fields for entry of data. The spatial organization of data fields facilitates capture of data in a structured and organized fashion by human and automatic means. In a straightforward case, each data field can be cropped out of an image of the form and run through Optical Character Recognition (OCR) individually. This is called zonal OCR.
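By way of illustration only, the cropping step of zonal OCR can be sketched as follows. The image representation (a list of pixel rows), the field names, the box coordinates, and the `recognize` callable are all assumptions made for this sketch; a toy ink counter stands in for a real OCR engine.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]          # rows of pixels; 0 = background, 1 = ink (assumed encoding)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right and bottom are exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image that lies inside the given field boundary."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def zonal_ocr(
    image: Image,
    fields: Dict[str, Box],
    recognize: Callable[[Image], str],
) -> Dict[str, str]:
    """Crop every data field out of the form image and recognize each crop on its own."""
    return {name: recognize(crop(image, box)) for name, box in fields.items()}


# Toy 4x6 "form" with two hypothetical data fields.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
fields = {"name": (0, 0, 3, 2), "date": (3, 2, 6, 4)}


def count_ink(zone: Image) -> str:
    """Stand-in for an OCR engine: reports the number of ink pixels in the zone."""
    return str(sum(sum(row) for row in zone))


results = zonal_ocr(page, fields, count_ink)
print(results)
```

Because each field is recognized independently, any mark that falls outside its field's box is simply never seen by the recognizer for that field.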
In an industrial production document processing application, it is desirable to use zonal OCR algorithms. One advantage of zonal OCR algorithms is that they enhance accuracy of OCR by constraining the character set and character combinations (lexicon) allowed on a per-field basis. Another advantage is that they may be built into highly efficient production workflows. In a production setting, it can be cumbersome or impossible to redefine the boundaries of each data field in a zonal OCR process on an image-by-image basis using the output of an algorithm that assigns data to data fields.
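By way of illustration only, a per-field character-set constraint might be sketched as follows. The field names, the `FIELD_CHARSETS` table, and the assumption that the recognizer returns a ranked candidate list per character position are hypothetical. Only the character-set part of the lexicon is shown; constraints on character combinations are omitted.

```python
import string
from typing import Dict, List, Set

# Hypothetical per-field character sets.
FIELD_CHARSETS: Dict[str, Set[str]] = {
    "zip_code": set(string.digits),
    "state": set(string.ascii_uppercase),
}


def constrain(field: str, candidates: List[List[str]]) -> str:
    """Pick, for each character position, the best-ranked candidate the field allows.

    `candidates` holds one list per character position, ordered from most to
    least likely. A position with no allowed candidate is reported as '?'.
    """
    allowed = FIELD_CHARSETS[field]
    chosen = []
    for ranked in candidates:
        chosen.append(next((c for c in ranked if c in allowed), "?"))
    return "".join(chosen)


# The recognizer prefers letters, but the field only admits digits.
ambiguous = [["O", "0"], ["l", "1"], ["S", "5"]]
zip_reading = constrain("zip_code", ambiguous)
state_reading = constrain("state", [["N", "H"], ["Y", "V"]])
print(zip_reading, state_reading)
```

The constraint can only help when the recognizer is fed the data that belongs to the field, which is why correct association of data with data fields matters.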
Zonal OCR works correctly when printed and/or handwritten data is confined to the correct locations on the form, as defined by the boundaries of the data fields. However, zonal OCR fails to work correctly when printed and/or handwritten data is misregistered with respect to the data fields.
With reference to FIG. 1, an example of a color-dropout form is provided. As illustrated, all of the background form information has been removed and the only markings scanned are printed and/or handwritten data entries and image noise. Further, boxes representing the nominal locations and boundaries of the data fields are overlaid on the color-dropout form. Color-dropout forms are convenient because the printed background form information generally cannot be confused with entered data. With reference to FIG. 2, a close-up of a region of the color-dropout form illustrates misregistration between data and data field boundaries.
In view of the foregoing, a challenge with zonal OCR is how to associate printed and/or handwritten data with corresponding data fields even when the data falls outside the delineated boundaries of the data fields. A solution to this challenge would advantageously permit zonal OCR to be applied to documents whose data actually occurs outside of intended field boundaries.
Known solutions expand the data field boundaries used for zonal OCR. This works satisfactorily as long as the boundary expansion includes the intended data, but does not include data from adjacent fields. However, when data fields are close together and/or data is misregistered, this approach leads to incorrect assignments of data to data fields.
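By way of illustration only, the failure mode of boundary expansion can be sketched as follows. The field names, coordinates, margins, and the rule that a field claims a mark when the expanded boundary fully contains the mark's bounding box are assumptions made for this sketch, not a description of any particular known system.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def expand(box: Box, margin: int) -> Box:
    """Grow a field boundary by `margin` pixels on every side."""
    left, top, right, bottom = box
    return (left - margin, top - margin, right + margin, bottom + margin)


def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def claiming_fields(fields: Dict[str, Box], mark: Box, margin: int) -> List[str]:
    """List every field whose expanded boundary fully contains the mark."""
    return [name for name, box in fields.items() if contains(expand(box, margin), mark)]


# Two closely spaced hypothetical fields, 4 pixels apart vertically.
fields = {
    "first_name": (10, 10, 110, 30),
    "last_name": (10, 34, 110, 54),
}
# A handwritten entry that drifted downward and straddles the gap.
mark = (20, 28, 80, 38)

print(claiming_fields(fields, mark, 0))  # no expansion: no field contains the mark
print(claiming_fields(fields, mark, 4))  # small expansion: still not captured
print(claiming_fields(fields, mark, 8))  # large expansion: both fields claim it
```

In this toy layout no single margin works: an expansion small enough to keep the fields separate misses the entry, and one large enough to capture it also pulls it into the adjacent field.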
The present application provides new and improved methods and systems which overcome the above-referenced challenges.