The present exemplary embodiments disclosed herein relate generally to image processing. They find particular application in conjunction with localizing data fields of forms, and will be described with particular reference thereto. However, it is to be appreciated that the present exemplary embodiments are also amenable to other like applications.
Forms are a type of document that provides pre-defined data fields for entry of data. The spatial organization of data fields facilitates capture of the data in a structured and organized fashion by human and automatic means. In a straightforward case, each data field can be cropped out of an image of the form and run through Optical Character Recognition (OCR) individually. This is called zonal OCR. Zonal OCR works correctly when printed and/or handwritten data is confined to the correct locations on the form, as defined by the boundaries of the data fields. However, zonal OCR fails to work correctly when printed and/or handwritten data is misregistered with respect to the data fields.
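By way of illustration only, the zonal OCR procedure described above can be sketched as follows. The sketch assumes an image represented as rows of pixel values and data fields given as (left, top, right, bottom) boxes. The names `crop`, `zonal_ocr`, and `recognize` are hypothetical and introduced solely for this example; `recognize` stands in for whatever OCR engine is used and is not part of any particular library.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representations for this sketch (not taken from the source):
Image = List[List[int]]          # rows of pixel values, 0 = background
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image lying inside the nominal field boundary."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def zonal_ocr(
    image: Image,
    fields: Dict[str, Box],
    recognize: Callable[[Image], str],
) -> Dict[str, str]:
    """Crop each data field and run the supplied recognizer on it alone."""
    return {name: recognize(crop(image, box)) for name, box in fields.items()}
```

Because each field is recognized only from the pixels inside its own box, any marking that falls outside the box is never seen by the recognizer for that field, which is the failure mode described above for misregistered data.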
With reference to FIG. 1, an example of a color-dropout form is provided. As illustrated, all of the background form information has been removed and the only markings scanned are printed and/or handwritten data and image noise. Further, boxes representing the nominal locations and boundaries of the data fields are overlaid on the color-dropout form. Color-dropout forms are convenient because the printed background form information generally cannot be confused with entered data. With reference to FIG. 2, a close-up of a region of the color-dropout form illustrates misregistration between data and data field boundaries.
In view of the foregoing, a challenge with zonal OCR is how to associate printed and/or handwritten data with corresponding data fields even when the data falls outside the delineated boundaries of the data fields. A solution to this challenge would advantageously permit zonal OCR to be applied to regions of the page where the data actually occurs instead of merely where the data is supposed to occur. Known solutions expand the data field boundaries used for zonal OCR. This works satisfactorily as long as the expanded boundaries include the intended data and exclude data from adjacent fields. However, when data fields are close together and/or data is misregistered, this approach leads to incorrect assignments of data to data fields.
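The limitation of boundary expansion can be illustrated with a minimal sketch. It assumes axis-aligned field boxes given as (left, top, right, bottom) and a uniform expansion margin in pixels. The function names, field names, and coordinates are hypothetical and chosen only for this example; they do not describe any particular known system.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Point = Tuple[int, int]          # (x, y) location of a marking


def expand(box: Box, margin: int) -> Box:
    """Grow a field boundary by the same margin on every side."""
    left, top, right, bottom = box
    return (left - margin, top - margin, right + margin, bottom + margin)


def contains(box: Box, point: Point) -> bool:
    """True when the point lies inside the box."""
    left, top, right, bottom = box
    x, y = point
    return left <= x < right and top <= y < bottom


def candidate_fields(fields: Dict[str, Box], point: Point, margin: int) -> List[str]:
    """List every field whose expanded boundary captures the marking."""
    return [
        name for name, box in fields.items()
        if contains(expand(box, margin), point)
    ]


# Two closely spaced fields (hypothetical coordinates) with a 4-pixel gap.
fields = {"name": (0, 0, 100, 20), "date": (0, 24, 100, 44)}
mark = (50, 22)  # a marking that landed in the gap between the fields

print(candidate_fields(fields, mark, margin=0))  # captured by no field
print(candidate_fields(fields, mark, margin=4))  # captured by both fields
```

With no expansion the marking is missed entirely, and with an expansion large enough to capture it the marking falls inside both expanded boundaries, so the assignment to a single data field is ambiguous.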
The present application provides new and improved methods and systems which overcome the above-referenced challenges.