Businesses, industries, and other organizations such as real estate agencies, government agencies, corporations, and so forth typically have numerous standard forms that are used with regularity. For example, a real estate agent typically completes one or more forms for each real estate transaction, such forms including form entries for attributes of the property to be purchased, form entries about the buyer, form entries about the seller, and form entries for other relevant information about the transaction. Similarly, a corporation or other employer typically has job applicants or new hires complete various standard forms providing information such as name, address, employment position or position sought, contact information, and so forth.
The blank form is typically filled out by hand, using a typewriter, or using a computer. The completed form is typically signed by one or more authorized persons, and a completed and signed paper copy is sent to a central collection point (such as a central office of a real estate agency, or the office of human resources of a corporation or corporate division, or so forth) where the form entries are to be read into a suitable database. The form reading can be done manually, e.g., clerical staff can be provided to manually transcribe each form entry into the database. However, such a manual approach is inefficient and prone to human error.
Accordingly, automated reading of such completed forms is of interest for increasing efficiency and accuracy. One approach is to use standardized software for generating the completed form. For example, some word processing programs provide form capability including form entry dialog boxes that can be completed by an end-user. In such cases, the form entry dialog boxes are readily identifiable by the word processing program. However, this approach requires the use of a standardized software program or suite of programs by all persons or entities involved in generating completed forms. Such standardization is sometimes not achieved within a corporation or other organization. Moreover, if forms are completed by outside persons or entities, these outside persons or entities may use incompatible software. The form may also be printed out as a blank form that is completed by hand or using a typewriter.
To accommodate forms that are generated by different types of software or by hand or using a typewriter, it is convenient to process the completed forms as paper originals or copies. Technology exists to optically scan the completed form to generate a digital image, and to perform optical character recognition (OCR) to derive a text-based converted document from the scanned digital image. Off-line handwriting recognition software can operate analogously to OCR to convert handwritten form entries to textual content.
However, existing systems have difficulty in accurately identifying the form entries in the text-based converted document.
In one approach, the OCR text is divided into a textbox for each word, number, or other grouping of letters and/or numbers, and each textbox includes the spatial coordinates of the text on the physical form page. The form entries are then identified based on their position on the form page as reflected by the spatial coordinates stored with each textbox. Errors in positioning the original paper document on the scanner can be corrected by registration processing that translates or rotates the scanned image prior to performing OCR. Such approaches are suitable when the form has a known layout that is precisely the same for each completed form.
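The position-based identification described above can be illustrated with a minimal sketch. It is not taken from any particular system: the `TextBox` structure, the `TEMPLATE` field regions, the field names, and all coordinates are assumptions chosen for illustration. Each textbox is assigned to the form entry whose expected region contains the center of the textbox.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextBox:
    """One OCR token together with its position on the page (assumed units)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


# Hypothetical fixed layout: field name -> (left, top, right, bottom).
TEMPLATE: Dict[str, Tuple[float, float, float, float]] = {
    "buyer_name": (100, 200, 400, 230),
    "sale_price": (100, 260, 400, 290),
}


def extract_entries(boxes: List[TextBox],
                    template: Dict[str, Tuple[float, float, float, float]]
                    ) -> Dict[str, str]:
    """Assign each textbox to the field region that contains its center."""
    matched: Dict[str, List[TextBox]] = {name: [] for name in template}
    for box in boxes:
        cx, cy = box.center()
        for name, (left, top, right, bottom) in template.items():
            if left <= cx <= right and top <= cy <= bottom:
                matched[name].append(box)
                break
    # Join the tokens of each field in left-to-right order.
    return {
        name: " ".join(b.text for b in sorted(found, key=lambda b: b.x))
        for name, found in matched.items()
    }


boxes = [
    TextBox("Jane", 110, 205, 40, 15),
    TextBox("Doe", 155, 205, 35, 15),
    TextBox("250000", 110, 265, 60, 15),
]
entries = extract_entries(boxes, TEMPLATE)
print(entries)
```

The lookup succeeds only because the textbox coordinates match the regions assumed by the template, which is the precondition stated above.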
In practice, however, the form layout may differ between completed forms, even when the original blank form is nominally identical. For example, different printing systems may use different fonts, different paper sizes, different pagination, or so forth, so that the printed forms differ in spatial layout. Mechanical problems in the printing or scanning processes can also create discrepancies in the printed form layout. Still further, in some cases the blank form may be modified, either globally (e.g., an updated version of the form may be released with different “boilerplate” text that changes the layout), or locally (e.g., a local office may update the form to accord with local laws or other local circumstances, thus changing the layout). Even apparently small changes in the form layout can be problematic when form entries are identified based on spatial position on the page.
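The sensitivity to small layout changes can be shown with a short arithmetic sketch. The region bounds, line height, and entry position below are assumed values, not measurements from any real form. The sketch checks whether the vertical center of a form entry still falls inside its expected band after extra lines of boilerplate text are inserted above it.

```python
from typing import List, Tuple

# Assumed values for illustration only.
REGION: Tuple[float, float] = (200.0, 230.0)  # expected (top, bottom) band of the entry
ORIGINAL_CENTER = 212.5                       # vertical center of the entry on the original form
LINE_HEIGHT = 14.0                            # height of one inserted line of boilerplate


def in_region(y_center: float, region: Tuple[float, float]) -> bool:
    """True if the vertical center lies inside the expected band."""
    top, bottom = region
    return top <= y_center <= bottom


def shifted_center(y_center: float, extra_lines: int, line_height: float) -> float:
    """Vertical center after extra_lines lines are inserted above the entry."""
    return y_center + extra_lines * line_height


results: List[Tuple[int, bool]] = []
for extra_lines in (0, 1, 2):
    y = shifted_center(ORIGINAL_CENTER, extra_lines, LINE_HEIGHT)
    results.append((extra_lines, in_region(y, REGION)))

print(results)  # [(0, True), (1, True), (2, False)]
```

Under these assumed numbers, two added lines of boilerplate are enough to move the entry out of its expected region, so a purely position-based lookup would miss it.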
Thus, there remains an unfulfilled need for a form reader for processing scanned completed forms, which is robust against layout changes due to form revision updates, local versioning of the form, mechanical differences in printing of the blank form or scanning of the completed form, and so forth.