1. Field
Embodiments of the present invention relate generally to data capture by means of optical character recognition of forms, and specifically to a method and system for creating a flexible structure description for a form.
2. Related Art
Data on paper documents may be extracted and entered into a computer system for storage, analysis, and further processing. Examples of such paper documents include invoices, receipts, questionnaires, tax return forms, etc. These paper documents may have varying structures. If the number of documents to be processed is large, automated data and document capture systems can advantageously be used.
A form is a structured document with one or more pages to be filled out by a human, either manually or using a printing device. Typically, a form has fields to be completed, with an inscription next to each field stating the nature of the data the field should contain.
Two types of forms can be identified: fixed forms and flexible forms. A fixed form has the same positioning and number of fields on all of its copies (instances) and often has anchor elements (e.g., black squares or separator lines), whereas a flexible, or semi-structured, form may have a different number of fields, which may be positioned differently from copy to copy. Examples of flexible forms include application forms, invoices, insurance forms, money order forms, business letters, etc. (FIGS. 4a-4d). For example, invoices will often have different numbers of fields located differently, as they are issued by different companies (FIGS. 4a and 4b). Nevertheless, common fields, e.g., an invoice number (401) and a total amount (404), may be found on all invoices, even though they may be placed differently.
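The practical difference between the two form types can be illustrated with a minimal sketch. It is not the method of the described embodiments. It assumes that OCR output is available as a list of words with page coordinates. The names `Word`, `read_fixed`, and `read_flexible`, the coordinate values, and the label text `Invoice#` are all hypothetical and chosen only for illustration. On a fixed form a field can be read from a constant page region, whereas on a flexible form the field has to be found relative to its inscription.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Word:
    """One recognized word with the top-left corner of its bounding box (hypothetical)."""
    text: str
    x: int
    y: int


def read_fixed(words: List[Word], region: Tuple[int, int, int, int]) -> str:
    """Fixed form: read whatever lies inside a constant (left, top, right, bottom) region."""
    left, top, right, bottom = region
    hits = [w for w in words if left <= w.x < right and top <= w.y < bottom]
    return " ".join(w.text for w in sorted(hits, key=lambda w: w.x))


def read_flexible(words: List[Word], label: str, line_tolerance: int = 5) -> Optional[str]:
    """Flexible form: find the inscription, then take the nearest word to its right
    on roughly the same text line."""
    for anchor in words:
        if anchor.text.lower() == label.lower():
            same_line = [
                w for w in words
                if abs(w.y - anchor.y) <= line_tolerance and w.x > anchor.x
            ]
            if same_line:
                return min(same_line, key=lambda w: w.x).text
    return None


# Two invoices from different issuers: same fields, different positions.
invoice_a = [Word("Invoice#", 50, 40), Word("A-1001", 150, 40),
             Word("Total", 50, 300), Word("99.50", 150, 300)]
invoice_b = [Word("Total", 300, 80), Word("12.00", 380, 82),
             Word("Invoice#", 300, 500), Word("B-77", 400, 498)]

number_region = (140, 30, 300, 50)  # where invoice_a happens to print its number

fixed_a = read_fixed(invoice_a, number_region)
fixed_b = read_fixed(invoice_b, number_region)   # region is empty on invoice_b
flex_a = read_flexible(invoice_a, "Invoice#")
flex_b = read_flexible(invoice_b, "Invoice#")
```

In this sketch the fixed region tuned to the first invoice finds nothing on the second one, while the label-relative search locates the invoice number on both. A real flexible description would need to handle multi-word labels, label variants, and values placed below rather than beside the inscription.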
Flexible forms may be converted into electronic format and made editable by means of a data capture system using Optical Character Recognition (OCR). For efficient data capture, the data capture system has to be trained in advance to detect the useful data fields on documents of the various types that the system will handle. As a result, the system can detect the required fields and extract data from them automatically. A highly skilled expert is required to train the system to detect the necessary data fields on documents of a given type. The training is done in a dedicated editing application and is very labor-intensive.