Data capture systems are used to extract data from paper documents or from images created from such documents. A typical data capture system consists of an imaging device that acquires the image of the document and software, running on a computer, that processes the acquired image.
Typically, data from paper documents are captured and entered into a computer system by a data capture system, which converts paper documents into electronic form (by scanning or photographing documents) and then extracts data from document fields within the document for storage, analysis, and further processing. These paper documents may have varying structures.
A structured document is a fixed or flexible form with one or more pages to be filled out by a human, either manually or using a printing device. Typically, a form has fields to be completed with an inscription next to each field stating the nature of the data the field should contain.
A fixed form has the same positioning and number of fields on all of its copies (instances) and often has anchor elements (e.g. black squares or separator lines), whereas a flexible, or semi-structured, form may have a different number of fields, which may be positioned differently from copy to copy.
Examples of flexible forms include application forms, invoices, insurance forms, money order forms, business letters, receipts, tax return forms, etc. For example, invoices often have different numbers of fields in different locations, as they are issued by different companies. At the same time, common fields, e.g. an invoice number and a total amount, may be found on all invoices, even though they may be placed differently.
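As an illustration of why a flexible form cannot be read from fixed coordinates, the following minimal sketch locates a value by searching for the inscription next to it rather than by its position on the page. It assumes the page has already been recognized into words with coordinates; the `Word` layout, the function name `find_value_right_of_label`, and the `max_dy` tolerance are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a recognized word: (text, x, y).
Word = Tuple[str, float, float]

def find_value_right_of_label(words: List[Word], label: str,
                              max_dy: float = 5.0) -> Optional[str]:
    """Return the nearest word to the right of `label` on the same text line."""
    # Find the first word whose text matches the label (case-insensitive).
    label_pos = next(((x, y) for text, x, y in words
                      if text.lower() == label.lower()), None)
    if label_pos is None:
        return None
    lx, ly = label_pos
    # Candidates lie to the right of the label and at roughly the same height.
    candidates = [(x, text) for text, x, y in words
                  if x > lx and abs(y - ly) <= max_dy]
    # The candidate with the smallest x is the closest one to the label.
    return min(candidates)[1] if candidates else None

# Two invoices from different issuers place the total in different locations.
invoice_a = [("ACME", 20, 20), ("Total:", 400, 700), ("129.50", 480, 702)]
invoice_b = [("Total:", 50, 300), ("88.00", 120, 300), ("Thanks", 50, 340)]

total_a = find_value_right_of_label(invoice_a, "Total:")
total_b = find_value_right_of_label(invoice_b, "Total:")
```

A real system would need more robust matching (multi-word labels, values below the label, recognition errors); the sketch only shows the idea of anchoring a field to its inscription.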
To process structured documents, a data capture system should be provided with information about their fields. This information may include the position of the fields in relation to page boundaries or other objects, properties of the data, validation rules, etc. Advantageously, if the number of documents to be processed is large, automated data and document capture systems can be used.
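The kind of field information described above might be represented as follows. This is a minimal sketch under stated assumptions, not the format of any particular data capture product: the `FieldDescription` class, the relative-coordinate convention, the `to_pixels` helper, and the invoice-number pattern are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class FieldDescription:
    """Hypothetical description of one field on a form."""
    name: str
    # Position relative to page boundaries, as fractions of page width and
    # height: (left, top, right, bottom).
    region: Tuple[float, float, float, float]
    # Property of the data expected in the field.
    data_type: str
    # Validation rule applied to the extracted text.
    validate: Callable[[str], bool]

def to_pixels(field: FieldDescription, page_width: int,
              page_height: int) -> Tuple[int, int, int, int]:
    """Convert the relative region to pixel coordinates for a given page size."""
    left, top, right, bottom = field.region
    return (round(left * page_width), round(top * page_height),
            round(right * page_width), round(bottom * page_height))

# Assumed example: an invoice number such as "AB-123456" in the top right corner.
invoice_number = FieldDescription(
    name="invoice_number",
    region=(0.60, 0.05, 0.95, 0.10),
    data_type="text",
    validate=lambda s: re.fullmatch(r"[A-Z]{2}-\d{6}", s) is not None,
)

box = to_pixels(invoice_number, page_width=1000, page_height=2000)
```

Storing the region as fractions of the page keeps the description independent of scan resolution; a description based on anchor elements or other objects would need a different representation.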
For efficient data capture of flexible forms, the data capture system has to be trained in advance to detect the useful data fields on documents of the various types that the system will handle. As a result, the system can detect the required fields and extract data from them automatically. A highly skilled expert is required to train the system to detect the necessary data fields on documents of a given type. The training is done in a dedicated editing application and is very labor-intensive.
Many documents, for example phone bills, invoices, questionnaires, or registration forms, are multi-page documents, i.e. they have more than one page. Very often the information in one-page or multi-page documents contains repetitive structures (e.g. repetitive fields or groups of fields). In other words, it consists of multiple groups of data having identical structures; for example, each group of fields may have a subheading, a table fragment, a subtotal, or a caption for the table fragment. The number and size of groups may vary from document to document of the given type and, consequently, the number of pages may also vary.
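A repetitive structure of this kind can be pictured as a variable-length list of identically shaped groups. The sketch below assumes that the lines of a document have already been classified as `"subheading"`, `"row"`, or `"subtotal"`; these labels, the dictionary layout, and the function name `split_into_groups` are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List, Tuple

def split_into_groups(lines: List[Tuple[str, str]]) -> List[Dict]:
    """Split a flat sequence of classified lines into repeating groups.

    Each group starts at a subheading, collects table rows, and may end
    with a subtotal. Lines before the first subheading are ignored.
    """
    groups: List[Dict] = []
    current = None
    for kind, text in lines:
        if kind == "subheading":
            # A subheading opens a new group with the same structure.
            current = {"subheading": text, "rows": [], "subtotal": None}
            groups.append(current)
        elif current is None:
            continue  # content that precedes the first group
        elif kind == "row":
            current["rows"].append(text)
        elif kind == "subtotal":
            current["subtotal"] = text
    return groups

# Assumed example: a phone bill with two groups of different sizes.
phone_bill = [
    ("other", "Statement for March"),
    ("subheading", "Line 555-0100"),
    ("row", "Local calls 4.20"),
    ("row", "Messages 1.10"),
    ("subtotal", "5.30"),
    ("subheading", "Line 555-0101"),
    ("row", "Local calls 2.00"),
    ("subtotal", "2.00"),
]

groups = split_into_groups(phone_bill)
```

Because the number of groups and the number of rows per group are not fixed, the same description has to cover documents of different lengths, including ones whose groups continue across page boundaries.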
A multi-page document may have tables with a complex, irregular structure that cannot be recognized by common methods of detecting rows and columns or by detecting table cells.