1. Field
Embodiments of the present invention involve implementations of methods and systems for creating a document structure description and capturing data from a document image.
2. Related Art
Typically, data from paper documents are captured into a computer database by a data capture system, which converts paper documents into electronic form (by scanning or photographing the documents) and extracts data from fields within each document.
Many documents, for example, phone bills, invoices, or registration forms, are multi-page documents, i.e., they have more than one page (an example of a multi-page document is shown in FIGS. 4A and 4B of the drawings). Often, the information contained in multi-page documents includes multiple groups of data having identical structures; for example, each group of fields may have a subheading, a table fragment, a subtotal, or a caption for the table fragment. The number and size of the groups may vary from document to document of a given type and, consequently, the number of pages may also vary.
Sometimes, multi-page paper documents are immediately converted into multi-page electronic documents (e.g., into PDF or TIFF files), in which case a data capture system has to know in advance the pages that comprise the multi-page document. In other cases, documents are scanned page by page and appear as a sequence of individual images in the document capture system (sometimes, separator pages are used in this case to separate one document from another). In still other cases, documents of different types may be scanned one immediately after another, without any special separators. Therefore, in the general case, capturing data from a multi-page document requires identifying the page images that belong to a document of a certain type and then detecting and extracting the relevant data from the data fields.
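By way of illustration only, the separator-page case mentioned above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular data capture system: the function name `split_by_separators` and the `is_separator` predicate are hypothetical, and pages are represented by plain strings in place of scanned images.

```python
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def split_by_separators(
    pages: Sequence[Page],
    is_separator: Callable[[Page], bool],
) -> List[List[Page]]:
    """Group a scanned page sequence into documents at separator pages.

    `is_separator` is a hypothetical predicate supplied by the caller;
    how a separator page is actually recognized is outside this sketch.
    """
    documents: List[List[Page]] = []
    current: List[Page] = []
    for page in pages:
        if is_separator(page):
            # A separator closes the document collected so far, if any.
            if current:
                documents.append(current)
                current = []
        else:
            current.append(page)
    # Pages after the last separator form the final document.
    if current:
        documents.append(current)
    return documents


# Example: "SEP" stands in for a separator page image.
scanned = ["bill-p1", "bill-p2", "SEP", "invoice-p1", "SEP", "form-p1", "form-p2"]
grouped = split_by_separators(scanned, lambda p: p == "SEP")
```

The sketch covers only the case in which separators are present. When documents of different types are scanned one after another without separators, the page boundaries cannot be recovered this way and must be inferred from the page content itself.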
As described in U.S. application Ser. Nos. 12/364,266 and 11/461,449, specially prepared flexible structure descriptions are used to capture data from paper documents. A flexible structure description comprises elements and relationships between the elements. A data field may be a type of element that identifies an area on the image from which data are to be extracted and the type of data that this area may contain. The positions of the fields are detected based on reference elements or anchors. An anchor corresponds to one or more predefined image elements (e.g., separator line, unchangeable text, picture) relative to which the positions of other elements are specified.
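For illustration, the relationship between anchors and data fields may be sketched as a simple data structure. This is a hypothetical sketch and does not reproduce the format used in the referenced applications: the class names, the `(left, top, right, bottom)` pixel convention, and the fixed-offset relationship between a field and its anchor are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed convention: (left, top, right, bottom) in image pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Anchor:
    """A reference element, e.g. unchangeable text such as 'Total:'."""
    name: str
    text: str


@dataclass
class DataField:
    """An area to extract data from, positioned relative to an anchor."""
    name: str
    data_type: str  # kind of data expected in the area, e.g. "amount"
    anchor: str     # name of the anchor this field is positioned against
    offset: Box     # box corners relative to the anchor's top-left corner


@dataclass
class FlexibleStructureDescription:
    anchors: List[Anchor]
    fields: List[DataField]

    def locate_fields(self, found_anchors: Dict[str, Box]) -> Dict[str, Box]:
        """Compute field regions from the anchor boxes found on an image.

        Anchor detection itself is assumed to happen elsewhere; this method
        only applies the stored relative offsets.
        """
        regions: Dict[str, Box] = {}
        for f in self.fields:
            anchor_box = found_anchors.get(f.anchor)
            if anchor_box is None:
                continue  # anchor not found, so the field cannot be placed
            ax, ay = anchor_box[0], anchor_box[1]
            dx0, dy0, dx1, dy1 = f.offset
            regions[f.name] = (ax + dx0, ay + dy0, ax + dx1, ay + dy1)
        return regions


description = FlexibleStructureDescription(
    anchors=[Anchor(name="total_label", text="Total:")],
    fields=[DataField(name="total", data_type="amount",
                      anchor="total_label", offset=(70, 0, 200, 20))],
)
# Suppose the anchor text was detected at this position on a page image.
regions = description.locate_fields({"total_label": (100, 500, 160, 520)})
```

In this sketch, a field whose anchor is not found on the image is simply omitted from the result. The relationships between elements in an actual flexible structure description may be more varied than a single fixed offset.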
A flexible structure description may comprise an algorithm for detecting fields on semi-structured documents. Flexible structure descriptions are typically created by human experts and are loaded into a data capture system to be automatically or programmatically matched against incoming documents. However, existing techniques are inadequate for quickly and adaptively creating flexible structure descriptions for multi-page documents.