Typically, data from paper documents are captured into a computer database by a data capture system, which converts paper documents into electronic form (by scanning or photographing documents) and then extracts data from document fields within the document.
Many documents, for example, phone bills, invoices, or registration forms, are multi-page documents, that is, they have more than one page (an example of a multi-page document is shown in FIG. 4 of the drawings). Very often, the information contained in multi-page documents consists of multiple groups of data having identical structures; for example, each group of fields may have a subheading, a table fragment, a subtotal, or a caption for the table fragment. The number and size of groups may vary from document to document of a given type and, consequently, the number of pages may also vary.
Sometimes, multi-page paper documents are immediately converted into multi-page electronic documents (e.g. into PDF or TIFF files), in which case a data capture system has to know in advance the pages that comprise the multi-page document. In other cases, documents are scanned page by page and appear as a sequence of individual images in the document capture system (sometimes, separator pages are used in this case to separate one document from another). In still other cases, documents of different types may be scanned one immediately after another, without any special separators. Therefore, in the general case, to capture data from a multi-page document, we first need to identify the page images that all belong to a document of a certain type and then detect and extract the relevant data from the data fields.
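The separator-page case described above can be illustrated with a minimal sketch. It assumes the scanned pages arrive as an ordered sequence and that some predicate can recognize a separator page; the names `split_by_separators` and `is_separator` are hypothetical and are introduced here only for illustration. The sketch does not cover the harder case in which documents are scanned without separators.

```python
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def split_by_separators(
    pages: Sequence[Page],
    is_separator: Callable[[Page], bool],
) -> List[List[Page]]:
    """Group a scanned page sequence into documents.

    Every run of consecutive non-separator pages becomes one document.
    Separator pages themselves are discarded, and consecutive separators
    do not produce empty documents.
    """
    documents: List[List[Page]] = []
    current: List[Page] = []
    for page in pages:
        if is_separator(page):
            # A separator closes the document collected so far, if any.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    # The last document is usually not followed by a separator.
    if current:
        documents.append(current)
    return documents
```

In this sketch a page can be any object (for example, an image or a file name); only the predicate needs to know how a separator page is recognized.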
Usually, specially prepared flexible structure descriptions are used to capture data from paper documents. A flexible structure description comprises elements and relationships between the elements. A data field is a type of element that identifies an area on the image from which data are to be extracted and the type of data that this area may contain. The positions of the fields are usually detected based on reference elements, or anchors. An anchor corresponds to one or more predefined image elements (e.g. separator line, unchangeable text, picture, etc.) relative to which the positions of other elements are specified.
A flexible structure description also comprises an algorithm for detecting fields on semi-structured documents.
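As a rough illustration of the concepts above, a flexible structure description might be modeled as a set of anchors and a set of fields whose positions are given relative to those anchors. The sketch below is a simplified assumption and not the format used by any particular data capture system: the class names `Anchor`, `Field`, and `FlexibleDescription`, the fixed pixel offsets, and the `locate_fields` method are all hypothetical. Anchor detection itself (finding the fixed text or line on the image) is assumed to have been done elsewhere.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[int, int]             # (x, y) in image pixels
Region = Tuple[int, int, int, int]  # (x, y, width, height)


@dataclass
class Anchor:
    """A reference element, e.g. unchangeable text or a separator line."""
    name: str
    text: str  # hypothetical: the fixed text expected on the image


@dataclass
class Field:
    """An area from which data are extracted, positioned relative to an anchor."""
    name: str
    data_type: str  # the type of data the area may contain
    anchor: str     # name of the anchor this field depends on
    offset: Point   # displacement from the anchor's top-left corner
    size: Point     # (width, height) of the field area


@dataclass
class FlexibleDescription:
    anchors: List[Anchor]
    fields: List[Field]

    def locate_fields(self, found_anchors: Dict[str, Point]) -> Dict[str, Region]:
        """Compute field regions from anchor positions detected on an image.

        Fields whose anchor was not found are skipped.
        """
        regions: Dict[str, Region] = {}
        for f in self.fields:
            pos = found_anchors.get(f.anchor)
            if pos is None:
                continue
            x = pos[0] + f.offset[0]
            y = pos[1] + f.offset[1]
            regions[f.name] = (x, y, f.size[0], f.size[1])
        return regions
```

For example, a description with one anchor named "total_label" and one field placed 120 pixels to its right would yield the field region as soon as the anchor's position on a given page image is known. A real description would typically express positions as search constraints rather than fixed offsets, which is what makes it "flexible".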
Flexible structure descriptions are created by human experts and are loaded into a data capture system to be automatically matched against incoming documents.