Typically, data from paper documents are captured into a computer database by a data capture system, which converts the paper documents into electronic form (by scanning or photographing them) and then extracts data from the fields within each document.
Many documents, such as phone bills, invoices, and registration forms, are multi-page documents, that is, they have more than one page. An example of a multi-page document is shown in FIG. 4A and FIG. 4B. The information contained in multi-page documents often includes multiple groups of data having identical structures. For example, each group of fields may have a subheading, a table fragment, a subtotal, or a caption for the table fragment. The number and size of the groups may vary from document to document of a given type and, consequently, the number of pages may also vary.
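The repeating-group structure described above can be pictured with a small data model. The following is a minimal sketch only; the class names `FieldGroup` and `MultiPageDocument` and all of their attributes are hypothetical and are not taken from any particular data capture system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldGroup:
    """One repeating group of fields (hypothetical model)."""
    subheading: str
    table_rows: List[List[str]]  # rows of the table fragment
    subtotal: str                # kept as text, as captured from the page
    caption: str = ""            # optional caption for the table fragment


@dataclass
class MultiPageDocument:
    """A document of a given type holding a variable number of groups."""
    doc_type: str
    groups: List[FieldGroup] = field(default_factory=list)


# Two documents of the same type may carry different numbers of groups,
# which is why their page counts may differ as well.
short_bill = MultiPageDocument("phone bill", [
    FieldGroup("Line 1", [["Call", "0.10"]], "0.10"),
])
long_bill = MultiPageDocument("phone bill", [
    FieldGroup("Line 1", [["Call", "0.10"], ["SMS", "0.05"]], "0.15"),
    FieldGroup("Line 2", [["Call", "0.20"]], "0.20"),
])
```

Every group shares the same structure, while the length of `groups` and of each `table_rows` list is free to vary from document to document.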
Sometimes, multi-page paper documents are immediately converted into multi-page electronic documents (e.g., portable document format (PDF) and tagged image file format (TIFF) files), in which case the data capture system is often required to know in advance which pages make up the multi-page document. In other cases, documents are scanned page by page and appear as a sequence of individual images in the document capture system. Page-by-page feeding is time-consuming and error-prone. Sometimes separator pages are used to separate one document from another. In such cases, the pages of a single document may be placed into a separate electronic document. In still other cases, documents of different types may be scanned one immediately after another, without any special separators. In these cases, separate paper documents may erroneously end up in a single electronic document. Therefore, in the general case, to capture data from a multi-page document, it is necessary to identify the page images that all belong to a single document of a certain type and then detect and extract the relevant data from the data fields. These and other shortcomings of the current art are overcome by use of the teachings described herein.
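As a rough illustration of the separator-page approach mentioned above, the sketch below groups a stream of scanned pages into documents. It assumes each page has already been reduced to a string label and that separator sheets are recognized as the hypothetical marker `SEPARATOR`; the function name `split_on_separators` is likewise an assumption made for this example and does not represent the teachings described herein.

```python
from typing import Iterable, List

# Hypothetical label assigned to a recognized separator sheet.
SEPARATOR = "SEPARATOR"


def split_on_separators(pages: Iterable[str]) -> List[List[str]]:
    """Group a page stream into documents, cutting at each separator page."""
    documents: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        if page == SEPARATOR:
            # Close the current document; ignore empty runs between separators.
            if current:
                documents.append(current)
                current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


# With a separator, the two documents are kept apart.
with_separator = split_on_separators(
    ["invoice-p1", "invoice-p2", SEPARATOR, "form-p1"]
)

# Without one, pages of different documents fall into a single document.
without_separator = split_on_separators(
    ["invoice-p1", "invoice-p2", "form-p1"]
)
```

The second call shows the failure mode noted above: when no separators are present, pages from separate paper documents end up in one electronic document.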