In large-volume automated document analysis and understanding systems, paper documents are scanned and processed using OCR (optical character recognition) and region analysis programs. OCR (and/or segmentation) engines break each page into individual “zones,” within each of which the image of text has been translated into editable text. In some applications, the OCR engine includes the segmentation engine, so that segmentation and text recognition are performed by a single engine.
Unfortunately, the zones created by such OCR engines fail to provide the flexibility required by applications configured to process the zones. For example, article-extraction applications are configured to extract articles from zones created by OCR engines. During operation of such an application, a single page of a document may contain several zones representing text associated with several different articles. As a result, articles are often assembled with “extra” zones (zones that do not belong to the article) and/or “missing” zones (zones that belong to the article but were left out).
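By way of illustration only, the notion of “extra” and “missing” zones can be sketched as a comparison between the zones an application assembled into an article and the zones that actually belong to it. The `Zone` record, its fields, the `compare_zones` helper, and the sample data below are hypothetical and are not taken from any particular OCR engine or article-extraction application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical record for one zone produced by an OCR/segmentation engine."""
    zone_id: int
    page: int
    text: str


def compare_zones(assembled_ids, expected_ids):
    """Return (extra, missing) zone ids for one article.

    extra   -- ids that were assembled into the article but do not belong to it
    missing -- ids that belong to the article but were not assembled into it
    """
    assembled = set(assembled_ids)
    expected = set(expected_ids)
    extra = sorted(assembled - expected)
    missing = sorted(expected - assembled)
    return extra, missing


# Illustrative data: one page holding zones from two different articles.
page_zones = [
    Zone(1, 1, "Headline of article A"),
    Zone(2, 1, "Body text of article A"),
    Zone(3, 1, "Headline of article B"),
    Zone(4, 1, "Body text of article B"),
]

# Article A should consist of zones 1 and 2, but suppose the extraction
# application assembled zones 1 and 3 instead.
expected_a = {1, 2}
assembled_a = {1, 3}

extra, missing = compare_zones(assembled_a, expected_a)
print("extra zones:", extra)
print("missing zones:", missing)
```

In this sketch, zone 3 is an “extra” zone for article A because it belongs to article B, and zone 2 is a “missing” zone because it belongs to article A but was omitted.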
Accordingly, a need exists for an automated document processing system that is better able to configure zones, and that is better able to extract articles.