The exemplary embodiment relates to document processing and finds application in connection with the categorization of scanned document pages in cases where document boundaries exist between some of the pages.
To provide electronic access to and storage of documents, paper documents are often scanned in batches and indexed. Document processing service providers often receive large volumes of documents (hundreds of thousands or even millions of pages per day) from customers, either physically or electronically. They assign a document type (doctype) to each document according to a customer-defined taxonomy and may also extract relevant information, such as a customer number or other details. Boundaries between documents may also be detected, often based on the doctype, segmenting the stream of pages into discrete documents. Generally, humans review only a small portion of the pages of the documents, while the rest can be categorized automatically without human intervention. For the service provider, having even a small proportion of the pages manually reviewed adds significantly to the cost.
Traditionally, document segmentation based on categorization has been addressed with techniques such as Markov Random Fields, including Hidden Markov Models (HMM) and Factorial Hidden Markov Models, or Collective Classification, which is related to Markov Random Fields. An HMM can be applied to image data, generally by building feature vectors, or to textual information acquired by optical character recognition (OCR). An HMM can also be applied to both textual and image data, in what is called a hybrid approach, either by applying a single model to both OCR data and image data or by combining the output of a textual model and an image model.
Automated document recognition (ADR) systems have been developed which perform document or page type recognition for scanned document pages. For example, Paolo Frasconi, Giovanni Soda, and Alessandro Vullo, “Text categorization for multi-page documents: A hybrid Naïve Bayes HMM approach,” in ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2001, disclose a method to classify pages of sequential OCR text documents using hidden Markov models. The taxonomy of Frasconi, et al., is defined for pages, not for documents, with classes such as “title-page,” “table-of-content-page,” “index-page,” etc., so that a document consists of pages with different types. The HMM models the most likely sequences of page types to form a consistent document.
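By way of illustration only, the idea of letting an HMM choose the most likely sequence of page types can be sketched with a generic Viterbi decoder. This is not the implementation of Frasconi, et al.; the function name `viterbi`, the three page types, and all probability values below are hypothetical and chosen solely for the example, and every probability is assumed to be non-zero so that its logarithm is defined.

```python
import math

def viterbi(page_scores, start, trans):
    """Return the most likely sequence of page types.

    page_scores: one dict per page, mapping page type -> P(page | type)
    start:       dict mapping page type -> P(first page has this type)
    trans:       dict mapping previous type -> {next type: probability}
    All probabilities are assumed to be non-zero (illustrative values only).
    """
    states = list(start)
    # Log-probability of the best path ending in each state at the first page.
    best = {s: math.log(start[s]) + math.log(page_scores[0][s]) for s in states}
    back = []  # back-pointers, one dict per page after the first
    for scores in page_scores[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor for state s on this page.
            p = max(states, key=lambda q: prev_best[q] + math.log(trans[q][s]))
            best[s] = prev_best[p] + math.log(trans[p][s]) + math.log(scores[s])
            pointers[s] = p
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical model: a title page is usually followed by body pages.
start = {"title": 0.8, "body": 0.1, "index": 0.1}
trans = {
    "title": {"title": 0.05, "body": 0.9, "index": 0.05},
    "body": {"title": 0.05, "body": 0.8, "index": 0.15},
    "index": {"title": 0.3, "body": 0.1, "index": 0.6},
}
# Hypothetical per-page classifier scores for a four-page stream.
page_scores = [
    {"title": 0.7, "body": 0.2, "index": 0.1},
    {"title": 0.2, "body": 0.5, "index": 0.3},
    {"title": 0.45, "body": 0.4, "index": 0.15},  # ambiguous page
    {"title": 0.05, "body": 0.1, "index": 0.85},
]

# Each page classified in isolation: highest score wins.
isolated = [max(scores, key=scores.get) for scores in page_scores]
# Sequence-aware decoding.
decoded = viterbi(page_scores, start, trans)
```

With these illustrative numbers, classifying each page in isolation labels the third page as "title", whereas the sequence-aware decoding labels it "body", because a title page directly after a body page is unlikely under the assumed transition probabilities.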
Standard categorization techniques consider pages in isolation and therefore do not leverage the fact that consecutive pages are very likely to bear the same category. Frequently, pages of a document are labeled with an incorrect doctype, which in turn can cause the automated system to break a document improperly into several documents or to run two unrelated documents together, so that the affected documents then need to be indexed by a human. Grouping the pages of a document is referred to as document segmentation or document reconstruction. One way to segment documents is to physically segment the flow of pages with document separations in the paper flow. When documents are received, slipsheets (or stamps) are added to mark the first page of each document. The separators are machine-recognizable. Thus, when a single page of a document is recognized, the full document (all pages between the two separations) can be assigned to that category. Alternatively, the categorization is applied at the document level (all pages between two separators are categorized as “one” document), which can deliver much better performance than taking each page in isolation. This separation can also be performed on electronic documents. Whether done on the physical pages or electronically, the gains in categorization performance are usually offset by the additional separation costs, such as paper and printing costs; manipulation, insertion, and removal of the slipsheet; or additional storage costs.
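The separator-based approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular system: the marker `SEPARATOR`, the function name `segment_and_label`, and the use of a majority vote when recognized pages within one document disagree are all hypothetical choices made for the example.

```python
from collections import Counter

SEPARATOR = "SEP"  # hypothetical marker for a machine-recognized slipsheet

def segment_and_label(pages):
    """Split a page stream at separators and label each document.

    pages: list with one entry per scanned sheet. An entry is SEPARATOR for
           a slipsheet, a category string for a recognized page, or None for
           a page the classifier could not recognize.
    Returns a list of (category, page_count) tuples, one per document.
    """
    documents, current = [], []
    # A trailing sentinel separator flushes the last document.
    for label in pages + [SEPARATOR]:
        if label == SEPARATOR:
            if current:
                documents.append(current)
            current = []
        else:
            current.append(label)

    result = []
    for doc in documents:
        # Assumption: the most frequent recognized label wins; a document
        # with no recognized page stays unlabeled (None).
        votes = Counter(label for label in doc if label is not None)
        category = votes.most_common(1)[0][0] if votes else None
        result.append((category, len(doc)))
    return result

# Hypothetical stream: a slipsheet precedes the first page of each document.
stream = [SEPARATOR, None, "invoice", None,
          SEPARATOR, "letter", "letter",
          SEPARATOR, None]
labeled = segment_and_label(stream)
# labeled == [("invoice", 3), ("letter", 2), (None, 1)]
```

In this sketch, a single recognized page is enough to assign the category "invoice" to all three pages of the first document, while the last document, which contains no recognized page, remains unlabeled and would still require review.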
One problem with physical segmentation, therefore, is that it is not cost effective in most cases. Adding the separator sheets is manually intensive. A second problem is that many of the documents arrive from the customer in bulk, and document separation information is unavailable. Other techniques include handcrafted rules to establish or reconstruct page sequence information, in an attempt to fill in some of the gaps. In practice, however, these techniques yield only small improvements in recognition and usually introduce many false positives.
There remains a need for a system which automatically identifies document boundaries in bulk collections of digital documents.