1. Technical Field
The present disclosure relates to document classification systems, and, in particular, to a system and method for document image classification based on visual appearance.
2. Description of Related Art
The use of digital input scanners, which can successively scan a set of sheets and record the images thereon as digital data, is common in the office context, such as in digital copiers and electronic archiving. Document categorization is a technique utilized to analyze a scanned document and assign one or more pre-defined category labels to the analyzed document. In this manner, automatic analysis tasks (e.g., indexing, retrieval, sorting, organization) may be tailored to specific document types.
In high volume document scanning scenarios, considerable time and resources are dedicated to visual and/or functional categorization of documents. Typically, a recently-obtained “input image” is compared to a predetermined and preprocessed “reference image” or “training model.” In a practical situation, such as in a digital copier or a network printing and copying system, the reference image must be somehow obtained in advance. In a basic case, such as when a user of a digital copier is scanning in what is known to be a set of slides with a uniform template, the user can indicate to the scanning system through a user interface that the first-scanned page image in the set should serve as the reference image in regard to subsequent page images in the set. A variation of this idea would be to have the user cause the scanning system to enter a “training phase” of operation in which a plurality of sheets believed to have a common “template” are scanned in and analyzed using an algorithm to find objects common to all of the sheets. From this training phase of operation, a basic template of common objects can be derived. This basic template of common objects can be used to determine the reference image data.
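The training phase described above, in which objects common to all of the scanned sheets are collected into a basic template, can be illustrated with a minimal sketch. The object representation (a kind label plus a bounding box), the grid-snapping tolerance, and the names `quantize` and `derive_template` are assumptions made for illustration only; the disclosure does not specify how objects are detected or compared.

```python
from typing import FrozenSet, Iterable, Sequence, Tuple

# Hypothetical representation of a detected object on a sheet:
# (kind, x, y, width, height), with coordinates in pixels.
PageObject = Tuple[str, int, int, int, int]


def quantize(obj: PageObject, grid: int = 10) -> PageObject:
    """Snap a bounding box to a coarse grid so that small scan offsets
    do not prevent two occurrences of the same object from matching."""
    kind, x, y, w, h = obj
    return (kind, x // grid, y // grid, w // grid, h // grid)


def derive_template(
    pages: Sequence[Iterable[PageObject]], grid: int = 10
) -> FrozenSet[PageObject]:
    """Return the quantized objects that appear on every training sheet."""
    page_sets = [frozenset(quantize(o, grid) for o in page) for page in pages]
    if not page_sets:
        return frozenset()
    # Objects common to all sheets form the basic template.
    return frozenset.intersection(*page_sets)
```

In this sketch, an object that appears at roughly the same position on every training sheet (for example, a logo) survives the intersection, while page-specific content is discarded. Grid snapping is a simplification: two boxes that straddle a grid boundary would not match, so a practical system would likely use a more tolerant comparison.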
To make scanned documents searchable, some document classifier engines index electronic documents utilizing Optical Character Recognition (OCR) technology. This technique is typically too slow (e.g., 1-2 pages per second) for high volume document scanning operations, where greater speed (e.g., 20-30 pages per second) is needed. Further, OCR technology is not capable of recognizing graphical features (e.g., logos, shapes, etc.) within an image document or recognizing image documents of the same category that are presented in different language locales. This shortcoming is exposed in various document classification scenarios such as, for example, where images belonging to the same category of document are visually different but are nonetheless labeled the same. During the training phase, most classifier engines combine the computed features of scanned images belonging to the same category to generate training data and/or a training model. This method of generating training data and/or training models contributes to poor accuracy and slow processing during subsequent classification of scanned documents.