Digital images can now be easily exchanged and manipulated for a wide range of purposes, both business and personal. Digital images include both pictorial data and digitized facsimiles of textual documents provided in lieu of hard copies. In response to the wider adoption of these digital equivalents of conventional printed documents, office and personal productivity devices have begun to incorporate digitizers and similar means for directly converting printed content into digital images. Devices such as copiers, scanners, and digital-capable facsimile machines can rapidly generate electronic equivalents of paper documents. However, further processing is generally needed to put the raw converted digital data into a usable form, such as the form needed for word processing or data analysis. The processing required depends upon the type of document being converted and includes, for instance, indexing and retrieval, sorting and organization, and automated analysis tasks. Therefore, digital images must often be classified before any further processing is undertaken.
Post-digitization classification of digital images can be problematic where a high volume of documents is being converted, because manual classification becomes impracticable. Current approaches to image classification include template matching, discriminative models based on high-level feature extraction, ad hoc rule-based systems, and word shape recognition, but each approach has shortcomings. Template matching, for instance, can fail due to slight variations in the input features identified on digital images, such as those caused by translation, skew, scaling, extraneous markings, paper folds, or missing parts.
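By way of illustration only, the sensitivity of template matching to translation can be sketched as follows. The sketch assumes small binary images stored as nested lists and a simple pixel-agreement score; the names `match_score`, `shift_right`, and `TEMPLATE` are hypothetical and do not describe any particular prior system.

```python
def match_score(image, template):
    """Return the fraction of pixel positions where image and template agree."""
    total = 0
    agree = 0
    for img_row, tpl_row in zip(image, template):
        for a, b in zip(img_row, tpl_row):
            total += 1
            if a == b:
                agree += 1
    return agree / total


def shift_right(image, dx):
    """Translate a binary image dx pixels to the right, padding with 0."""
    return [[0] * dx + row[:len(row) - dx] for row in image]


# Hypothetical template: three vertical strokes on a 3 x 6 pixel grid.
TEMPLATE = [
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
]

# A perfectly registered scan matches the template exactly.
aligned_score = match_score(TEMPLATE, TEMPLATE)

# The same content shifted by a single pixel agrees at far fewer positions.
shifted_score = match_score(shift_right(TEMPLATE, 1), TEMPLATE)
```

In this toy case, a one-pixel translation leaves the content unchanged yet reduces the agreement score from 1.0 to roughly 0.17, which is the kind of failure described above.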
High-level feature extraction, in turn, relies on content analysis through optical character recognition (“OCR”) or layout analysis. OCR digitally converts image data into text, which can be semantically analyzed to help classify the document. OCR-assisted text classification works most effectively when the document includes text of sufficient type, quality, and quantity. Moreover, textual data may be insufficient for properly classifying pictorial or form documents, such as income tax returns, which provide scant textual data. Lastly, OCR may not be available in the language of the document.
Layout analysis employs document signatures that serve as category prototypes against which digital images are compared. The prototypes can include features extracted from idealized category examples. Document images are classified according to the closest matching prototype. Layout analysis has narrow applicability due to the significant effort needed to create the prototypes, and variations in feature arrangement can cause misidentification or rejection.
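Classification by closest matching prototype can be sketched, under stated assumptions, as a nearest-neighbor search over feature vectors. The three-element feature vectors, the category names, the Euclidean distance measure, and the `reject_distance` threshold below are all hypothetical choices made for illustration.

```python
import math

# Hypothetical prototypes: one feature vector per document category,
# as might be extracted from an idealized example of each category.
PROTOTYPES = {
    "invoice": [0.8, 0.1, 0.3],
    "letter": [0.2, 0.7, 0.1],
}


def classify(features, prototypes, reject_distance=0.5):
    """Return the label of the closest prototype, or None if none is close enough."""
    best_label = None
    best_dist = math.inf
    for label, proto in prototypes.items():
        dist = math.dist(features, proto)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    # Reject the document when even the best match is too far away.
    if best_dist > reject_distance:
        return None
    return best_label
```

Under these assumptions, `classify([0.75, 0.15, 0.3], PROTOTYPES)` selects the invoice prototype, while a vector far from every prototype, such as `[0.5, 0.4, 0.9]`, is rejected. A document whose features are arranged differently from the idealized example can therefore be rejected or assigned to the wrong category.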
Ad hoc rule-based systems look for user-specified features that characterize different categories of documents. These systems evolve by trial and error and easily fail for document images containing features that fall outside the assumptions inherent to the model. Moreover, adding new document categories requires redefining the discriminative boundaries between features.
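A minimal sketch of such a rule-based classifier follows. The feature names `num_boxes` and `text_lines`, the thresholds, and the category labels are hypothetical and are chosen only to show how a document falling outside the rule assumptions goes unclassified.

```python
# Hypothetical hand-written rules: each pairs a category label with a
# predicate over user-specified features of the document image.
RULES = [
    ("tax_form", lambda f: f["num_boxes"] > 20 and f["text_lines"] < 15),
    ("letter", lambda f: f["num_boxes"] == 0 and f["text_lines"] >= 10),
]


def classify_by_rules(features, rules=RULES):
    """Return the label of the first rule that fires, or None if no rule applies."""
    for label, predicate in rules:
        if predicate(features):
            return label
    return None


# A letter that happens to contain two boxed regions, such as a logo and a
# stamp, violates the "no boxes" assumption and matches no rule at all.
unexpected = classify_by_rules({"num_boxes": 2, "text_lines": 25})
```

Supporting a further category in this sketch would mean revisiting the existing thresholds so that the new rule does not overlap with the old ones, which reflects the redefinition burden noted above.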
Finally, word shape recognition operates on models of document images that have been segmented by a layout analysis system. Parsed word shapes are applied to a discriminative decision tree to identify an appropriate category. However, word shape recognition requires training using extensive samples of word shapes.
Therefore, there is a need for an approach to digital document and image classification that accommodates variability in features without relying upon template matching, heuristic rules, or high-level feature classification, such as OCR.