In the field of intelligent document understanding, one of the basic first steps is to identify what kind of object has been scanned. At a high level, most objects can be classified as either a photograph or a document. A document containing a photograph would also typically be classified as a document as long as there is text somewhere else on the page. It would be advantageous as a first step to separate the documents from the photographs. By doing so, image processing methods and algorithms that are tuned to the image type can be employed to maximize the image quality. Another reason to separate photographs and documents is to enable the use of different compression schemes to optimize storage and transmission time. The appropriate compression scheme differs greatly between the two types: lossy compression can usually be tolerated for photographs, whereas documents are typically compressed using a lossless method to preserve image and text clarity for further use in optical character recognition (OCR).
Methods of document classification typically rely on lexical features of a document. In Chapter 16 of the book entitled “Foundations of statistical natural language processing” (MIT Press, Cambridge, Mass., 1999), authors Manning and Schütze provide a comprehensive review of classification procedures for text documents. The described methods, including decision trees, maximum entropy models, perceptrons, and k-nearest neighbor classification, rely on the analysis of contextual features within the document. Such analysis can be time-consuming and is not applicable to documents that do not contain text.
U.S. Pat. No. 7,920,296 to Beato et al., entitled “Automatic determining image and non-image sides of scanned hardcopy media,” describes a method for separating a photographic image from its non-image side based on spatial frequency characteristics. One method of characterizing the spatial frequency characteristics is by calculating the compression factor of the scanned digital image. It is well known that scanned digital images with high-frequency content (such as photographs) will not compress as well as scanned digital images with large areas of low-frequency content (such as many documents). While this approach could be used in a simple photograph and document classifier, it would not produce robust results. For example, documents with high densities of text or embedded images would be classified incorrectly as photographs.
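As a rough illustration of the compression-factor idea, the following Python sketch computes the ratio of uncompressed to compressed size for a buffer of 8-bit grayscale pixels and applies a single threshold. It is not the method of the cited patent: the use of zlib as the compressor, the function names, the threshold value of 4.0, and the synthetic test buffers are all assumptions chosen for illustration.

```python
import random
import zlib


def compression_factor(pixels: bytes) -> float:
    """Return uncompressed size divided by zlib-compressed size."""
    if not pixels:
        raise ValueError("pixels must be non-empty")
    return len(pixels) / len(zlib.compress(pixels, 6))


def classify_by_compression(pixels: bytes, threshold: float = 4.0) -> str:
    """Label a scan 'document' if it compresses well, else 'photograph'.

    The threshold is a hypothetical value, not one taken from any source.
    """
    if compression_factor(pixels) >= threshold:
        return "document"
    return "photograph"


# Synthetic stand-ins for 8-bit grayscale scans (not real image data).
rng = random.Random(0)
width, height = 256, 256

# Random noise stands in for high-frequency, photograph-like content.
noisy_scan = bytes(rng.randrange(256) for _ in range(width * height))

# A uniformly white buffer stands in for a mostly blank page.
blank_page = bytes([255]) * (width * height)

print(classify_by_compression(noisy_scan))
print(classify_by_compression(blank_page))
```

A single threshold of this kind also shows why the approach is fragile: a page densely covered with text or containing an embedded image compresses less well than a sparse page, so its compression factor moves toward the photograph side of whatever threshold is chosen.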
U.S. Patent Application Publication 2009/0067729 to Turkelson et al., entitled “Automatic document classification using lexical and physical features,” describes a system that uses physical characteristics and lexical information to classify documents (e.g., as receipts or business cards). Examples of physical features that can be used for document classification include colorfulness, orientation, size, margin widths, and horizontal and vertical projections. Lexical characteristics are determined by performing optical character recognition and then performing textual analysis to determine a set of lexical features. A machine learning system is trained to discriminate between known types of documents based on the physical and lexical features.
U.S. Pat. No. 5,953,450 to Kanamori et al., entitled “Image forming apparatus correcting the density of image information according to the type of manuscript,” describes a system that uses density histograms to set white and black points for reproduction. The density histograms are also used to distinguish between a photograph and a text document.
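The general notion of deriving white and black points from a histogram can be sketched as follows. This is only a generic percentile-clipping illustration, not the procedure of the cited patent: it uses 8-bit pixel intensities as a stand-in for density values, and the function names and the 1% clipping fraction are assumptions.

```python
from typing import List, Tuple


def intensity_histogram(pixels: bytes) -> List[int]:
    """Count how many pixels fall at each of the 256 intensity levels."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    return hist


def black_white_points(hist: List[int], clip: float = 0.01) -> Tuple[int, int]:
    """Return (black, white) levels that clip a fraction of pixels at each end.

    The clip fraction is a hypothetical default chosen for illustration.
    """
    target = sum(hist) * clip

    # Walk up from the darkest level until the clipped fraction is reached.
    black, cumulative = 0, 0
    for level in range(256):
        cumulative += hist[level]
        if cumulative >= target:
            black = level
            break

    # Walk down from the brightest level in the same way.
    white, cumulative = 255, 0
    for level in range(255, -1, -1):
        cumulative += hist[level]
        if cumulative >= target:
            white = level
            break

    return black, white


# Synthetic sample: a few dark pixels, many mid-gray pixels, a few bright ones.
sample = bytes([10] * 50 + [128] * 900 + [240] * 50)
black_point, white_point = black_white_points(intensity_histogram(sample))
print(black_point, white_point)
```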
U.S. Pat. No. 7,039,856 to Peairs et al., entitled “Automatic document classification using text and images,” describes a system for automatic document classification based on textual content as well as visual appearance. A new document is automatically stored in one or more directories based on comparing the characteristics of the new document to those of documents that have been previously stored in the directories. This method will typically be slow, since each unknown document must be examined using textual analysis, which can be time consuming.
There remains a need for a robust and efficient method to automatically distinguish between photographs and documents.