Today, due to the increase in the creation and transmission of electronic document images and scanning of paper documents, many document images are maintained in database systems that include retrieval utilities. Consequently, it has become increasingly important to be able to efficiently and reliably determine whether a duplicate of a document submitted for insertion is already present in a database because duplicate documents stored in the database will needlessly consume precious storage space. Determining whether a database contains a duplicate of a document is referred to as document matching.
Image and document retrieval is a well-established field. One goal of image and document retrieval is to convert image information into a form that allows easy browsing, searching, and retrieval. Over the last twenty years, many methods have been developed, ranging from text indexing to document matching using complex object descriptions, e.g., faces, animals, etc. Traditionally, the image analysis that is necessary to extract desired information from an image is performed in the pixel domain. As a consequence, speed and computational complexity become an issue for large images such as scanned documents.
Image and/or document retrieval has a rich and long history. Typically, characteristic image features derived from the original image are combined into a one- or multi-dimensional feature vector. Those feature vectors are then used for measuring similarities between images. The features (or attributes) can be divided into two categories, semantic and visual attributes. Semantic attributes are usually based on optical character recognition (OCR) and language understanding. The visual attributes use pure image information and include features like color histograms. Some methods combine the two and link images to nearby text. A good overview of the area of image retrieval is given in “Image Retrieval: Current Techniques, Promising Directions, and Open-Issues,” by Y. Rui and T. S. Huang, Journal of Visual Communication and Image Representation, vol. 10, pp. 39-62, 1999.
In currently available image-content based retrieval systems, color, texture and shape features are frequently used for document matching. Matching document images that are mostly bitonal and similar in shape and texture poses different problems. One common document matching technique is to analyze the layout of the document and look for structurally similar documents in the database. Unfortunately, this approach requires computationally intensive page analysis. Thus, most retrieval methods operate in the pixel domain.
Because the majority of document images in databases are stored in compressed formats, it is advantageous to perform document matching on compressed files. This eliminates the need for decompression and recompression and makes commercialization more feasible by reducing the amount of memory required. Of course, matching compressed files presents additional challenges. Some work has focused on the compressed domain for G4 images, concentrating on matching G4 compressed fax documents. For CCITT Group 4 compressed files, pass codes have been shown to contain information useful for identifying similar documents. In one prior-art technique, pass codes are extracted from a small text region and used with the Hausdorff distance metric to correctly identify a high percentage of duplicate documents. However, calculation of the Hausdorff distance is computationally intensive. In another G4-based retrieval method, up- and down-endpoints are extracted from the compressed file (groups of text rows) and used to generate a bit profile. The matching process is divided into coarse matching and detailed matching. Feature vectors derived from the bit profile are used for coarse matching. A small segment of the bit profile is used in the detailed matching. For more information, see U.S. Pat. No. 6,363,381, entitled “Compressed Document Matching,” issued to D. S. Lee and J. Hull on Mar. 26, 2002.
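The cost of the Hausdorff distance mentioned above can be illustrated with a minimal sketch. It assumes that each document is reduced to a non-empty set of two-dimensional points (standing in, hypothetically, for pass-code positions) and uses the standard symmetric Hausdorff definition with Euclidean distance; the exact formulation used by the cited technique is not given here. The function names are illustrative only.

```python
from math import hypot

def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(hypot(ax - bx, ay - by) for bx, by in b) for ax, ay in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical point sets for a query document and a stored document.
query = [(0, 0), (4, 0), (4, 3)]
stored = [(0, 0), (4, 0), (4, 4)]
distance = hausdorff(query, stored)
print(distance)  # 1.0
```

Every point of one set is compared with every point of the other, so the work grows with the product of the two set sizes, which is consistent with the observation that this metric is computationally intensive.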
In another prior art technique involving compressed documents, segmentation of documents occurs in the compressed JPEG domain. More specifically, in this technique, a single-resolution bit distribution is extracted from a JPEG encoded image by decoding some of the data to extract the number of bits spent to encode each 8×8 block. Based on this distribution, a segmentation operation is performed to segment the image into text, halftone, contone, and background regions. For more information, see R. L. de Queiroz and R. Eschbach, “Fast Segmentation of the JPEG Compressed Documents,” Journal of Electronic Imaging, vol. 7, no. 2, pp. 367-377, 1998.
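The idea of segmenting from a per-block bit distribution can be sketched as follows. This is only an illustration under stated assumptions: the bit counts are given directly as a grid (one entry per 8×8 block), the threshold `background_max` is a hypothetical value, and only two labels are produced. The cited technique distinguishes text, halftone, contone, and background regions, and its actual decision rules are not reproduced here.

```python
def label_blocks(bits_per_block, background_max=8):
    """Label each 8x8 block by the number of bits spent encoding it.

    Blocks that cost few bits are assumed to be background; all other
    blocks are marked "active". The threshold is a hypothetical value.
    """
    return [
        ["background" if bits <= background_max else "active" for bits in row]
        for row in bits_per_block
    ]

# Hypothetical bit counts for a 2 x 3 grid of blocks.
bit_map = [
    [4, 5, 60],
    [3, 72, 65],
]
labels = label_blocks(bit_map)
for row in labels:
    print(row)
```

The point of the sketch is that the labels are computed from bit counts alone, without reconstructing pixel values.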
In another prior art technique involving feature extraction from the compressed data domain, side information is encoded containing first and second moments of the coefficients in each block. The moments are the only information used for retrieval. For more information, see Z. Xiong and T. S. Huang, “Wavelet-based Texture Features can be Extracted Efficiently from Compressed-Domain for JPEG2000 Coded Images,” Proc. of Int'l Conf. on Image Processing (ICIP) 2002, Sep. 22-25, 2002, Rochester, N.Y.
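As a rough illustration of moment-based side information, the sketch below computes the first and second moments of the coefficients in each block. It assumes raw moments (the mean and the mean of squares) taken over the signed coefficient values; the cited work may define the moments differently, for example over coefficient magnitudes. The block names and values are hypothetical.

```python
def block_moments(coeffs):
    """Return (first moment, second raw moment) of a non-empty block."""
    n = len(coeffs)
    first = sum(coeffs) / n
    second = sum(c * c for c in coeffs) / n
    return first, second

# Hypothetical coefficient blocks.
blocks = {
    "block_0": [2.0, -2.0, 4.0, 0.0],
    "block_1": [1.0, 1.0, 1.0, 1.0],
}
side_info = {name: block_moments(c) for name, c in blocks.items()}
print(side_info)
```

Two numbers per block form a very compact descriptor, which is why retrieval based only on these moments needs no access to the remaining coefficient data.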
In still another prior art technique, features are extracted during decoding of a JPEG 2000 codestream. More specifically, a map resembling an edge map is derived during decoding by localizing significant wavelet coefficients. Note that this technique requires that some decoding of data be performed. For more information, see Jian, J., Guo, B., Li, P., “Extracting Shape Features in JPEG-2000 Compressed Images,” Lecture Notes in Computer Science, vol. 2457, Springer Verlag, Berlin, 2002.
Visual similarity of binary documents is typically described by a one-dimensional feature vector that captures global and local characteristics of the document. The feature vector is then used to measure similarity with other feature vectors by evaluating the inner product of the two vectors. Typically used features include global features, projection features, and local features. Global features include a percentage of content (text, image, graphics, non-text), dominant point size for text, statistics of connected components (count, sum, mean, median, std., height, width, area, perimeter, centroid, density, circularity, aspect ratio, cavities, etc.), a color histogram, the presence of large text, and the presence of tables. Projection features include a percentage of content in rows/columns, column layout, and statistics of connected components (width, height). Local features include dominant content type, statistics of connected components (width, height, etc.), column structure, region-based color histograms, and relative positions of components. These features have only been used in the pixel domain.
For more information on visual similarity of binary documents, see M. Aiello et al., “Document Understanding for a Broad Class of Documents,” 2002; U.S. Pat. No. 5,933,823, entitled “Image Database Browsing and Query using Texture Analysis,” issued to J. Cullen et al., Aug. 3, 1999; and C. K. Shin and D. S. Doermann, “Classification of Document Page Images Based on Visual Similarity of layout structures,” Proc. SPIE, Vol. 3967, Document Recognition and Retrieval VII, pp. 182-190, San Jose, Calif., 2000.
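The inner-product comparison of feature vectors described above can be sketched as follows. The text specifies only an inner product; dividing by the vector lengths (giving a cosine-style score) is an assumption added here so that scores are comparable across documents. The feature values are hypothetical.

```python
from math import sqrt

def similarity(u, v):
    """Normalized inner product of two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    # An all-zero vector carries no information; treat it as dissimilar.
    return dot / norm if norm else 0.0

# Hypothetical features: (fraction of text, fraction of image, column count).
query = [0.8, 0.1, 2.0]
candidate_a = [0.8, 0.1, 2.0]
candidate_b = [0.1, 0.9, 1.0]
print(similarity(query, candidate_a), similarity(query, candidate_b))
```

In this sketch a score close to 1 suggests visually similar documents, and lower scores suggest weaker similarity.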
There are a number of other methods and systems for content-based image retrieval for photographic pictures. The survey paper of Y. Rui and T. S. Huang discussed above gives an overview of the types of features derived from images. Another paper, entitled “Content-Based Image Retrieval Systems: A Survey,” by R. C. Veltkamp and M. Tanase, Technical Report UU-CS-2000-34, Department of Computing Science, Utrecht University, October 2000, gives an overview of complete systems and their features. One widely known system is QBIC by IBM, but there are many more. The methods discussed in these references are based on processing image values and are not performed in the compressed domain. Features typically derived from images include color histogram, geometric histogram, texture, shape, faces, background, spatial relations between objects, indoor/outdoor, and connected components (size, center, vertical and horizontal projections, etc.). Again, these features are only derived from images that are represented in the pixel domain.