The present invention relates to a document analysis system and more particularly to efficient techniques for matching one document to another.
Matching an electronic representation of one image to an electronic representation of another image is useful in many applications. For example, consider an automatic filing application in which document images are stored in directories that contain “similar” documents, where similarity is defined by the degree to which two images have significant areas in common.
There are several available approaches for matching document images. Most approaches can be characterized as consisting of two steps, feature extraction followed by matching of the extracted features to document images in a database. An input image is matched to a database image if they share a significant number of features.
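The two-step structure described above can be illustrated with a minimal sketch. It assumes that features have already been extracted as hashable values and that each database entry is stored as a set of such features; the function name `match_document`, the `database` layout, and the `min_shared` threshold are hypothetical and are not taken from any particular prior art system.

```python
def match_document(input_features, database, min_shared):
    """Return the name of the database image sharing the most features
    with the input, or None if no image shares at least `min_shared`.

    input_features: set of hashable features extracted from the input image
    database:       dict mapping image name -> set of features (assumed layout)
    min_shared:     assumed threshold for a "significant number" of features
    """
    best_name, best_shared = None, 0
    for name, features in database.items():
        shared = len(input_features & features)  # features common to both
        if shared >= min_shared and shared > best_shared:
            best_name, best_shared = name, shared
    return best_name


database = {
    "invoice": {1, 2, 3},
    "memo": {3, 4, 5, 6},
}
print(match_document({3, 4, 5}, database, min_shared=2))
```

In practice the threshold and the feature representation depend on the feature extraction technique chosen.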
The feature extraction technique used is critical to the performance of the matching system. Ideally, feature extraction should be fast and memory-efficient, and should result in a unique representation for the input image. The uniqueness of the representation ensures that a given document image closely matches itself with a high probability and matches no other documents.
Prior art feature extraction techniques used for document image matching operate on, for example, image texture, character transition probabilities, sequences of consecutive word lengths, invariant relationships between graphic elements of a document, and spacings between boxes surrounding connected sets of pixels. What is needed is a document matching system based on feature extraction that improves on the prior art techniques in speed, memory efficiency, and uniqueness of representation.
A fast, memory-efficient, and accurate document image matching system is provided by virtue of the present invention. In certain embodiments, document image matching is based on identifying anchor points of characters in the document. The document matching process includes a feature extraction step where anchor points, e.g., points representing approximate locations of characters, are identified as features for matching.
In a particularly efficient implementation, the anchor points are “pass codes” in a line-by-line compressed representation of a document image. A pass code within a compressed representation of a given line indicates that a run of white or black pixels present substantially above the pass code on a previous line is not found on a current line. CCITT Group III and Group IV facsimile coding standards are examples of compression schemes that make use of pass codes as may be exploited by the present invention.
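The condition under which a pass code arises can be sketched as follows. This is a simplified illustration of the mode decision used in two-dimensional CCITT-style coding, not a codec: it operates on uncompressed rows given as lists of 0 (white) and 1 (black), whereas an implementation working from a facsimile stream would read the pass codes directly from the compressed data. The function names are hypothetical. Following the usual notation, a0 is the current position on the coding line, a1 the next colour change on the coding line, and b1 and b2 the corresponding colour changes on the previous (reference) line; pass mode is selected when b2 lies to the left of a1.

```python
def changing_elements(row):
    """Columns where the pixel colour differs from the pixel to its left.
    An imaginary white pixel (0) is assumed to precede the row."""
    changes, prev = [], 0
    for col, pixel in enumerate(row):
        if pixel != prev:
            changes.append(col)
            prev = pixel
    return changes


def pass_code_positions(ref_row, cur_row):
    """Return the columns (b2) at which pass mode would be selected when
    cur_row is coded against ref_row. Rows must have equal length."""
    width = len(cur_row)
    ref_changes = changing_elements(ref_row)
    cur_changes = changing_elements(cur_row)

    def next_change(changes, pos):
        # First changing element strictly right of pos, or width if none.
        return next((c for c in changes if c > pos), width)

    positions = []
    a0, colour = -1, 0  # imaginary white element before the first pixel
    while a0 < width:
        a1 = next_change(cur_changes, a0)
        # b1: first reference change right of a0 whose colour differs from a0's
        b1 = next((c for c in ref_changes
                   if c > a0 and ref_row[c] != colour), width)
        b2 = next_change(ref_changes, b1)
        if b2 < a1:
            # Pass mode: the reference run b1..b2 has no counterpart below it.
            positions.append(b2)
            a0 = b2
        elif abs(a1 - b1) <= 3:
            # Vertical mode: a1 is coded relative to b1; colour flips.
            a0, colour = a1, colour ^ 1
        else:
            # Horizontal mode: two runs are coded explicitly; colour unchanged.
            a0 = next_change(cur_changes, a1)
    return positions


above = [0, 0, 1, 1, 1, 0, 0, 0]   # black run in columns 2..4
below = [0, 0, 0, 0, 0, 0, 0, 0]   # run is absent on the current line
print(pass_code_positions(above, below))
```

For the rows shown, the black run on the previous line has no counterpart on the current line, so a single pass code is reported at the column where that run ends; two identical rows produce none. Because such events tend to occur where strokes of a character end, their positions can serve as approximate character locations.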
Another feature provided by the present invention is the application of a modified Hausdorff metric to compare a set of anchor points found in an input document image and sets of anchor points previously identified for prospective matching document images. This metric has been found to be efficient to compute and robust to image degradation caused by photocopying.
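The comparison of two anchor point sets can be sketched as below. The text here does not define which modification of the Hausdorff metric is used, so this sketch assumes one common variant in which each directed distance is the mean, rather than the maximum, of the nearest-neighbour distances, and the larger of the two directed distances is taken as the result. The function names and the representation of anchor points as (x, y) tuples are assumptions made for illustration.

```python
import math


def directed_distance(points_a, points_b):
    """Mean distance from each point in points_a to its nearest
    neighbour in points_b (assumed averaging variant)."""
    total = sum(min(math.dist(a, b) for b in points_b) for a in points_a)
    return total / len(points_a)


def modified_hausdorff(points_a, points_b):
    """Symmetric distance between two non-empty sets of (x, y) points:
    the larger of the two directed mean nearest-neighbour distances."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    return max(directed_distance(points_a, points_b),
               directed_distance(points_b, points_a))


input_points = [(0, 0), (10, 0)]
stored_points = [(0, 1), (10, 0)]
print(modified_hausdorff(input_points, stored_points))
```

Averaging makes the result less sensitive to a few spurious or missing points than the classical maximum-based Hausdorff distance, which is one plausible reason such a variant would tolerate photocopy degradation; the database image with the smallest distance to the input would be taken as the best match.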
A pass-code-based implementation has been found to provide fast and accurate matching even when given only one-square-inch patches of images to use for matching. This type of matching system may be easily embodied in a facsimile receiver, where the appropriate compressed representation is already available.