The present invention relates to a document analysis system and, more particularly, to efficient techniques for matching one document to another.
Matching an electronic representation of one image to an electronic representation of another image is useful in many applications. For example, consider an automatic filing application in which document images are stored in directories that contain "similar" documents, where similarity is defined by the degree to which two images have significant areas in common.
There are several available approaches for matching document images. Most approaches consist of two steps: feature extraction, followed by matching of the extracted features against those of document images in a database. An input image is matched to a database image if the two share a significant number of features.
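The two-step scheme described above can be illustrated by the following minimal sketch. It is not drawn from any particular prior art system. It assumes that feature extraction has already reduced each image to a list of hashable feature tokens, and the names `match_score`, `find_matches`, and the `threshold` parameter are hypothetical, introduced only for illustration. The "significant number of features" criterion is modeled here, as an assumption, by the fraction of the input image's features that are also found in a database image.

```python
from collections import Counter


def match_score(query_features, candidate_features):
    """Return the fraction of query features also present in the candidate.

    Features are treated as a multiset, so a feature that occurs twice in the
    query counts twice only if the candidate also contains it at least twice.
    """
    query = Counter(query_features)
    candidate = Counter(candidate_features)
    shared = sum((query & candidate).values())  # multiset intersection size
    total = sum(query.values())
    return shared / total if total else 0.0


def find_matches(query_features, database, threshold=0.5):
    """Return (doc_id, score) pairs whose score meets the threshold.

    `database` maps a document identifier to its list of extracted features.
    The threshold value is an illustrative assumption, not a recommended setting.
    Results are ordered from best match to worst.
    """
    scored = (
        (doc_id, match_score(query_features, features))
        for doc_id, features in database.items()
    )
    matches = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical feature tokens standing in for real extracted features.
    database = {
        "a": ["f1", "f2", "f3"],
        "b": ["f9"],
    }
    print(find_matches(["f1", "f2", "f4"], database))
```

In this sketch, the input image shares two of its three features with document "a" and none with document "b", so only "a" is reported as a match at the default threshold.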
The feature extraction technique used is critical to the performance of the matching system. Ideally, feature extraction should be fast and memory-efficient, and should produce a unique representation of the input image. Uniqueness of the representation ensures that a given document image matches itself with high probability and matches no other document.
Prior art feature extraction techniques used for document image matching operate on, for example, image texture, character transition probabilities, sequences of consecutive word lengths, invariant relationships between graphic elements of a document, and spacings between boxes surrounding connected sets of pixels. What is needed is a document matching system based on feature extraction that improves on the prior art techniques in speed, memory efficiency, and uniqueness of representation.