Comparison of two documents is a common practice. When both documents are stored in character code, so that each character has its own binary representation, the comparison is fairly simple: the binary string of one document is compared to the binary string of the other. However, when one or both documents are in image format (or contain images), the characters are not individually represented; they are expressed in pixels that are indistinguishable from other image content in the document (characters or drawings), and the comparison becomes complicated.
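The character-code case can be illustrated with a minimal sketch. The function name `first_difference` and the choice of UTF-8 encoding are assumptions made for illustration only; the text above does not prescribe a particular encoding or comparison tool.

```python
def first_difference(doc_a: str, doc_b: str, encoding: str = "utf-8"):
    """Return the byte offset of the first mismatch, or None if identical.

    Hypothetical helper: both documents are assumed to be available
    as character-coded text.
    """
    a = doc_a.encode(encoding)
    b = doc_b.encode(encoding)
    # Walk both binary strings in lockstep and stop at the first mismatch.
    for offset, (byte_a, byte_b) in enumerate(zip(a, b)):
        if byte_a != byte_b:
            return offset
    # No mismatch in the shared prefix: one document may simply be longer.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


print(first_difference("contract", "contrast"))  # 6
```

No equivalent shortcut exists for an image of the same text, because the pixels carry no per-character code to compare.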
The two most common procedures for comparing two documents that are not in character code format are: (1) proofreading both documents and verifying that they are similar, a procedure that is both time consuming and prone to errors; and (2) converting the image format to a character code format using OCR (Optical Character Recognition) software and then comparing the two character files with a comparison tool such as those available in many word processing programs. If one or both documents are in paper form, a scanner is typically used to create an electronic image of the document.
This OCR procedure is limited in several ways:
1. The OCR program may not recognize all the characters correctly. The accuracy rate of OCR programs for printed Latin script is considered to be around 97%-99%, and in other scripts the accuracy is lower.
2. OCR programs typically recognize only one language per document and do not distinguish between different colors in the text. The document's layout is typically not preserved, nor are the borders of tables in the text.
3. The OCR procedure is time consuming. Tests have shown that this process may not save any time compared to the manual task of proofreading the two documents, in large part due to the need to review and correct errors resulting from the OCR process.
4. OCR program performance is largely dependent on having an image that is sharp and taken under uniform and controlled illumination conditions. This necessitates using a scanner. When the document is in paper form and a scanner is not available, the OCR process will not work properly.
For the reasons elaborated above, using OCR software as part of a document comparison procedure does not provide a complete and reliable method.
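To put the quoted accuracy range in perspective, the expected number of misrecognized characters can be estimated with simple arithmetic. The sketch below assumes a page of 2,000 characters, a figure chosen only for illustration, and the function name `expected_ocr_errors` is hypothetical.

```python
def expected_ocr_errors(num_chars: int, accuracy: float) -> float:
    """Expected number of misrecognized characters at a given accuracy rate."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return num_chars * (1.0 - accuracy)


CHARS_PER_PAGE = 2000  # assumed page size, for illustration only

for accuracy in (0.97, 0.99):
    errors = expected_ocr_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors:.0f} errors per page")
```

Under that assumption, the 97%-99% range corresponds to roughly 60 down to 20 errors per page, each of which must be reviewed before the comparison result can be trusted.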
Sussmeier et al. (U.S. Pat. No. 7,715,045) describe a method for comparing documents that is based on image comparison instead of text comparison. In the described method, a paper document is scanned by a scanner and the image is compared to a second digital image of a second document on a pixel-by-pixel level to generate a score indicating the degree of similarity between the two images. One document is transformed to resemble the other by deskewing, adjusting, offsetting, and normalizing its brightness to match the other document. The transformations described in that patent are global, i.e. they apply to the complete image, and are limited to digital images created by a scanner that are sharp and taken under uniform and controlled illumination conditions.
These assumptions, i.e. the validity of global geometrical and radiometric (brightness and contrast) parameters, as well as the assumption that the acquired image is sharp and may be compared to a digital image created from a text file, are typically not valid when the imaging device is a handheld camera, and especially when it is a miniature camera such as those typically installed in mobile devices. Furthermore, the described method, which relies on a pixel-by-pixel comparison, will result in many false-positive signals, for the reasons explained below.
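The general idea of a global brightness correction followed by a pixel-by-pixel score can be sketched as follows. This is not the procedure of the cited patent; it is a simplified illustration in which the names `match_brightness` and `similarity_score`, the tolerance value, and the use of a single additive offset are all assumptions. Images are represented as nested lists of grayscale values in the range 0 to 255.

```python
from statistics import fmean


def global_mean(img):
    """Mean gray level over the whole image."""
    return fmean(p for row in img for p in row)


def match_brightness(img, reference):
    """Shift every pixel by ONE global offset so the two means agree."""
    offset = global_mean(reference) - global_mean(img)
    return [[min(255.0, max(0.0, p + offset)) for p in row] for row in img]


def similarity_score(a, b, tolerance=16.0):
    """Fraction of pixel pairs whose absolute difference is within tolerance."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have identical dimensions")
    total = sum(len(row) for row in a)
    close = sum(
        1
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
        if abs(pa - pb) <= tolerance
    )
    return close / total


reference = [[50.0, 200.0], [200.0, 50.0]]
uniform = [[70.0, 220.0], [220.0, 70.0]]   # every pixel 20 levels brighter
gradient = [[50.0, 200.0], [240.0, 90.0]]  # only the bottom row is brighter

print(similarity_score(match_brightness(uniform, reference), reference))
print(similarity_score(match_brightness(gradient, reference), reference))
```

A single global offset fully corrects the uniformly brightened image but cannot correct the image whose illumination varies from row to row, so identical content is still reported as different in the second case.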
Miyake et al. (U.S. Pat. No. 7,076,086) describe an image inspection device that compares an output, i.e. a printed page on a printer's output tray, to a digital source image for the purpose of print quality control.
The method involves a fixed CCD camera that is mounted above the output tray and forms an image of the output printed page. The method performs geometrical and radiometric transformations as well as image resizing and blurring in order to simulate the imaging process and allow easy comparison of the original image and the captured image of the printed paper. A simple Gaussian blurring process is implemented for the simulation of the blurring caused by the camera. Since the camera is fixed in a constant position with respect to the fixed output tray, the geometrical and blurring transformations may be assumed fixed and their parameters may be estimated at the time of design of the system.
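A Gaussian blur with fixed parameters, of the kind used to simulate camera blurring, can be sketched as a separable convolution. The code below is a generic illustration and not the implementation of the cited patent; the function names, the three-sigma kernel radius, and the edge-replication border handling are assumptions.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights, truncated at about three sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    radius = max(1, math.ceil(3 * sigma))
    weights = [
        math.exp(-(i * i) / (2 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img, kernel):
    """Convolve each row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(
                k * row[min(n - 1, max(0, x + j - radius))]
                for j, k in enumerate(kernel)
            )
            for x in range(n)
        ])
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def gaussian_blur(img, sigma):
    """Separable 2-D blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(img, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))
```

Here a single `sigma` is applied to the whole image, which mirrors the fixed-camera situation in which the blur parameters can be estimated once at design time.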
The assumptions made by Miyake et al. do not cover cases where the document is imaged by a low-quality camera under conditions that vary in both the spatial domain and the temporal domain.
It is the intent of the document comparison method described below to deal with geometrical distortions, illumination conditions, and blurring that are not known in advance and that vary from one image to another and also within the image. Furthermore, it deals with general blurring functions and not only with theoretical Gaussian blur functions.
In addition, the method described in this patent for comparing a digital image and a corresponding text file may serve as part of an improved OCR algorithm. The method serves as a feedback mechanism that allows potential errors to be detected and then highlighted or corrected.