The present invention relates to the field of image processing and storage, and more specifically, to comparing an input document to a database of stored documents and finding documents matching the input document.
A document database is a collection of documents, digitally represented. Typically, these documents begin as paper documents which are then digitally captured on scanners or digital photocopiers; however, they may also be non-paper documents, such as the output of a word processing program or a facsimile machine. For each document, which may contain multiple pages and/or a portion of a page, a tag, such as a document number, is provided to uniquely identify the document. A multi-page document might be considered to be multiple images, and a paper document might be considered distinct from the image present on the paper; however, these distinctions are not relevant here, and the terms "document" and "image" are herein used interchangeably to mean an item, digitally represented and discretely identified, in a document database or an item input to a query engine for comparison to documents in the document database. The content of a document can be text, line art, photographic images, computer-generated images, or other information, or combinations of these types of content.
A document may be retrieved by querying the document database for a document number or other unique identifier assigned more or less independently of the contents of the document, but more useful is the ability to query the document database using some feature or features of the content of the document(s) sought. Also, the ability to test an input document against the documents in the database for matches is useful. For these abilities, an indexing system is required. The features, and the documents from the database which "have" those features, are associated in an index, which is either generated ahead of time or generated on the fly from a scan of all the documents in the database, with the former usually the preferred method.
Thus, a feature is used to locate an entry in an index, and that entry indicates the document(s) having that feature. This index is either stored in one place separate from the document database, or is distributed as additional data attached to each document. For example, suppose all the documents are stored merely as blocks of text (no images or formatting), such as a series of ASCII files. In this example, a feature might be a string comprising the first N words of the text block, a count of the number of times a specified character or word appears in the text block, or a count of the total number of characters in the text block.
This index allows for two types of queries, depending on the input to a query engine. In one type of query, feature inputs are provided, and in the other type, an input document having those features is provided. An example of the former is a query where a feature such as total character count is the query input, and the response to such a query is a list of documents having that number of total characters. With the second type of query, a document is input to the query engine and the response to the query is the documents in the document database which match the input document. Of course, where a set of feature inputs can be generated from an input document and an input document can be generated which has the features indicated by the feature inputs, either type of query can be used in either system.
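The text-block example and the two query types can be sketched as follows. This is only an illustrative sketch, not a described implementation: the function names (`extract_features`, `build_index`, `query_by_feature`, `query_by_document`), the choice of N = 3, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def extract_features(text, n=3):
    """Return (feature_name, value) pairs for one block of text."""
    return [
        ("first_words", " ".join(text.split()[:n])),  # first N words
        ("char_count", len(text)),                    # total character count
    ]


def build_index(documents, n=3):
    """Map each feature to the set of document tags having that feature."""
    index = defaultdict(set)
    for tag, text in documents.items():
        for feature in extract_features(text, n):
            index[feature].add(tag)
    return index


def query_by_feature(index, name, value):
    """First query type: the input is a feature value."""
    return sorted(index.get((name, value), set()))


def query_by_document(index, text, n=3):
    """Second query type: the input is a document; its features are
    extracted and the per-feature document lists are intersected."""
    candidate_sets = [index.get(f, set()) for f in extract_features(text, n)]
    return sorted(set.intersection(*candidate_sets))


# Hypothetical sample database: tag -> text block.
documents = {
    "doc-1": "the quick brown fox",
    "doc-2": "the quick brown badger",
    "doc-3": "a slow turtle",
}
index = build_index(documents)
```

Here `query_by_feature(index, "char_count", 19)` plays the role of the total-character-count query, while `query_by_document(index, "the quick brown fox")` derives the same features from an input document before consulting the index.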
One query that is of interest in the above example of a document database is a search for documents with a given passage of text. The objective of this query is to determine whether the given passage of text exists elsewhere in the document database. However, in some environments, the documents are stored as images, not as text. In these cases, text image matching provides an important capability for a document database query system.
A text image matching system is useful in a legal office to locate previous revisions in a document database of a given input document even if edits have been made. Another example of the usefulness of text image matching is in a classified document copier, where sensitive materials are copied and digital images of copies made are retained in a document database. Given a classified document, it is sometimes necessary to determine whether a document was copied, when the document was copied and which other documents were copied at approximately the same time.
An obvious implementation of a document database in which an input image is matched to images in a document database is to apply optical character recognition (OCR) to each document in the document database and store the resulting text. To query the database, an input document is also converted to text using OCR, and the resulting text is matched against the text in the document database using known methods of text matching. However, this has the disadvantage that an OCR process must be applied to all the text in the database documents and the resulting ASCII data stored for subsequent searches. The OCR must also be tolerant of the range of image distortions that occur in practice. If the full image of the matched document is to be retrieved, the document database cannot consist of the ASCII data alone; the ASCII data must be stored in addition to the text images. Furthermore, this method does not extend well to non-text images or to images which combine text and graphics, where a search might look for a document by matching both text and graphics. The above method is also not very tolerant of noise, such as where the document database comprises documents entered by scanning and OCR.
If storage space and processing power are at a premium, an alternative solution is a system which matches an input document image directly against the image data in the database. This bypasses the need for OCR and reduces the associated storage requirement. Any necessary invariance to image distortions is modeled directly at the image level.
Various solutions have been proposed for matching queries to database entries when both are images, but none have been found to be completely acceptable. For example, in a top-down structural search method, an image is reduced to a listing of the objects shown in the image and the spatial relationships between the objects. This listing is used as an iconic index for the image. An iconic index is generated and stored for each of the documents in the document database, and to perform a query, the iconic index for the input document is generated and compared to the stored iconic indices. For further discussion of this, see S. K. Chang, Q. Shi, and C. Yan, "Iconic Indexing by 2-D Strings", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 3, May 1987.
Several versions of an iconic indexing system exist using what is known as two-dimensional strings. See, for example, C. C. Chang and S. Y. Lee, "Retrieval of Similar Pictures on Pictorial Databases", Pattern Recognition 24, 7 (1991) 675-80, and G. Costagliola, G. Tortora and T. Arndt, "A Unifying Approach to Iconic Indexing for 2-D and 3-D Scenes," IEEE Transactions on Knowledge and Data Engineering 4, 3 (June, 1992) 205-22.
In such a system, the geometric relationships between objects in the image are represented by strings. A query then uses a string matching algorithm to locate images in a database that match a query. However, the success of such a system relies on, among other things, accurate pattern recognition to determine correctly what objects are present in an image.
Hashing has been used to speed up matching in a two-dimensional string query system. With hashing, each image in a document database is represented in a document index by sets of ordered triples, and an input document of a query is represented by a set of ordered triples. Each triple contains the identity of two objects in the image and the spatial relation (one of nine direction codes) between them, and an index entry for that triple points to those database images in the document database which contain an instance of that triple. The images in the document database that match a query are determined by identifying those triples present in the input document, collecting lists of images for each triple and intersecting the lists. A query is satisfied if the intersection is not empty. While this top-down strategy is useful as a fast adaptation of the two-dimensional string approach, it is sensitive to errors in segmentation of an image into objects and pattern recognition used to identify those objects. In fact, a single error in either process (segmentation or recognition) may cause a query to fail. Unfortunately, it is precisely this sensitivity to noise that must be overcome to guarantee reliable performance.
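The triple-based lookup described above can be illustrated with a short sketch. It is written under stated assumptions and is not taken from the cited systems: the object labels, the integer direction codes, and the names `build_triple_index` and `query_triples` are hypothetical.

```python
from collections import defaultdict


def build_triple_index(images):
    """Map each (object_a, object_b, direction_code) triple to the set of
    database images that contain an instance of that triple."""
    index = defaultdict(set)
    for image_id, triples in images.items():
        for triple in triples:
            index[triple].add(image_id)
    return index


def query_triples(index, triples):
    """Collect the image list for each query triple and intersect the lists.
    The query is satisfied only if the intersection is not empty."""
    result = None
    for triple in triples:
        ids = index.get(triple, set())
        result = set(ids) if result is None else result & ids
        if not result:
            break  # one missing triple already empties the intersection
    return result or set()


# Hypothetical database; direction codes are arbitrary integers in 0..8.
images = {
    "img-1": {("house", "tree", 2), ("tree", "car", 5), ("house", "car", 4)},
    "img-2": {("house", "tree", 2), ("tree", "dog", 7)},
}
triple_index = build_triple_index(images)

good_query = {("house", "tree", 2), ("tree", "car", 5)}
# Same scene, but one object was misrecognized ("bush" instead of "tree").
bad_query = {("house", "tree", 2), ("bush", "car", 5)}
```

With this data, `good_query` narrows the result to `img-1`, whereas `bad_query` returns an empty set: a single recognition error produces a triple that is absent from the index, which empties the intersection and makes the whole query fail.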
A bottom-up, featural information approach has been used to overcome some of the disadvantages of the top-down method. In a technique known as geometric hashing, "interesting points" extracted from a query image are matched to interesting points extracted from images in a document database. "Interesting points" are defined as points located by an application-specific operator operating upon the image. For example, in an application where an aerial photograph is matched to a database of aerial photographs of known locations, and where the query photograph might not exactly match its counterpart in the database, the operator would locate small areas of high gray level variance. The assumption with this application is that different versions of the same image will yield the same interesting points even though the versions may differ due to distortion caused by noise. For further discussion of geometric hashing, see Y. Lamdan and H. J. Wolfson, "Geometric Hashing: A General and Efficient Model-Based Recognition Scheme", Second International Conference on Computer Vision, 1988, pp. 238-249.
In a bottom-up query, a query image and a queried image from the document database are compared to each other by comparing interesting points. To correct for translation, rotation, and scaling, the interesting points of both images are normalized before comparing, where the normalization is a transformation which transforms a selected pair of points to the unit points (0,0) and (1,0). Other distortions can be accounted for by using more than two points in the normalization. After normalization, the normalized interesting points from two images are matched to each other. The two images are "equivalent" if an acceptable number of points are in one-to-one correspondence after some pair of normalizations.
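A minimal sketch of this normalization and comparison follows. It assumes the two-point case (translation, rotation, and scaling only) and represents each point as a complex number so that the transformation taking a selected pair to (0,0) and (1,0) is a single subtraction and division. The names `normalize`, `count_matches`, and `equivalent`, the tolerance, and the sample points are assumptions for illustration.

```python
from itertools import permutations


def normalize(points, i, j):
    """Transform all points so that points[i] -> (0,0) and points[j] -> (1,0)."""
    p, q = complex(*points[i]), complex(*points[j])
    if p == q:
        raise ValueError("basis points must be distinct")
    return [(complex(x, y) - p) / (q - p) for x, y in points]


def count_matches(norm_a, norm_b, tol=1e-9):
    """Greedy one-to-one count of points that coincide within tol."""
    unused = list(norm_b)
    matches = 0
    for z in norm_a:
        for k, w in enumerate(unused):
            if abs(z - w) <= tol:
                matches += 1
                del unused[k]
                break
    return matches


def equivalent(points_a, points_b, min_matches, tol=1e-9):
    """True if some pair of normalizations puts at least min_matches points
    in correspondence. The two basis points always match each other, so
    min_matches should be greater than 2 to be meaningful."""
    for i, j in permutations(range(len(points_a)), 2):
        norm_a = normalize(points_a, i, j)
        for k, m in permutations(range(len(points_b)), 2):
            if count_matches(norm_a, normalize(points_b, k, m), tol) >= min_matches:
                return True
    return False


points_a = [(0, 0), (2, 0), (2, 1), (0, 3)]
# points_a rotated by 90 degrees, scaled by 2, and translated by (5, 5).
points_b = [(5, 5), (5, 9), (3, 9), (-1, 5)]
# An unrelated point set.
points_c = [(0, 0), (1, 0), (10, 20), (30, -5)]
```

In this sketch `equivalent(points_a, points_b, 4)` holds because normalizing both sets on their first two points removes the rotation, scale, and offset, while `points_c` never lines up with `points_a` under any choice of basis pairs. The exhaustive double loop over basis pairs is deliberate; it is the cost that the hashing variant discussed next tries to move out of query time.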
Hashing has also been used to speed up the bottom-up query process by pre-computing all the normalized interesting points of each database image. The normalized coordinates of each point as well as the identity of the image and the parameters of the transformation are stored in a hash table. A query image is tested against the document database by computing a normalized version from each pair of its points. The coordinates of all the normalized points are then used to access the hash table. As each database document's hash table entry is compared, votes are accumulated for database images that have points in the same normalization position as a normalized point in the query image. A match is deemed to occur between a query image and a database image if enough votes are accumulated for one of its normalized versions.
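The precomputation and voting steps can be sketched as below. This is a simplified illustration under several assumptions rather than the cited authors' implementation: normalized coordinates are quantized into square bins of a hypothetical width `step`, each hash table entry records the image identity and the indices of the basis pair, and the names `build_table` and `best_votes` are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def normalize(points, i, j):
    """Transform all points so that points[i] -> (0,0) and points[j] -> (1,0)."""
    p, q = complex(*points[i]), complex(*points[j])
    if p == q:
        raise ValueError("basis points must be distinct")
    return [(complex(x, y) - p) / (q - p) for x, y in points]


def bin_of(z, step):
    """Quantize a normalized point to a hash key."""
    return (round(z.real / step), round(z.imag / step))


def build_table(images, step=0.05):
    """Precompute every normalized version of every database image."""
    table = defaultdict(set)
    for image_id, points in images.items():
        for i, j in permutations(range(len(points)), 2):
            for z in normalize(points, i, j):
                table[bin_of(z, step)].add((image_id, i, j))
    return table


def best_votes(table, query_points, step=0.05):
    """For each database image, return the highest vote count that any of
    its normalized versions received from any normalization of the query."""
    best = Counter()
    for i, j in permutations(range(len(query_points)), 2):
        votes = Counter()
        for z in normalize(query_points, i, j):
            for entry in table.get(bin_of(z, step), ()):
                votes[entry] += 1
        for (image_id, _, _), count in votes.items():
            best[image_id] = max(best[image_id], count)
    return best


images = {
    "a": [(0, 0), (2, 0), (2, 1), (0, 3)],
    "c": [(0, 0), (1, 0), (10, 20), (30, -5)],
}
table = build_table(images)
# Image "a" rotated by 90 degrees, scaled by 2, and translated by (5, 5).
query_points = [(5, 5), (5, 9), (3, 9), (-1, 5)]
votes = best_votes(table, query_points)
```

A match would then be declared for any image whose count in `votes` reaches a chosen threshold. Note that every normalized version trivially collects two votes from the basis points themselves, so a useful threshold has to be higher than 2.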
This bottom-up process effectively uses low-level featural information as a substitute for the high-level recognition results used by the top-down method. However, the use of only single isolated points (the "interesting" points) in the matching process ignores the contextual information available from surrounding points which provide information about the relative positions of pluralities of features. This leads to a brute-force algorithm with a run-time complexity O(N^3) (i.e., on the order of N^3) for N interesting points in a query image, since each normalization might need to be tested.
From the above it is seen that an improved system for querying a document database with an input document is needed, where the documents in the database are allowed to include text and graphics, or where distortion and/or quantization noise are present in the documents in the database or the input document to preclude an exact character-by-character or pixel-by-pixel match.