1. Field of the Invention
The present invention addresses the problem of image-based indexing and classification in image databases. More particularly, the present invention addresses the problem of indexing and classifying images, e.g., signatures, logos, or stamps, and of word spotting, i.e., word identification within an image, for search, analysis, and retrieval in a document collection.
2. Discussion of the Related Art
Among the drawbacks of the known art of digital image databases, the identification of documents based on image similarity requires a combination of complex, time-consuming similarity measures. Further, the number of required similarity comparisons is proportional to the square of the number of documents in the database. Also, the known art does not address image-based classification of documents. As a result, known techniques for the recognition of images are limited to datasets of several thousand documents.
Thus, given a large collection of documents, such as those commonly associated with legal investigations, and the task of retrieving those documents that contain homolog images, such as signatures of a specific person, a certain logo or stamp, or a certain handwritten word, the task is not feasible with state-of-the-art techniques. This is due to the complexity of image-based similarity measures and the large number of comparisons that must be performed.
For example, in signature-based document classification, documents must be characterized as signed by a certain individual; i.e., the application must classify documents according to their signatures and index the documents accordingly. However, a signature varies each time an individual signs a document. Because the exact image content of each homolog signature is unknown due to this natural variation, current techniques must measure the similarity of each signature to all others in the database. This leads to a large number of comparisons, namely O(N^2); i.e., the number of comparison operations to be performed is proportional to the square of the number of documents, N.
Given that the number of documents, N, in the database of, for example, a legal investigation may easily be in the millions, known techniques require a number of comparisons that is too computationally expensive, i.e., too time-consuming, to be of any practical value. Thus, image comparison databases are currently limited to only a few thousand documents.
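To make the quadratic growth concrete, the following minimal sketch counts the comparisons an exhaustive all-pairs scheme would perform. It assumes, for illustration only, that each document contributes one image and that each unordered pair is compared exactly once, giving N(N - 1)/2 comparisons. The function names and the example values of N are hypothetical and are not taken from any particular known system.

```python
from itertools import combinations


def pairwise_comparisons(n: int) -> int:
    """Closed form for the number of unordered pairs among n items."""
    return n * (n - 1) // 2


def brute_force_count(items) -> int:
    """Count the pairs an exhaustive all-pairs scan would actually visit."""
    return sum(1 for _ in combinations(items, 2))


if __name__ == "__main__":
    # Assumed sizes: "a few thousand" versus "in the millions" of documents.
    for n in (5_000, 1_000_000):
        print(f"N = {n:>9,}: {pairwise_comparisons(n):,} comparisons")
```

Under these assumptions, a 200-fold increase in the number of documents (from 5,000 to 1,000,000) raises the number of comparisons by a factor of roughly 40,000, which illustrates why exhaustive comparison does not scale to large collections.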
There is therefore a need for effective image-based digital database management of a realistically large number of documents, and especially a need to speed up the similarity determination and classification process.