1. Field of the Invention
The present invention relates generally to a method and apparatus for computing a measure of similarity between two documents, and more particularly, to a method and apparatus for computing a measure of similarity using lists of document keywords.
2. Description of Related Art
Generally, hardcopy documents continue to be used as a medium for exchanging human readable information. However, existing electronic document processing systems, on which electronic documents are generated and later transformed to hardcopy documents using printers or the like, have created a need to recover an electronic representation of a hardcopy document.
The need to recover electronic representations of hardcopy documents arises for reasons of efficiency and quality. Generally, a document in electronic form can be used to produce hardcopy reproductions of greater quality than reproductions made from another hardcopy. Also, when revising a document, it is generally more efficient to start from its electronic form than from its scanned and OCRed counterpart.
U.S. Pat. No. 5,486,686, entitled “Hardcopy lossless data storage and communications for electronic document processing systems”, which is incorporated herein by reference, provides one solution to this problem by allowing hardcopy documents to record thereon machine readable electronic domain definitions of part or all of the electronic descriptions of hardcopy documents and/or of part or all of the transforms that are performed to produce or reproduce such hardcopy documents.
Another solution is disclosed in U.S. Pat. No. 5,893,908, entitled “Document management system”, which provides automatic archiving of documents along with a descriptor of the stored document to facilitate its retrieval. The system includes a digital copier alert that provides an alert when an electronic representation of a hardcopy document sought to be copied is identified. Further, the document management system automatically develops queries based on a page or icon that can then be used to search archived documents.
However, these and other known solutions lack flexibility because they require either a hardcopy document that includes machine readable instructions or pre-processed feature information associated with electronic documents. Accordingly, it would be desirable to provide a system that is adapted to locate electronic representations of hardcopy documents independent of machine readable information and pre-processed descriptions. Such a system would advantageously operate using either hardcopy or electronic forms of documents as input.