The present invention relates to the field of electronic document management, and more specifically to document management systems in which a target document is retrieved using an example of the content of the target document.
U.S. Pat. No. 5,464,353, issued to Jonathan Hull, et al. (application Ser. No. 08/222,281 filed Apr. 1, 1994 and currently pending) entitled "Image Matching and Retrieval by Multi-Access Redundant Hashing" (incorporated by reference herein and referred to as "Hull") disclosed a new method for retrieving a document from a document management system where the input to the system is a sample page from the target document. In that system, descriptors are extracted from each document being stored and those descriptors are stored in a descriptor database. To retrieve a target document, only a sample page or a portion of a page is needed. The sample page is presented to the document management system, descriptors are extracted from the sample page, and those descriptors are then matched against the descriptors in the descriptor database. Since many descriptors are taken from each stored document and from the sample page, they are redundant. As explained by Hull, because of this redundancy many descriptors might match between the target document and the sample page, so errors in some descriptors are not fatal to the search. In that system, documents accumulate votes based on matches of descriptors, and the document with the highest vote count is returned as the target document.
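The vote accumulation described above can be sketched as follows. This is only an illustrative sketch, not the implementation disclosed by Hull: descriptors are treated as opaque hashable values, and the names `DescriptorDatabase`, `add_document`, and `retrieve`, as well as the choice to count each distinct sample descriptor once, are assumptions made for the example.

```python
from collections import Counter, defaultdict


class DescriptorDatabase:
    """Toy descriptor database (hypothetical): maps each descriptor
    to the set of stored documents that contain it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add_document(self, doc_id, descriptors):
        # Store every descriptor extracted from the document.
        for d in descriptors:
            self._index[d].add(doc_id)

    def retrieve(self, sample_descriptors):
        # Each distinct sample descriptor casts one vote for every
        # stored document that contains it (an assumption of this sketch).
        votes = Counter()
        for d in set(sample_descriptors):
            for doc_id in self._index.get(d, ()):
                votes[doc_id] += 1
        if not votes:
            return None, votes
        best, _ = votes.most_common(1)[0]
        return best, votes


db = DescriptorDatabase()
db.add_document("doc_a", ["d1", "d2", "d3", "d4"])
db.add_document("doc_b", ["d3", "d5", "d6"])

# "dX" stands for a descriptor corrupted by an extraction error;
# it matches nothing, yet the remaining matches still identify doc_a.
best, votes = db.retrieve(["d1", "d2", "dX", "d3"])
```

In this usage example one of the four sample descriptors is wrong, but the redundant descriptors that remain still give the target document the highest vote count.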
Of the descriptors disclosed by Hull, graphical descriptors look to key features of the graphics on a page, whereas text descriptors look to the pattern of letters or word lengths. However, to form the descriptors for a page image, the document management system of Hull uses an optical character recognition system to recognize characters from a digitized image of a page of a document or of a sample page. Since character recognition is a computationally expensive operation, a more efficient method for generating descriptors from text is needed.
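For illustration, a text descriptor based on the pattern of word lengths might be formed as in the sketch below. The function name `word_length_descriptors` and the window of three consecutive words are assumptions made for this example and are not taken from Hull. The sketch also assumes the text is already available as a character string, which in Hull's system is the output of the character recognition step.

```python
def word_length_descriptors(text, n=3):
    """Return one descriptor per run of n consecutive words,
    each descriptor being the tuple of those words' lengths.
    (Hypothetical helper; n=3 is an arbitrary choice.)"""
    lengths = [len(word) for word in text.split()]
    # Slide a window of n word lengths across the text.
    return [tuple(lengths[i:i + n]) for i in range(len(lengths) - n + 1)]


descriptors = word_length_descriptors("the quick brown fox jumps")
# descriptors == [(3, 5, 5), (5, 5, 3), (5, 3, 5)]
```

Because the windows overlap, each word contributes to several descriptors, which gives the kind of redundancy that the voting scheme relies on.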