Electronic document management is a challenging task for organizations large and small. Many thousands of hours and millions of dollars are wasted searching for misplaced electronic documents and recreating documents when a user is unable to locate the original. In some cases, the user may possess a physical or other non-native copy of the document but be unable to locate the original electronic document, which may be stored somewhere on a network drive or in a data repository, e.g., an enterprise content management (ECM) repository. The user may recreate the document, but even with high-quality reconstruction, the reconstructed document may not be identical to the original electronic document.
The user may attempt to find the electronic document by searching the network drive or data repository for strings from the document text. For example, the user may scan the hardcopy and apply Optical Character Recognition (OCR) software to extract text that can be compared against the contents of the network drive or ECM repository to find a match. However, simple text searches may not always be sufficient. For example, if the document lacks text or if the text is not well formed, a search cannot be performed, because the OCR software is unable to recognize non-text objects and therefore yields no usable search strings. As another example, if the document contains only very common words, the search may return far too many results.
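The limitations of a simple text search can be illustrated with a minimal sketch. It assumes the OCR output is already available as a string and models the repository as an in-memory mapping of file names to text. The function name `find_candidates`, the sample file names, and the sample contents are hypothetical and chosen only for illustration. Punctuation handling is omitted.

```python
def find_candidates(ocr_text, repository):
    """Return names of stored documents that contain every word of ocr_text."""
    query_words = set(ocr_text.lower().split())
    if not query_words:
        # OCR recognized no text (e.g., an image-only page): nothing to search for.
        return []
    return sorted(
        name
        for name, body in repository.items()
        if query_words <= set(body.lower().split())
    )


# Hypothetical repository contents, for illustration only.
repository = {
    "contract_2019.txt": "master services agreement between acme and globex",
    "memo.txt": "please review the attached report before the meeting",
    "notes.txt": "the report from the meeting will follow",
}

distinctive = find_candidates("agreement globex", repository)
common = find_candidates("report meeting", repository)
no_text = find_candidates("", repository)
```

In this sketch, the distinctive query matches a single document, the query made of common words matches more than one document, and the empty query (a document with no recognizable text) matches nothing.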