The present disclosure relates generally to enhancing the performance of duplicate identification for users of an enterprise document and content management system, where efficiency, scalability, and security are highly important.
Generally, hardcopy documents continue to be used as a medium for exchanging human-readable information. However, existing electronic document processing systems, on which electronic documents are generated and later transformed to hardcopy documents using printers or the like, have created a need to recover an electronic representation of a hardcopy document.
The need to recover electronic representations of hardcopy documents arises for reasons of efficiency and quality. Generally, a document in electronic form can be used to produce hardcopy reproductions of greater quality than reproductions made from one of the hardcopies. Also, it is generally more efficient when revising a document to start from its electronic form than from its scanned and OCRed counterpart.
U.S. Pat. No. 5,486,686, entitled “Hardcopy lossless data storage and communications for electronic document processing systems”, which is incorporated herein by reference, provides one solution to this problem by allowing hardcopy documents to record thereon machine readable electronic domain definitions of part or all of the electronic descriptions of hardcopy documents and/or of part or all of the transforms that are performed to produce or reproduce such hardcopy documents.
Another solution is disclosed in U.S. Pat. No. 5,893,908, entitled “Document management system”, which provides automatic archiving of documents along with a descriptor of the stored document to facilitate its retrieval. The system includes a digital copier alert feature that notifies the user when an electronic representation of a hardcopy document sought to be copied is identified. Further, the document management system automatically develops queries based on a page or icon that can then be used to search archived documents.
Identification of duplicate documents by their content is becoming increasingly important in enterprise environments. Reasons duplicate identification matters in enterprise environments include 1) ensuring data consistency (i.e., everyone works off the same document); 2) removing clutter; 3) saving data storage space; and 4) protecting a company from unnecessary liability and complying with regulations.
Currently available tools fail to meet users' expectations in terms of performance. Users perceive duplicate identification as a “search” operation and expect a comparable level of responsiveness. In enterprise environments where access control is enforced, users typically expect search results to be presented within a few seconds.
Currently available tools rely on performing content comparison in real time while the user waits. Content comparison requires that a set of possible matching documents be fully retrieved by content, and each compared to the document for which duplicates are being identified. Since content comparison is a computationally expensive operation, the response time of such a real-time comparison can be at least an order of magnitude longer than a typical search response time for the same amount of data. Even worse, since the set of possible matching documents increases with the size of a repository, the difference in response time grows in exponential proportion as the size of the repository increases.
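The scaling problem described above can be illustrated with a minimal sketch (not taken from any cited patent; all names here are hypothetical): a naive real-time duplicate check must fully retrieve and compare the content of every candidate document while the user waits, so per-query cost grows with the repository.

```python
# Illustrative sketch of naive real-time duplicate identification.
# Hypothetical names; the repository is modeled as a dict of
# document ID -> full text content.

def naive_find_duplicates(query_text, repository):
    """Return IDs of documents whose full content matches query_text.

    Cost is O(candidates x document length): every candidate must be
    fully retrieved and compared at query time, so response time grows
    with repository size.
    """
    matches = []
    for doc_id, content in repository.items():
        # Full content comparison: computationally expensive per candidate.
        if content == query_text:
            matches.append(doc_id)
    return matches

repo = {
    "a.txt": "quarterly report draft",
    "b.txt": "meeting notes",
    "c.txt": "quarterly report draft",
}
print(naive_find_duplicates("quarterly report draft", repo))  # ['a.txt', 'c.txt']
```

Each query touches every candidate's full content, which is the behavior that makes real-time comparison at least an order of magnitude slower than an indexed search over the same data.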
Patent application Ser. No. 10/605,631 to Franciosa et al. partially addresses the performance issue by making the content comparison operation independent of file sizes. It asserts that, with little or no effect on match accuracy, only a set number of words need be compared between the original and the suspect duplicate document. Comparing only a set number of words addresses a small portion of the response time, but the need to compare partial content of all suspect documents remains, and thus the exponentially growing performance problem remains.
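The fixed-word-count idea can be sketched as follows. This is a hedged illustration in the spirit of the approach described above, not the actual method of Ser. No. 10/605,631; the word limit and function names are assumptions. Per-document comparison cost becomes independent of file size, but every suspect document must still be partially compared.

```python
# Illustrative sketch: compare only a fixed number of words per document,
# so comparison cost no longer depends on file size. WORD_LIMIT is an
# assumed cap; the actual number used in the cited application may differ.

WORD_LIMIT = 50

def truncated_match(doc_a, doc_b, limit=WORD_LIMIT):
    """Compare only the first `limit` words of each document."""
    return doc_a.split()[:limit] == doc_b.split()[:limit]

# Two long identical documents: the comparison inspects at most
# WORD_LIMIT words regardless of document length.
long_doc = " ".join(["word"] * 10_000)
copy_doc = " ".join(["word"] * 10_000)
print(truncated_match(long_doc, copy_doc))  # True
```

Note that this bounds the cost per comparison, but the number of comparisons still equals the number of suspect documents, which is why the repository-size scaling problem persists.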
As with many typical software performance problems, absolute performance can be enhanced by increasing the performance of the hardware. However, the relative performance difference between a search and a real-time duplicate identification operation still exists. Moreover, the cost of the overall system increases with faster-performing and more expensive hardware.
Identifying duplicate documents is becoming important in response to regulations that increasingly dictate that companies take complete responsibility for their data. In cases such as the Health Insurance Portability and Accountability Act (HIPAA) and records management for Sarbanes-Oxley compliance, companies need to identify and lock down all applicable content. Identification of duplicate documents therefore becomes essential, since it makes no sense to lock down data in one place while the same data remains available in other places. Equally important, to protect an organization against unnecessary liability, data needs to be deleted after the regulated retention period. Furthermore, in enterprise environments where individuals collaborate to create data yet independently save it, duplicate data confuses users and creates clutter that can negatively impact the user experience. Storing duplicate data also creates an unnecessary burden on storage and increases the organization's operating costs.