This disclosure and the exemplary embodiments provided herein relate to a document processing method and system to support document classification and/or clustering while maintaining privacy of information included in the document(s).
According to an embodiment disclosed herein, the exemplary method identifies recurring paper-based tasks by storing and analyzing print logs, estimates the impact of each task in terms of consumable usage, such as paper volume and/or power consumption, and identifies constraints that explain the reasons for printing, allowing identification of the barriers that prevent moving these tasks from paper to digital form. The exemplary method performs these document content analytics while maintaining privacy of information included in the analyzed printed documents/papers, thereby enabling a third party to complete the document content analytics assessment.
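As a purely illustrative sketch of the impact-estimation step, the following Python fragment sums the paper volume attributable to each task from print-log records. The record fields (`task`, `pages`, `copies`, `duplex`) and the function name `paper_volume_by_task` are assumptions made for this example and are not taken from the disclosure; it further assumes that each print job has already been associated with a task.

```python
from collections import defaultdict


def paper_volume_by_task(print_log):
    """Sum the number of printed sheets per task.

    Each record is a dict with hypothetical fields:
    'task' (str), 'pages' (int), 'copies' (int), 'duplex' (bool).
    """
    totals = defaultdict(int)
    for job in print_log:
        # A duplex job uses one sheet per two pages, rounded up for each copy.
        if job["duplex"]:
            sheets_per_copy = (job["pages"] + 1) // 2
        else:
            sheets_per_copy = job["pages"]
        totals[job["task"]] += sheets_per_copy * job["copies"]
    return dict(totals)


# Hypothetical print-log excerpt used only to exercise the function.
example_log = [
    {"task": "expense report", "pages": 3, "copies": 2, "duplex": True},
    {"task": "expense report", "pages": 4, "copies": 1, "duplex": False},
    {"task": "meeting agenda", "pages": 1, "copies": 10, "duplex": False},
]
volumes = paper_volume_by_task(example_log)
```

Ranking the resulting totals would indicate which recurring tasks account for the most paper; power consumption could be accumulated in the same manner if the logs record it.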
In current practice, paper document content analytics is done in a completely manual fashion, through surveys and interviews, directly with the customers and their employees. In U.S. Patent Publication No. 2014/0247461, published Sep. 4, 2014, by Willamowski et al. and entitled “SYSTEM AND METHOD FOR HIGHLIGHTING BARRIERS TO REDUCING PAPER USAGE”, a method to partially automate this process using machine learning techniques is disclosed. This method enables automatic analysis of printed documents' content to cluster and classify the documents and requires manually labelled documents for training. Two issues arise in the context of manual document labelling: privacy on one hand and obtaining a sufficient set of consistently labelled documents on the other hand. Privacy is also a concern for customers with respect to the automatic document content analysis step: indeed, customers do not want to disclose their document content to third parties, which in turn prevents resorting to external services for the automatic document content analysis.
The privacy issue with respect to manual labelling is the following: to correctly label a document, the labelling person needs to be able to access, visualize and understand the document and its content. To avoid any issue, in the method proposed in U.S. Patent Publication No. 2014/0247461, published Sep. 4, 2014, by Willamowski et al. and entitled “SYSTEM AND METHOD FOR HIGHLIGHTING BARRIERS TO REDUCING PAPER USAGE”, the document owners themselves label the documents. The privacy issue arises if the labelling is delegated to a person other than the document owner. However, employing a single, possibly external, subject matter expert to do the labelling would enable obtaining a sufficient set of consistently labelled documents within a limited time frame.
Provided herein is a method and system to obfuscate print document content prior to the labelling step. The method and system provides privacy and retains sufficient details of the document content to enable adequate labelling. It thus allows delegating the labelling process to external persons. Furthermore, the disclosed method and system allows disclosing and delivering the obfuscated documents to an external service provider for the automatic document content analysis.
U.S. Patent Publication No. 2014/0247461, published Sep. 4, 2014, by Willamowski et al. and entitled “SYSTEM AND METHOD FOR HIGHLIGHTING BARRIERS TO REDUCING PAPER USAGE”, discloses a system/method for highlighting barriers to reducing paper usage. That publication provides a system and method to help organizations move from paper to digital workflows by (1) identifying recurring paper-based tasks, (2) estimating the impact of each task in terms of paper volume, and (3) identifying the barriers that prevent moving these tasks from paper to digital. U.S. Patent Publication No. 2014/0247461 combines automatic clustering/categorization of print documents with manual labelling of those documents with the corresponding task and reason for printing. One limitation of this method is that, in order to guarantee privacy, only the document owner can be asked to do the labelling. The method and system provided herein mitigates this problem by ensuring privacy through appropriate obfuscation of the document content, thereby allowing subject matter experts other than the document owner to label the print documents.
U.S. Pat. No. 8,666,992, issued Mar. 4, 2014, by Serrano et al., and entitled “PRIVACY PRESERVING METHOD FOR QUERYING A REMOTE PUBLIC SERVICE” discloses a privacy-preserving method for processing a multimedia document by a public remote service: The objective here is to submit a multimedia document (image, sound, and video) to a remote service (similar document search, document categorization, etc.) without revealing its content. The method makes use of an external database to first select documents similar to the private document, then submits the returned similar documents to the remote service and finally collects and combines the results returned from the service constituting a proxy of the results that would have been obtained by using the private document directly. In contrast, the method and system disclosed herein retains as much detail as possible and/or necessary from the original document in order to enable humans to visualize, annotate and process the document content properly.
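The three steps attributed above to U.S. Pat. No. 8,666,992 (select similar public documents, submit them to the remote service, combine the returned results) can be sketched as follows. This is a minimal illustration under stated assumptions and not the patented implementation: documents are represented as sets of features, similarity is measured with a Jaccard index, results are combined by majority vote, and the names `jaccard`, `query_via_proxies`, `public_db` and `remote_service` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two feature collections (assumed representation)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_via_proxies(private_features, public_db, remote_service, k=3):
    """Obtain a proxy result for a private document without submitting it.

    public_db: dict mapping a document id to its features (hypothetical public corpus).
    remote_service: callable taking a document's features and returning a label.
    """
    # Step 1: locally select the k public documents most similar to the private one.
    ranked = sorted(
        public_db,
        key=lambda doc_id: jaccard(private_features, public_db[doc_id]),
        reverse=True,
    )
    proxies = ranked[:k]
    # Step 2: submit only the public proxy documents to the remote service.
    labels = [remote_service(public_db[doc_id]) for doc_id in proxies]
    # Step 3: combine the returned results (here, by majority vote).
    return Counter(labels).most_common(1)[0][0]


# Hypothetical data used only to exercise the sketch.
public_db = {
    "a": {"invoice", "total", "tax"},
    "b": {"invoice", "amount", "tax"},
    "c": {"recipe", "flour"},
    "d": {"invoice", "total"},
}
seen_by_service = []


def remote_service(features):
    seen_by_service.append(set(features))
    return "invoice" if "invoice" in features else "other"


result = query_via_proxies({"invoice", "total", "tax", "secret"}, public_db, remote_service)
```

In this sketch the remote service only ever receives public documents, so the private feature `secret` never leaves the caller. The corresponding cost is that the service never sees the details of the private document itself, which is the limitation the method and system disclosed herein addresses by retaining document detail.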
U.S. Pat. No. 8,812,870, issued Aug. 19, 2014, by Jean-Luc Meunier et al. and entitled “CONFIDENTIALITY PRESERVING DOCUMENT ANALYSIS SYSTEM AND METHOD” discloses a confidentiality preserving document analysis service where a document owner desires an external service to process a document without disclosing the contents of the document to the external service. The method encrypts the document content prior to sending the document to the external service, and decrypts the returned content and/or re-constructs the output document from the result provided by the external service. U.S. Pat. No. 8,812,870 is based on the distinction between document meta-data and document content, and assumes that the meta-data can be disclosed while the document content is encrypted. The meta-data typically consists of localization information that can be used by the remote external service to analyze the document structure without knowing and exploiting the textual content. In contrast, and as discussed with regard to U.S. Pat. No. 8,666,992, the method and system disclosed herein retains not only the meta-data of the document but also as much of the document content as possible, in order to enable humans to annotate and process the document content, the retained content including only publicly accessible information.