1. Technical Field
The present invention relates to anonymizing selected information or content within a document, and more particularly to identifying and concealing, by appropriate means, all sensitive or critical content in the document based on user access privileges and a context, such that the document may be distributed to a broader audience.
2. Description of the Related Art
Documents containing private and sensitive information occasionally need to be released to a broader audience. U.S. Pat. No. 7,184,947 describes a document anonymity setting device comprising a document input means for inputting a document, a specificity calculating means for extracting an expression specifying a person from the input document and calculating a specificity that evaluates the degree to which the expression specifies a person, and an anonymity setting processing means for rewriting, with anonymity setting, an expression in the input document having a specificity greater than a predetermined threshold. The specificity calculating unit extracts a person name and a modification expression from the input document and calculates a specificity that evaluates the degree to which the extracted person name and modification expression can specify a person. The anonymity setting processing unit rewrites a person name and a modification expression whose specificity is greater than a predetermined threshold by rewriting to a meaningless expression, rewriting to a low-specificity expression, or rewriting to an encrypted expression. That patent addresses the problem of automatically identifying sensitive personal information in a given document. This is done by first identifying personal names and modifying expressions via lexical and syntactic analysis. Next, the probability that these identify a specific person is calculated. Phrases whose probability exceeds a threshold are removed.
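The threshold step described above can be illustrated with a minimal sketch. It is not the implementation of the cited patent: the function name `anonymize`, the placeholder string, the default threshold, and the specificity scores are all hypothetical, and the scores are assumed to have been computed already by some prior lexical and syntactic analysis.

```python
def anonymize(text, specificity, threshold=0.5, placeholder="[REDACTED]"):
    """Replace each expression whose specificity exceeds the threshold.

    `specificity` maps an extracted expression (a person name or a
    modification expression) to a hypothetical score in [0, 1].
    """
    # Handle longer expressions first so that a short expression cannot
    # split a longer one that contains it.
    for expr in sorted(specificity, key=len, reverse=True):
        if specificity[expr] > threshold:
            # "Rewriting to a meaningless expression": swap in a placeholder.
            text = text.replace(expr, placeholder)
    return text


# Hypothetical scores for three extracted expressions.
scores = {
    "Jane Smith": 0.9,
    "the branch manager in Osaka": 0.7,
    "a customer": 0.1,
}
sample = "Jane Smith, the branch manager in Osaka, replied to a customer."
print(anonymize(sample, scores))
```

Only the two expressions scoring above the threshold are replaced; the low-specificity phrase "a customer" is left intact. Rewriting to a low-specificity or encrypted expression, the other two options mentioned above, would substitute a different replacement string at the same point.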
Document data handled in recent years, such as questionnaire answers, complaints, or electronic mail, often include personal information. A company's existence can be threatened if such personal information leaks out of the company. Therefore, it is necessary to properly conceal personal information before analyzing the document data. Conventionally, personal information included in document data, such as a person name, phone number, or credit card number, has been concealed manually. In conventional concealment, however, it is hard for a worker to decide whether a described person name, or a modification expression related to a person, is information to be protected as personal information or does not need protection, as with information about a public person. As a result, the appropriateness of the concealment varies from worker to worker. For this reason, a worker who conceals personal data must have skill and knowledge above a certain level, and the cost of concealing personal information manually tends to increase.
For example, right-to-information regulations in most countries allow the general public to request access to government documents. In most cases, such documents contain sensitive information that is not critical to the information sought. There is therefore a need to sanitize (redact) the document by removing terms that tend to disclose sensitive information. The sanitized document gives away limited information while withholding the sensitive information in the document. FIG. 1 illustrates an example U.S. government document 100 that has been sanitized prior to release. The document 100 contains content or information 110 that is visible to a reader and content that has been blackened 120 and is not visible to a reader of the document. The document 100 is a typical example of a sanitized document that gives limited information to a reader. In this particular case, the sanitized document 100 gives limited information, such as the purpose and the funding amount, on an erstwhile secret medical research project, while hiding the names of the funding sources and of the principal investigators and their affiliations, which need not be disclosed to general readers of the document.
A disadvantage of known systems and methods is that documents are sanitized manually, which makes the process subjective and prone to errors of judgment. Moreover, given the amount of effort involved and the limited supply of qualified reviewers, manual sanitization is an expensive and time-consuming process. Therefore, without an improved method of sanitizing documents, and specifically the content within a document, the promise of this technology may never be fully achieved.