The recent, unprecedented increase in the availability of information regarding entities (whether individuals, organizations, etc.) has led to significant interest in techniques for protecting privacy when such information is made public and/or shared with others. Currently, many of the techniques for protecting privacy have arisen in the context of structured data, such as databases and the like. For example, U.S. patent application Ser. No. 12/338,483, co-owned by the assignee of the instant application, describes an anonymization technique that may be applied to structured data. Likewise, k-anonymity techniques are known whereby values of certain attributes in a table can be modified such that every record in the table is indistinguishable from at least k−1 other records. Further still, so-called l-diversity may be employed to ensure that sensitive data about an entity cannot be inferred through use of strong background knowledge (i.e., known facts about an entity that an attacker can use to infer further information based on redacted information) by ensuring sufficient diversity in the sensitive data.
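By way of illustration only, the two properties described above may be checked on a small table as sketched below. The sketch assumes that the identifying attributes (often called quasi-identifiers, a term not used above) have already been generalized, and it uses the simplest "distinct" form of l-diversity, in which each group of indistinguishable records must contain at least l different sensitive values. The function names, attribute names, and sample records are hypothetical and are not taken from any of the referenced techniques.

```python
from collections import defaultdict


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records (each record is then
    indistinguishable from at least k-1 others)."""
    counts = defaultdict(int)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        counts[key] += 1
    return all(count >= k for count in counts.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Return True if every group of indistinguishable records holds
    at least l distinct values of the sensitive attribute
    (the 'distinct' variant of l-diversity)."""
    values = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        values[key].add(record[sensitive])
    return all(len(group) >= l for group in values.values())


# Hypothetical, already-generalized table (illustrative data only).
records = [
    {"zip": "476**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "476**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "479**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "479**", "age": "30-39", "diagnosis": "flu"},
]

k2 = is_k_anonymous(records, ["zip", "age"], 2)
k3 = is_k_anonymous(records, ["zip", "age"], 3)
l2 = is_l_diverse(records, ["zip", "age"], "diagnosis", 2)
```

In this sample, each combination of generalized attributes is shared by two records, so the table satisfies the k-anonymity check for k = 2 but not for k = 3. The second group contains only a single diagnosis, so the table fails the l-diversity check for l = 2: an attacker who knows that an entity falls within that group can infer the sensitive value despite the k-anonymity.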
In addition to structured data, organizations such as intelligence agencies, government agencies, and large enterprises also need to redact sensitive information from unstructured and semi-structured documents (i.e., natural language text) before releasing them to other entities, particularly those outside their own organizations. For example, confidentiality rules often stipulate that, before a document is released to external organizations (or to the public), the identity of the source as well as specific source confidential information (collectively referred to hereinafter as sensitive data or sensitive concepts) must be removed from the document. Thus, a user must remove any uniquely identifying information that an attacker could use to infer the identity of the source. In such a process there is necessarily a tradeoff between redacting enough information to protect the sensitive concept and avoiding over-redaction to the point where the utility of the document (i.e., its usefulness for accurately conveying information regarding one or more specific concepts) has been eliminated.
Although manual document sanitization is well known in the art, it is a laborious, time-consuming process that is prone to human error. To address this shortcoming, various automated redaction methods for use with natural language text, based on data mining, machine learning and related techniques, are known in the art. For example, k-anonymity has been applied to “unstructured” data by essentially treating natural language text data as a form of database record. Still other techniques are known whereby desired levels of privacy are achievable. However, these techniques typically suffer from a significant loss in utility in the resulting redacted text.
Thus, it would be desirable to provide techniques that are effective for redacting natural language text while simultaneously balancing protection of sensitive information with preservation of utility of the original text.