Document anonymization involves removing personally identifying information from a document. A document is typically anonymized prior to publication or other widespread dissemination because of legal and/or privacy considerations. For example, medical records may be anonymized before public release to protect the medical privacy of patients. As another example, French law mandates that judicial decisions be anonymized prior to public release.
Document anonymization is a difficult task in part because some personally identifying information may be properly retained, while other personally identifying information should be anonymized. For example, when anonymizing a published judicial decision, information identifying the judge and the lawyers is typically retained, while information identifying clients and witnesses is removed. In the medical area, anonymization may remove information identifying patients while retaining information identifying medical personnel or medical facilities such as hospitals.
Document anonymization is also difficult because of linkages between entities named in a document. For example, a location typically should not be anonymized. However, the location may be contextually associated with a private person in a way that would indirectly identify the person, even with the person's name removed. For example, in the sentence:

    In response, John Doe indicated that he would use his authority as mayor of Mayberry to block the new construction project.

the name “John Doe” is a pseudonym for a real person who is to remain anonymous. However, by retaining the named location “Mayberry” the allegedly anonymized sentence still identifies the person, since the context shows that “John Doe” is the mayor of Mayberry, and the identity of the person holding that position is generally known. Similarly, the retention of dates, locations, titles, numbers, and so forth may, or may not, provide improper cues as to identity, depending upon context.
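The two difficulties described above, role-dependent retention and contextual linkage, can be illustrated with a minimal sketch. It is not a method taken from this document: the `Entity` record, the `RETAINED_ROLES` policy, the `linked_to` field, and the placeholder format are all hypothetical names chosen for illustration, and the entity annotations are assumed to be supplied by hand rather than detected automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A hand-annotated entity mention (hypothetical structure)."""
    text: str            # surface string as it appears in the document
    kind: str            # "person" or "location"
    role: str = ""       # for persons: "judge", "lawyer", "client", "witness", ...
    linked_to: str = ""  # text of a person this entity is contextually tied to


# Assumed policy: persons in these roles keep their names.
RETAINED_ROLES = {"judge", "lawyer"}


def must_anonymize(entity, entities):
    """Decide whether one entity should be replaced by a placeholder."""
    if entity.kind == "person":
        return entity.role not in RETAINED_ROLES
    # A non-person entity is normally retained, unless it is linked to a
    # person who is being anonymized and could therefore reveal that person.
    by_text = {e.text: e for e in entities}
    owner = by_text.get(entity.linked_to)
    return (
        owner is not None
        and owner.kind == "person"
        and owner.role not in RETAINED_ROLES
    )


def anonymize(sentence, entities):
    """Replace every entity that must be anonymized with a numbered placeholder."""
    counters = {}
    for e in entities:
        if must_anonymize(e, entities):
            counters[e.kind] = counters.get(e.kind, 0) + 1
            placeholder = f"[{e.kind.upper()}_{counters[e.kind]}]"
            sentence = sentence.replace(e.text, placeholder)
    return sentence


sentence = (
    "In response, John Doe indicated that he would use his authority "
    "as mayor of Mayberry to block the new construction project."
)
entities = [
    Entity("John Doe", "person", role="witness"),
    Entity("Mayberry", "location", linked_to="John Doe"),
]
result = anonymize(sentence, entities)
```

In this sketch the location is replaced only because it is annotated as linked to an anonymized person; an unlinked location, or one linked to a judge or lawyer, would be left in place. Deciding when such a link exists is the context-sensitive part that the sketch leaves to the annotator.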
Heretofore, document anonymization has typically been a manual procedure, due to the context-sensitive nature of the process, the wide range of variables involved in determining whether a particular entity should be removed, and the importance of avoiding inadvertent disclosure of private information. However, manual anonymization is labor-intensive. Publishers of anonymized documents would benefit from methods and apparatuses for providing automated assistance in the anonymization process.