Applications in machine learning, information retrieval, text processing, data mining and Natural Language Processing (NLP) research typically require large amounts of data for proper testing and validation of the technical routines that implement them. The data most desired for testing and validating the efficiency and correctness of such techniques and routines in a business domain is typically real-life data that includes confidential information. Examples of such confidential information are customer transaction information, customer preferences, customer feedback and survey feedback. The health domain is another area where data sharing is useful for providing better services to the customer but is restricted because of privacy considerations. While some text corpora are available in the public domain in specific areas (for example, the Enron database comprising emails of Enron employees), in general the lack of available real-life data and the conflict between the needs of data privacy and data sharing are impediments to research and development of applications in these fields. These problems also prevent the full use of applications that require sharing of information, such as those in the health domain. In addition, many enterprises today outsource parts of their business applications to third parties for efficiency and cost reasons. The data that needs to be shared may contain personal or sensitive information, and it may not be legally permissible to share the data as such with the third parties. In these cases, it is useful to have techniques that clean the data of sensitive information before it is made public or otherwise disseminated.
Data sanitization or data obfuscation techniques remove or replace the sensitive text or information in confidential documents, so that neither the identifying information nor the confidential information is exposed. When these desensitized or obfuscated documents are shared, the end users are not able to gather any personal information related to individual data entities. In some cases these techniques have also been extended to clean data in such a manner that even aggregate information cannot be gleaned from the sanitized data. These techniques are also referred to as data anonymization, data cleaning or desensitization. They are designed such that no confidential information is disclosed to the end user, but enough information is retained for other analytical and processing applications that the end user may wish to perform on the data.
However, current obfuscation techniques are widely available only for numerical data and for text data in structured format (typically in relational tables). In the few cases where obfuscation is done on unstructured text, it has been restricted to simply removing the sensitive information from the original text and replacing it with blanks or dummy tags. This can cause the form of the original document to be lost.
These current obfuscation techniques broadly fit into two categories, based on the type of data addressed: numerical data and text data. The standard techniques used in both categories include data randomization, data swapping and data anonymization, where the sensitive data is replaced with a fixed value or an interval of values. The overall goal is to prevent the reconstruction of the initial data. For numerical data, the objective has been to obscure information at the level of individual records while preserving aggregate properties for various data mining applications. Examples include preserving statistical properties such as the mean and the variance.
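As an illustration of obscuring individual records while preserving aggregate properties, the following is a minimal sketch of data swapping on a single numerical column. The function name `swap_values`, the sample salary figures and the fixed seed are assumptions made for this example and are not taken from any particular prior-art system. Because swapping only reassigns the existing values among records, the mean and the variance of the column are unchanged.

```python
import random
import statistics


def swap_values(values, seed=0):
    """Return a shuffled copy of a numeric column (hypothetical helper).

    The values themselves are kept; only their assignment to records changes.
    """
    rng = random.Random(seed)
    swapped = list(values)
    rng.shuffle(swapped)
    return swapped


# Illustrative data only: one sensitive numeric column, one value per record.
salaries = [52000, 61000, 47500, 88000, 73250, 55000]
swapped = swap_values(salaries)

# The multiset of values is identical, so aggregate statistics are preserved.
print(statistics.mean(salaries) == statistics.mean(swapped))
print(statistics.pvariance(salaries) == statistics.pvariance(swapped))
```

A random shuffle may leave some values in their original positions, and relationships between this column and other columns are generally not preserved, so this sketch shows only the single-column case.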
For text data, the work done may be classified broadly under the headings of structured data (data available in relational databases) and unstructured data (such as plain text documents). Most of the existing work focuses on anonymization of structured data. The main application area has been the health domain. Anonymization of medical information has been performed using techniques of generalization and suppression, where the anonymization is achieved to the extent that a particular record cannot be identified within k other records, for some predetermined number k. Other approaches use techniques from information retrieval for entity identification and subsequent replacement by some dummy text, and techniques for obfuscation of sensitive information in spoken language databases (text documents of speech recordings).
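The generalization and suppression approach described above can be sketched as follows. This is an illustrative example only: the record layout (age, ZIP code, diagnosis), the 10-year age bands, the 3-digit ZIP prefix and the function names are all assumptions, and the criterion used is the common reading of this property (often called k-anonymity) in which every released record shares its generalized identifying fields with at least k-1 other records.

```python
from collections import Counter


def generalize(record):
    """Coarsen identifying fields: age to a 10-year band, ZIP to a 3-digit prefix."""
    age, zip_code, diagnosis = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:3] + "**", diagnosis)


def k_anonymize(records, k):
    """Generalize all records, then suppress those in groups smaller than k."""
    generalized = [generalize(r) for r in records]
    group_sizes = Counter((age, z) for age, z, _ in generalized)
    return [r for r in generalized if group_sizes[(r[0], r[1])] >= k]


# Illustrative records only: (age, ZIP code, diagnosis).
records = [
    (34, "21201", "asthma"),
    (37, "21209", "diabetes"),
    (31, "21230", "flu"),
    (58, "90210", "asthma"),
]

released = k_anonymize(records, k=3)
for row in released:
    print(row)
```

In this sketch the first three records fall into the same generalized group and are released, while the fourth record is alone in its group and is suppressed.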
In the case of structured text in a relational format, the task of obfuscation is in some instances very straightforward, for example where a column containing sensitive information is completely hidden or deleted before the data is published. In the case of unstructured information, the task is more complex, since the sensitive information must first be identified before it can be replaced. Once the sensitive information is identified, some of the techniques used for replacement include:
    Simple deletion: each occurrence of personal identifiable information is deleted; for example, ‘Dear Jane’→‘Dear . . . ’;
    Fixed transformation: each instance of the information to be hidden is replaced as in, ‘Dear Jane’→‘Dear<NULL>’ or ‘Dear<Person>’;
    Partial masking: some parts of the information are replaced, as for example, the date column in a date field comprising date, month and year, or the location code in a telephone number (e.g., 410-788-5230→410-2-2X).
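The three replacement techniques listed above can be sketched as follows. This is a minimal illustration that assumes the sensitive entity has already been identified and is passed in as a string. The function names, the `<Person>` default tag and the choice to keep the first three digits of a telephone number while masking the rest with `X` are assumptions made for this example, not a prescribed format.

```python
import re

# Assumed telephone format for this sketch: three groups of digits, 3-3-4.
PHONE = re.compile(r"\b(\d{3})-\d{3}-\d{4}\b")


def simple_deletion(text, entity):
    """Replace each occurrence of the entity with an ellipsis placeholder."""
    return text.replace(entity, ". . .")


def fixed_transformation(text, entity, tag="<Person>"):
    """Replace each occurrence of the entity with the same fixed tag."""
    return text.replace(entity, tag)


def partial_mask_phone(text):
    """Keep the first digit group of each phone number and mask the rest."""
    return PHONE.sub(lambda m: m.group(1) + "-XXX-XXXX", text)


print(simple_deletion("Dear Jane", "Jane"))
print(fixed_transformation("Dear Jane", "Jane"))
print(partial_mask_phone("Call 410-788-5230 today"))
```

In each case the placeholder or mask remains visible in the output, so a reader can see where the sensitive information originally appeared.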
With the above techniques, some data is lost in the transformed text; this loss is the price paid for efficiency and for preserving privacy. Further, when an end user accesses the document, whether intentionally or unintentionally, the user can tell which parts of the text contained the sensitive information.