In a variety of situations, stored text may include personal information. Personal information can range from relatively low-sensitivity information, such as a person's name, to highly sensitive information, such as a social security number or a credit card number. An entity that stores text may desire to remove at least some personal information from the stored text, for example to address privacy or liability concerns.
Personal information may be expressed in text in a variety of ways, and many conventional techniques, such as simple rule-based approaches or regular expressions, may not provide sufficient accuracy in identifying different types of personal information. Additionally, removing personal information from text may limit the usefulness of the text for later applications, such as classifying the text.
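As an illustration only, the following minimal Python sketch shows how a simple regular expression may both miss personal information and flag text that is not personal information. The pattern `SSN_PATTERN`, the helper `redact_ssn`, the `[SSN]` placeholder, and the sample sentences are hypothetical and are not drawn from any particular system.

```python
import re

# Hypothetical pattern: matches only the canonical NNN-NN-NNNN layout.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssn(text: str) -> str:
    """Replace every match of SSN_PATTERN with a fixed placeholder."""
    return SSN_PATTERN.sub("[SSN]", text)


# Hypothetical inputs showing one hit, two misses, and one false positive.
samples = [
    "My SSN is 123-45-6789.",                   # canonical layout: redacted
    "My SSN is 123 45 6789.",                   # spaces instead of hyphens: missed
    "My social is one two three, 45, 6789.",    # partly spelled out: missed
    "Order number 987-65-4321 shipped today.",  # not an SSN, but redacted anyway
]

for sample in samples:
    print(redact_ssn(sample))
```

In this sketch, the second and third sentences pass through unchanged even though they may carry the same sensitive number, while the order number in the fourth sentence is replaced although it is not personal information. Replacing a match with a single fixed placeholder also discards what kind of content was removed unless the placeholder itself records it, which may bear on the usefulness of the text for later applications.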
Presently known systems to identify and protect personal information suffer from a number of drawbacks. For example, multiple regulatory schemes exist that may include different definitions of personal information, may address different privacy concerns, and may therefore protect different aspects of personal information. Further, even unintentional releases of personal information can result in significant liability, reputational impact to the host of the information, and even criminal liability in certain circumstances. The required protections for data that includes personal information may be expensive and cumbersome to implement. Accordingly, where grey information is produced, for example information that may include personal information but for which the host is not certain, expensive processes to protect the information may be over-inclusive, resulting in unnecessary costs. Additionally, where personal information is included within other information where it is not expected, for example where a customer, patient, or other entity provides information in an unexpected manner, the overall information may not be sufficiently protected because the host of the information did not recognize or expect that personal information would be included. Additionally, privacy policies of an entity (e.g., a hospital, a social media website, and/or a customer service provider) may exceed or otherwise vary from regulatory schemes, resulting in further complexity in identifying personal or other sensitive information.
Additionally, it may be desirable to share some of the information related to the personal information, such as for studies, data mining, law enforcement requests, development of efficient processes, or other purposes, while it may also be required to keep data including personal information for other purposes. Presently known systems may require either that the full information be shared, with consequent risks and expense related to managing the sharing of the full information, or that the data set be overly redacted, reducing the utility of the information. Presently known systems may also fail to adapt to multiple personal information schemes, and thus may not allow configured data sets to be rapidly prepared with high confidence for sharing in multiple jurisdictions and/or for multiple purposes, each of which may have a distinct set of determinations for which aspects of the data include personal or other sensitive information.
An entity that stores text may desire techniques for accurately identifying and removing personal information from text in a manner that maintains the usefulness of the modified text in later applications, and/or that can be configured for multiple purposes.