1. Technical Field
This disclosure relates generally to protection against unintentional disclosure of confidential information in a computing environment.
2. Background of the Related Art
The types of sensitive information that arise in a typical enterprise environment may be quite varied. Such information includes, without limitation, intellectual property (e.g., code, designs, documentation, other proprietary information), identity information (e.g., personally identifiable information (PII)), credit card information (such as PCI-related data), health care information (such as HIPAA-related data), finance information (such as GLBA-related data), and the like. Often, it is desired to maintain some or all of that information as “confidential”—i.e., known only within the enterprise, or to certain permitted individuals or systems within the enterprise.
A problem arises, however, when referencing a document that contains both confidential information and non-confidential information, especially when that document may need to reside (in whole or in part) external to the enterprise (or to a particular system thereof). Consider, for example, a customer filing a problem management record (PMR) with an external support provider. That problem record, which may have been generated in an automated manner, may include both confidential information and non-confidential information about the problem. The non-confidential information, if it could be extracted, may have independent value (e.g., if published in a support note). In such a case, however, it would be necessary to remove or redact the confidential information. Removing or redacting the confidential information manually is prone to errors of identification and unintentional omission. Publication of even a seemingly innocuous piece of information can create a significant legal or financial liability.
Existing confidential data detection solutions rely on various strategies to prevent confidential information from being disclosed inadvertently. In one approach, a list of confidential items or terms is used; a document is compared against this list to identify portions that might require omission or redaction. Assembling and maintaining such a list, however, are non-trivial tasks. Another approach is to run the document through a simple tool, such as a spellchecker, to expose irregularities (which then might be acted upon proactively). This approach, however, produces a large number of both false positives and false negatives. Yet another approach involves data string matching, e.g., searching for and removing terms matching a particular format (e.g., (###) ###-####), but this approach is narrow in scope. Other known approaches involve machine learning systems, pattern matching, and the like.
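By way of illustration only, the list-based and format-based approaches described above might be sketched as follows. This is a minimal sketch and not an implementation drawn from any particular existing solution; the term list, the placeholder string, the function names, and the sample record are all hypothetical and chosen solely for the example.

```python
import re

# Hypothetical list of confidential terms; in practice such a list must be
# assembled and maintained by the enterprise.
CONFIDENTIAL_TERMS = ["Project Falcon", "acme-db-01"]

# Regular expression for the (###) ###-#### format mentioned above.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

# Hypothetical placeholder substituted for each redacted span.
REDACTED = "[REDACTED]"


def redact_by_list(text, terms=CONFIDENTIAL_TERMS):
    """Replace every occurrence of a listed term (case-insensitive)."""
    for term in terms:
        text = re.sub(re.escape(term), REDACTED, text, flags=re.IGNORECASE)
    return text


def redact_by_format(text, pattern=PHONE_PATTERN):
    """Replace every substring matching the given format."""
    return pattern.sub(REDACTED, text)


if __name__ == "__main__":
    record = "Server acme-db-01 failed; call (555) 123-4567 or 555.123.4567."
    print(redact_by_format(redact_by_list(record)))
    # prints: Server [REDACTED] failed; call [REDACTED] or 555.123.4567.
```

In this sketch the second telephone number, written in a different format, is not redacted, which illustrates why format matching is narrow in scope; likewise, any confidential term absent from the list passes through unchanged.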
There remains a need in the art to provide for enhanced techniques to identify and distinguish confidential and non-confidential information from within a document (or, more generally, a data item) so that the confidential information may remain protected against inappropriate disclosure.