Data security has been an important issue for many entities (e.g., governments, businesses, and schools) for many years. In particular, data extrusion or data leak, which broadly refers to inadvertent and/or malicious disclosure of data, has received much attention because of its potentially severe consequences. For example, malicious disclosure of a company's trade secret to a competitor could result in a severe loss of competitive advantage. Furthermore, an entity may incur significant liability because of malicious disclosure of sensitive data (e.g., social security numbers or medical records) of others, such as clients and employees. Many laws and regulations, such as the Sarbanes-Oxley Act, the Health Insurance Portability and Accountability Act (HIPAA), the Gramm-Leach-Bliley Act, and Rule 17a-4 promulgated by the United States Securities and Exchange Commission (SEC), also mandate substantially complete control of certain data in many entities.
Many protocols and procedures have been developed over the years to prevent data leak or data extrusion. These protocols and procedures may also be referred to as data leak prevention policies or data extrusion prevention policies. Conventionally, some entities hire security staff to manually review communications sent out of the entities' networks (e.g., local area networks (LANs)). For example, a security staff member of a company may manually review every electronic mail sent to a recipient outside of the company or between different groups, departments, and individuals within the company. However, this approach suffers from many disadvantages, including low speed and potential compromise of the privacy and/or confidentiality of sensitive data of others. For example, an electronic mail from an employee of a company to the employee's spouse may disclose an ailment suffered by the employee. The security staff member reviewing this electronic mail would learn about the medical condition of the employee in the course of reviewing electronic mails sent out of the company. As a result, the company may incur liability for the invasion of the employee's privacy.
Some conventional data extrusion prevention policies attempt to automate the review process in order to speed it up and to avoid disclosure of personal information to a security staff member. However, the review process has been difficult to automate for various reasons. For example, certain prior art systems use regular-expression-based automatic matching of data formats, such as Social Security numbers or credit card information, to discover sensitive information embedded in content being communicated to unauthorized recipients or agents. While this technique is simple to employ, it is severely limited to the discovery of data that is intrinsically well structured and hence can be represented as a set of regular expressions. This prior art technique is widely used by the Payment Card Industry (PCI). More complex data, such as semi-structured and unstructured data, is extremely difficult, if not impossible, to capture in regular expressions.
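The regular-expression matching described above can be illustrated with a minimal Python sketch. The patterns, the labels, and the function name `find_structured_data` are assumptions made for illustration; they are not taken from any particular prior art system, and a practical system would need stricter validation.

```python
import re

# Hypothetical patterns for illustration only. A real system would add
# further checks (e.g., a checksum test for card numbers).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def find_structured_data(text):
    """Return (label, matched_text) pairs for well-structured data in text."""
    hits = []
    for label, pattern in (("ssn", SSN_PATTERN), ("card", CARD_PATTERN)):
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


# Well-structured data is found by its format alone.
print(find_structured_data("My SSN is 123-45-6789."))

# Unstructured sensitive content has no fixed format, so nothing is reported.
print(find_structured_data("The merger with Acme closes next quarter."))
```

The second call returns an empty list even though the sentence may be highly sensitive, which is the limitation noted above: content without a fixed format cannot be expressed as a pattern.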
Another prior art technique makes use of lexical matching, wherein a pre-specified set of keywords is used to search for and discover sensitive and confidential information embedded in the content of unauthorized communications and/or disclosures between agents. This prior art technique also has major limitations, because a keyword discovered in a set of content may or may not be of any significance, depending on the circumstances. For example, this technique lacks the ability to distinguish a benign use of a keyword from a situation in which an actual leak of sensitive data may be taking place. This prior art technique is hence prone to generating a high number of False Positives and False Negatives.
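The following Python sketch illustrates the keyword-based lexical matching described above and why it tends to produce both False Positives and False Negatives. The keyword set and the function name `keyword_hits` are hypothetical and are chosen only for this example.

```python
import re

# Hypothetical pre-specified keyword list, for illustration only.
SENSITIVE_KEYWORDS = {"confidential", "diagnosis", "merger"}


def keyword_hits(text, keywords=SENSITIVE_KEYWORDS):
    """Return the sorted list of pre-specified keywords that occur in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tokens & set(keywords))


# A genuine leak is flagged.
print(keyword_hits("The merger documents are confidential."))

# False Positive: a benign message is flagged because it contains a keyword.
print(keyword_hits("I enjoyed the book about the studio merger."))

# False Negative: a sensitive message that avoids the keywords is missed.
print(keyword_hits("We are combining with Acme next month."))
```

Because the match considers only the presence of a word and not its context, the benign second message is flagged while the sensitive third message passes unnoticed.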
One prior art technique to automatically review data uses document fingerprinting, which is ineffective at differentiating personal matter from controlled subject matter. Furthermore, a security staff member is still needed to intervene in the process by manually setting various policies (e.g., access control policies) and creating instruction documents (e.g., memoranda and files).
Regardless of which technique is used to discover and identify sensitive information, the existing data leak prevention (DLP) technologies and solutions require extensive manual effort in pre-specifying security policies for each data item and file (e.g., documents, memos, emails, spreadsheets, or any other structured, semi-structured, or unstructured data).
This need for pre-specification of security policies is another major limitation of the prior art techniques and algorithms for Data Leak Prevention (DLP).
There has, therefore, been a long-felt need for techniques that perform better and deeper contextual and conceptual analyses of the content of structured, semi-structured, and unstructured data in a manner that allows for automatic creation of appropriate security policies and automatic application of those security policies in real time with the fewest possible False Positives and False Negatives.