A modern organization typically maintains a data storage system to store and deliver sensitive information concerning various significant business aspects of the organization. Sensitive information may include data on customers (or patients), contracts, deliveries, supplies, employees, manufacturing, or the like. In addition, sensitive information may include intellectual property (IP) of an organization such as software code developed by employees of the organization, documents describing inventions conceived by employees of the organization, etc.
Organizations invest significant effort in installing data loss prevention (DLP) components, especially on important machines where confidential data is generated, but they may not be able to protect every computer in the enterprise, for reasons such as a large number of different platforms or operating systems (OS), machine outages, rapid and dynamic provisioning of virtual machines, and the lack of clear, individual accounting for test and lab machines. DLP technologies apply configurable rules to identify objects, such as files, that contain sensitive data, that should not be found outside a particular enterprise or a specific set of host computers or storage devices, and that should be accessible only to users who are authorized for the purpose. Even when these technologies are deployed, it is possible for sensitive objects to ‘leak’. Occasionally leakage is deliberate and malicious, but often it is accidental. For example, in today's global marketplace environment, a user of a computing system transmits data, knowingly or unknowingly, to a growing number of entities outside the computer network of an organization or enterprise. Previously, the number of such entities was very limited, and they operated within a very safe environment. For example, each person in an enterprise would have just a single desktop computer, with a limited number of software applications of predictable behavior installed on it. More recently, communications between entities have become complex and difficult for a human to monitor.
Conventional DLP systems typically use three methods to detect sensitive information in unstructured data (such as documents): 1) described content matching (e.g., regular expressions, keyword dictionaries); 2) content fingerprinting; and 3) machine-learning based content classification. These methods are effective when the information to be protected is exactly known, when it can be described exactly using regular expressions and/or keyword dictionaries, or when similar content has been used to train the machine-learning based classifier. They lose their effectiveness the moment new information appears that is sensitive but not known to the DLP system a priori. For example, in a software development firm new design documents are created frequently, and most of the time the content to be protected is completely new to the DLP system, which is then unable to identify it as protected content. The methods described above do not perform a blanket identification of such design documents so that they may be protected from data loss. Similarly, pay statements are generated for each employee every month, but each one is unique and unknown to the DLP system. Currently such information is protected using described content matching techniques (e.g., regular expressions, keyword dictionaries), but the effectiveness of these techniques is limited and they have a high rate of false positives.
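The limitation described above can be illustrated with a minimal sketch of the first two methods. Everything in it is an assumption made for illustration: the pattern, the keyword list, the registered document, and the function names are hypothetical, and whole-document hashing stands in for content fingerprinting, which in practice may operate on partial or rolling hashes. The sketch is not taken from any particular DLP product.

```python
import hashlib
import re

# Hypothetical described-content rules (illustrative assumptions only).
ID_NUMBER_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
KEYWORDS = ("confidential", "pay statement")


def matches_described_content(text: str) -> bool:
    """Method 1: regular expression and keyword dictionary matching."""
    lowered = text.lower()
    if ID_NUMBER_PATTERN.search(text):
        return True
    return any(keyword in lowered for keyword in KEYWORDS)


def fingerprint(text: str) -> str:
    """Method 2 (simplified): hash of the case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Fingerprints of documents registered with the system ahead of time.
KNOWN_FINGERPRINTS = {
    fingerprint("Project Falcon design: the cache layer uses write-back."),
}


def is_sensitive(text: str) -> bool:
    """Flag text that matches a rule or a previously registered fingerprint."""
    return matches_described_content(text) or fingerprint(text) in KNOWN_FINGERPRINTS


if __name__ == "__main__":
    samples = [
        "Employee ID number 123-45-6789",
        "project falcon design:   the cache layer uses write-back.",
        "Project Osprey design: the scheduler uses a priority queue.",
    ]
    for sample in samples:
        print(is_sensitive(sample), "-", sample)
```

In this sketch the first sample is caught by the pattern and the second by its fingerprint, while the third, a newly written design document, passes undetected because it matches no rule and was never registered.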