A modern organization typically maintains a data storage system to store and deliver sensitive information concerning various significant business aspects of the organization. Sensitive information may include data on customers (or patients), contracts, deliveries, supplies, employees, manufacturing, or the like. In addition, sensitive information may include intellectual property (IP) of an organization such as software code developed by employees of the organization, documents describing inventions conceived by employees of the organization, etc.
Organizations invest significant effort in installing data loss prevention (DLP) components, especially on important machines where confidential data is generated, but they may not be able to protect every computer in the enterprise, for reasons such as a large number of different platforms or operating systems (OS), machine outages, rapid and dynamic provisioning of virtual machines, and the lack of clear, individual accounting for test and lab machines. DLP technologies apply configurable rules to identify objects, such as files, that contain sensitive data and should not be found outside of a particular enterprise or a specific set of host computers or storage devices. Even when these technologies are deployed, it is possible for sensitive objects to ‘leak’. Occasionally, leakage is deliberate and malicious, but often it is accidental. For example, in today's global marketplace environment, a user of a computing system transmits data, knowingly or unknowingly, to a growing number of entities outside a computer network of an organization or enterprise. Previously, the number of entities was very limited, and communication took place within a very safe environment. For example, each person in an enterprise would have just a single desktop computer, with a limited number of software applications installed on it, each with predictable behavior. More recently, communications between entities have become complex and difficult for a human to monitor.
Conventional DLP systems typically use three methods to detect sensitive information in unstructured data (such as documents): 1) described content matching (e.g., regular expressions, keyword dictionaries); 2) fingerprinting; and 3) machine-learning based content classification. These methods are effective when the information to be protected is exactly known, can be described exactly using regular expressions and/or keyword dictionaries, or is similar to content that has been used for training the machine-learning based classifier. They lose their effectiveness the moment there is new information that is sensitive but not known to the DLP system a priori. For example, in a software development firm, new design documents are created frequently, and most of the time the content to be protected is completely new to the DLP system, which is then unable to identify it. These methods do not perform a blanket identification of such design documents so that they may be protected from data loss. Similarly, pay statements are generated for each employee every month, but each one is unique and unknown to the DLP system. Currently, such information is protected using described content matching techniques (e.g., regular expressions, keyword dictionaries), but their effectiveness is limited and they have a high rate of false positives.
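The limitation described above can be illustrated with a minimal sketch of the first two detection methods. It is not taken from any particular DLP product: the pattern, the keyword list, the sample documents, and all function names are hypothetical, and the fingerprint here is a simple whole-document hash, which is only one of several possible fingerprinting approaches.

```python
import hashlib
import re

# Hypothetical rule set; the pattern and keywords are illustrative only.
ID_NUMBER_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
KEYWORDS = {"confidential", "pay statement"}


def matches_described_content(text: str) -> bool:
    """Described content matching: regular expression plus keyword dictionary."""
    lowered = text.lower()
    return bool(ID_NUMBER_PATTERN.search(text)) or any(k in lowered for k in KEYWORDS)


def fingerprint(text: str) -> str:
    """Whole-document fingerprint over case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Fingerprints exist only for documents registered with the system in advance.
KNOWN_FINGERPRINTS = {
    fingerprint("Project Falcon design: the cache uses a two-level index."),
}


def matches_fingerprint(text: str) -> bool:
    return fingerprint(text) in KNOWN_FINGERPRINTS


def is_sensitive(text: str) -> bool:
    return matches_described_content(text) or matches_fingerprint(text)


# A registered document is caught even after reformatting.
known_copy = "Project  Falcon design:\nthe cache uses a two-level index."
# A newly written design document matches no rule and no fingerprint.
new_design = "Project Osprey design: the scheduler uses a priority queue."
# A harmless public text trips the keyword rule (a false positive).
press_release = "The word confidential appears in this public press release."
```

In this sketch, `is_sensitive(known_copy)` is true, `is_sensitive(new_design)` is false even though the new document is just as sensitive, and `is_sensitive(press_release)` is true even though nothing in it needs protection.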
There are conventional DLP products that allow classification of email messages by assigning TAGS (for example, by updating SMTP headers). In these products, the sender typically classifies the email message by attaching a TAG before sending it to others. For example, a CEO may want to send an internal and confidential email message to employees of an organization, where the content of the email message must not leave the organization. With conventional DLP products, an employee may be unable to forward this email message to a private mail account because email filters detect the TAG, but the employee may still be able to print out the email message, or copy and paste the content into a document and send the document outside of the organization as an attachment, thereby circumventing the email filters. Thus, the TAGS can be lost or circumvented.
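The circumvention described above can be sketched with Python's standard `email` package. This is a minimal illustration under stated assumptions, not the behavior of any specific product: the header name `X-Classification`, the tag value, the addresses, and the function names are all hypothetical, and a pasted message body stands in for the copied document.

```python
from email.message import EmailMessage
from typing import Optional

# Hypothetical header name and tag value; real products define their own.
TAG_HEADER = "X-Classification"
BLOCKED_TAGS = {"internal-confidential"}


def make_message(body: str, recipient: str, tag: Optional[str] = None) -> EmailMessage:
    """Build a message, optionally classified by the sender with a TAG header."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = recipient
    msg["Subject"] = "Company update"
    if tag is not None:
        msg[TAG_HEADER] = tag
    msg.set_content(body)
    return msg


def outbound_filter_allows(msg: EmailMessage) -> bool:
    """Tag-based filter: inspects only the classification header, not the content."""
    tag = msg.get(TAG_HEADER, "")
    return tag.strip().lower() not in BLOCKED_TAGS


# The original message carries the TAG, so forwarding it outside is blocked.
tagged = make_message(
    "Internal plans for next quarter.",
    "staff@example.com",
    tag="internal-confidential",
)

# Copying the text into a fresh message drops the TAG but keeps the content.
pasted = make_message(tagged.get_content(), "private@example.org")
```

Here `outbound_filter_allows(tagged)` is false while `outbound_filter_allows(pasted)` is true, although both messages carry the same text: the classification lives in the header rather than in the content, so it does not survive a copy.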