Data Leak Prevention (“DLP”) seeks to detect potential data leaks or data breaches, in which sensitive data (e.g., confidential data and/or data having a business value) is disclosed to unauthorized entities (typically outside a company). The data may leave the perimeter in a variety of ways such as, for example, via USB keys, by email, or by upload to an external website. For the purposes of description, the term “perimeter” may refer to the external boundary of an organization. In a network context, it may refer to the point at which the internal network connects with an external network (e.g., the Internet). It can also refer to a delineation within an organization between divisions, or between areas having differing security postures.
There are three types of conventional DLP solutions typically available: (a) network DLP, which monitors the egress points at the perimeter to detect unauthorized data traversing defined boundaries; (b) endpoint DLP, which runs on an end user's device and monitors the end user's behaviour and communications, blocking attempts to move sensitive data via unauthorized means such as, for example, USB keys or instant messaging; and (c) storage-based DLP, which deals with data residing on a server or device. Storage-based DLP solutions may, for example, mitigate the risk that a person's computer (e.g., a laptop computer) goes missing and an unauthorized party retrieves the sensitive data directly from the storage (e.g., hard drive) of the computer. For the purposes of description, the term “defined boundaries” may refer to the delineations that can exist within an organization, between organizations, or between an organization and the outside world. They are essentially enforcement points at which the flow of traffic may be controlled because of differences in sensitivity.
Although conventional enforcement and blocking mechanisms are relatively mature, the mechanisms by which these solutions determine what is and what is not sensitive data are relatively immature and are typically based primarily on heuristics (such as word scanning, data types, and pattern matching against a lexicon of sensitive terms, or similar). Two specific examples are scanning an email for credit card numbers and scanning for data that resembles patient records in a hospital. The drawbacks of this approach are that: (a) it typically requires a human to determine the uniquely identifying characteristics of the data that must be prevented from being leaked; and (b) it typically does not evolve or adapt over time without human intervention to update the filters, heuristics or similar.
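For illustration only, the kind of heuristic described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the detection logic of any particular DLP product: the lexicon `SENSITIVE_TERMS`, the pattern `CARD_CANDIDATE`, and the functions `luhn_valid` and `scan` are hypothetical names introduced here, and the Luhn checksum is one common way to reduce false positives when scanning for card-like digit runs.

```python
import re

# Hypothetical, hand-maintained lexicon of sensitive terms (an assumption
# for this sketch; a real deployment would define its own).
SENSITIVE_TERMS = {"confidential", "patient record", "diagnosis"}

# Candidate card numbers: 13 to 19 digits, optionally separated by
# single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan(text: str) -> list[str]:
    """Return human-readable findings for the given text."""
    findings = []
    # Pattern matching: look for digit runs that resemble card numbers.
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            findings.append("possible card number")
    # Word scanning: case-insensitive substring match against the lexicon.
    lowered = text.lower()
    for term in sorted(SENSITIVE_TERMS):
        if term in lowered:
            findings.append(f"sensitive term: {term}")
    return findings
```

A call such as `scan(email_body)` could then be made on outgoing content before it crosses an enforcement point. Both drawbacks noted above are visible in the sketch: the pattern and the lexicon had to be written by a person who already knew what the sensitive data looks like, and neither changes unless someone edits them.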