Data Loss Prevention (DLP) is an area of computer and information security in which DLP systems identify, monitor, and protect data in use (e.g., endpoint actions), data in motion (e.g., network actions), and data at rest (e.g., data storage). Such data may be in the form of files, messages, web requests or the like. Typically, a DLP system monitors various files, messages, etc. to determine whether they constitute use-restricted documents. A use-restricted document is a document that cannot be freely distributed or manipulated due to its sensitive nature. Use-restricted documents may be marked with words such as “confidential,” “sensitive,” “stock,” etc. to indicate their sensitive nature. In addition, use-restricted documents may include confidential information such as customer, employee or patient personal information, pricing data, design plans, source code, CAD drawings, financial reports, etc.
A DLP system may determine whether a file or a message is a use-restricted document by applying a DLP policy. A DLP policy may specify what data should be present in a file or message to be classified as a use-restricted document. For example, a DLP policy may specify one or more keywords (e.g., “confidential,” “sensitive,” “stock,” or names of specific diseases such as “cancer” or “HIV”) for searching various files, messages and the like. However, rigid keyword matching is limiting because it does not account for situations in which a user misspells a word in a document, whether by mistake or intentionally to fool the DLP software. For example, “SENSITIEV” and “SENISTIVE” are both slight variations of the word “SENSITIVE.” A human reader can still understand these variations, but DLP software configured to perform conventional keyword matching cannot recognize them.
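The limitation of rigid keyword matching can be illustrated with a minimal sketch. It assumes a policy that is nothing more than a set of keywords taken from the examples above; the names KEYWORDS and is_use_restricted are hypothetical and do not describe any particular DLP product.

```python
import re

# Hypothetical policy: a plain set of keywords drawn from the examples in the text.
KEYWORDS = {"confidential", "sensitive", "stock"}


def is_use_restricted(text, keywords=KEYWORDS):
    """Return True if any policy keyword appears as a whole word in the text.

    This is conventional exact matching (case-insensitive), so a misspelled
    keyword is treated as an unrelated word.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not words.isdisjoint(keywords)


samples = [
    "This report is SENSITIVE.",
    "This report is SENSITIEV.",   # transposed letters
    "This report is SENISTIVE.",   # transposed letters
]

# Only the correctly spelled sample is flagged.
results = [is_use_restricted(s) for s in samples]
for sample, flagged in zip(samples, results):
    print(f"{sample!r}: flagged={flagged}")
```

Under these assumptions, only the first sample is classified as use-restricted; the two misspelled variants pass through unflagged even though a human reader would recognize the intended word.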