This specification relates to data leakage detection.
The unauthorized distribution of confidential information, such as trade secrets, financial information, and other sensitive information, can be guarded against by a number of security measures, such as access restrictions, password protection, and encryption techniques. While such security measures are often effective, confidential information that is subject to such measures can still be distributed inadvertently or surreptitiously. Such disclosures can be characterized as “data leaks.” For example, confidential information can be communicated by text in an e-mail message or an instant message; by attaching a document to an e-mail message; by accessing a company website over an unsecured network; and so on. Whether committed unintentionally or intentionally, the disclosure of confidential information by data leakage can cause financial harm, embarrassment, or other injury to a company or individual.
There are many different data leakage protection schemes, such as regular expression checkers that identify structured data (e.g., credit card numbers); database fingerprint matching; file matching (either complete or partial); statistical analysis; and so on. One particular protection scheme is phrase matching, which is a technique of matching regular expressions in the presence of noise words. FIG. 1 illustrates a state diagram of a phrase matching model that is configured to detect the phrase “Private And Confidential.” Normally, phrases are matched using regular expressions (e.g., w1*w2*w3, where w1, w2, and w3 are the phrase words), and other words within a noise margin are ignored. For example, matching a formatted string “<bold>Private </bold> and <bold> Confidential </bold>” would treat <bold> and </bold> as noise. Too much noise, i.e., too many words between the phrase terms, causes the state model to revert to a previous state. For example, the sentence “Private information, and requires the authentication of confidential data access privileges” includes the phrase words for “Private And Confidential.” However, the sentence includes so many noise words between the phrase words that it no longer conveys the intent of the original phrase, and it should not be treated as a match.
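The phrase matching behavior described above can be illustrated with a minimal Python sketch. It is not the model of FIG. 1; the function name `match_phrase`, the tokenizer, and the default noise margin of two tokens are assumptions made for illustration. As a further simplification, exceeding the noise margin resets the matcher to its start state rather than to some other previous state.

```python
import re

def match_phrase(text, phrase_words, noise_margin=2):
    """Return True if phrase_words occur in order in text with at most
    noise_margin noise tokens between consecutive phrase words.

    Hypothetical helper for illustration only.
    """
    targets = [w.lower() for w in phrase_words]
    if not targets:
        return False
    # Assumed tokenizer: markup tags and runs of word characters are tokens.
    tokens = re.findall(r"<[^>]+>|\w+", text.lower())
    state = 0  # number of phrase words matched so far
    noise = 0  # noise tokens seen since the last matched phrase word
    for tok in tokens:
        if tok == targets[state]:
            state += 1
            noise = 0
            if state == len(targets):
                return True
        elif state > 0:
            noise += 1
            if noise > noise_margin:
                # Too much noise: revert (assumed here: to the start state),
                # letting the current token restart the phrase if it can.
                state = 1 if tok == targets[0] else 0
                noise = 0
    return False

phrase = ["Private", "And", "Confidential"]
formatted = "<bold>Private </bold> and <bold> Confidential </bold>"
noisy = ("Private information, and requires the authentication "
         "of confidential data access privileges")
print(match_phrase(formatted, phrase))
print(match_phrase(noisy, phrase))
```

With the assumed margin of two, the formatted string matches because only one markup token separates each pair of phrase words, while the longer sentence does not match because four noise words separate “and” from “confidential.”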
Transition tables can be used to implement the state model. For a K-word phrase, however, there are K+1 forward states and K−1 noise states, which amounts to a transition table size of K². Thus, as more phrases are added, the state model grows more complex, and computational resource requirements likewise increase geometrically.
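To make the growth concrete, the following Python sketch builds one possible transition table for a single phrase. The layout is an assumption, not taken from the specification: states are labeled `("F", i)` for forward states and `("N", i)` for noise states, the input symbols are the K phrase words (assumed distinct) plus a catch-all `"<other>"` symbol, and the noise margin counter is omitted. Under this assumed layout the table has 2K(K+1) entries, which grows on the order of K².

```python
def build_transition_table(phrase_words):
    """Build a (state, symbol) -> next_state table for one phrase.

    Hypothetical layout for illustration; assumes distinct phrase words.
    """
    words = [w.lower() for w in phrase_words]
    k = len(words)
    forward = [("F", i) for i in range(k + 1)]  # K+1 forward states
    noise = [("N", i) for i in range(1, k)]     # K-1 noise states
    symbols = words + ["<other>"]               # K words plus a catch-all
    table = {}
    for kind, i in forward + noise:
        for sym in symbols:
            if i == k:
                nxt = ("F", k)       # accepting state absorbs all input
            elif sym == words[i]:
                nxt = ("F", i + 1)   # next phrase word advances the match
            elif i == 0:
                nxt = ("F", 0)       # nothing matched yet: stay at start
            else:
                nxt = ("N", i)       # any other input counts as noise
            table[(kind, i), sym] = nxt
    return table

for phrase in (["private", "and", "confidential"],
               ["a", "b", "c", "d", "e", "f"]):
    k = len(phrase)
    print(k, len(build_transition_table(phrase)), 2 * k * (k + 1))
```

A table-driven matcher following this sketch would still need a separate counter to enforce the noise margin while in a noise state.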