The present invention relates to methods for automatic categorization of internal and external communication for preventing data loss.
Data loss occurs when data that belongs to a group of people is exposed to someone outside of the group. Typically, data loss occurs when a person who has access to data passes the data (e.g. by e-mail) to another person who, by policy, should not have access to the data. The action causing the data loss may be accidental or intentional. In order to prevent such a situation, organizations typically install a data-loss prevention (DLP) system. Common DLP systems perform the following processes:
(1) classify all data in the organization;
(2) define who can access which data;
(3) monitor data flow; and
(4) block communication if data goes to an unauthorized user.
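The four processes listed above can be illustrated with a minimal sketch. The file names, addresses, labels, and the block-by-default rule for unclassified data are all hypothetical and chosen only for illustration; they do not describe any particular DLP product.

```python
from dataclasses import dataclass

# Step 1 (assumed data): every item in the organization carries a label.
CLASSIFICATION = {
    "q3_forecast.xlsx": "finance",
    "lunch_menu.txt": "public",
}

# Step 2 (assumed policy): which recipients may receive each label.
# None means that anyone may receive data with that label.
ACCESS_POLICY = {
    "finance": {"alice@corp.example", "bob@corp.example"},
    "public": None,
}


@dataclass
class Message:
    sender: str
    recipient: str
    attachment: str


def is_allowed(msg: Message) -> bool:
    """Check one message against the classification and the access policy."""
    label = CLASSIFICATION.get(msg.attachment)
    if label is None:
        # Unclassified data is blocked by default (an assumption of this sketch).
        return False
    allowed = ACCESS_POLICY[label]
    return allowed is None or msg.recipient in allowed


def monitor(messages):
    """Steps 3 and 4: inspect each message and decide whether to block it."""
    return [(m, "deliver" if is_allowed(m) else "block") for m in messages]
```

Note that the sketch can only decide about items that already appear in the classification table, so any data created after the table was built is blocked or missed until someone labels it.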
The classification process is resource- and time-consuming. In addition, it is an ongoing process, since new data is constantly being created. In some cases, it is impossible to access all data a priori, and there is no one who has sufficient privileges to even classify the data.
The problem of preventing data loss is in some ways similar to other data classification problems, such as that of preventing unwanted incoming e-mail, also known as “spam”. However, it is a distinct and separate problem from spam detection.
In the prior art, Bayesian methods are used in numerous systems to identify spam e-mail messages (e.g. POPFile). Recommind, Inc., San Francisco, Calif., has an e-mail categorization product that uses a Bayesian approach, but does not proactively prevent data-loss events. U.S. Pat. No. 7,376,618 by Anderson et al. (hereinafter Anderson '618), teaches detecting and measuring risk with predictive models using content mining. Anderson '618 uses Bayesian methods to estimate the risk of commercial transactions (i.e. fraud) while categorizing such transactions. Similar techniques are being used to perform prior-art searches automatically (see US Patent Publication No. 20080086432 by Schmidtler et al. on data classification methods using machine learning techniques).
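As a generic illustration of the Bayesian approach referred to above, the following is a minimal sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing. It is a textbook formulation and is not asserted to be the method used by any of the cited products or patents; the class name, method names, and whitespace tokenization are assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Textbook multinomial naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()                       # all words seen in training

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior: fraction of training documents with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocab)
            for w in words:
                # Log likelihood with add-one smoothing for unseen words.
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A classifier of this kind is trained on labeled examples (e.g. messages marked "spam" or "ham") and then assigns each new message the label with the highest posterior score.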
Code Green Networks, Inc., Sunnyvale, Calif., uses a Bayesian method to detect documents such as resumes, source code, and financial statements, relying on a pre-defined “dictionary” of reference terms. Preparing such a dictionary in advance is problematic for the reasons mentioned above. Reconnex Corporation, Mountain View, Calif., also uses a Bayesian method to detect document similarity as disclosed in US Patent Publication No. 20070226504 by de la Iglesia et al. for signature match processing in a document registration system.
Seo et al., in IEEE International Conference on Intelligence and Security Informatics (ISI 2006), May, 2006, pp. 117-128, discuss the use of statistical document classification in the context of access control decisions when users need to access document repositories. Titus Labs, Ottawa, ON, Canada, and Liquid Machines, Inc., Waltham, Mass., have also developed a joint DLP solution that uses document categorization.
CipherTrust, Inc. (owned by Secure Computing Corporation, San Jose, Calif.) uses machine learning to categorize/classify documents. The documents are subsequently used to enforce security policy, including for outbound e-mail. Categorization by CipherTrust does not distinguish between “internal” and “external” document classes (see U.S. Pat. No. 7,124,438 by Judge et al. on systems and methods for anomaly detection in patterns of monitored communications).
It would be desirable to have methods for automatic categorization of internal and external communication for preventing data loss that, inter alia, use statistical analysis of textual data with no a priori information to categorize messages (e.g. as “internal” or “external”) and to detect and prevent data-loss events.