The following appendix is being filed with this application, the entire contents of which are herein incorporated by reference for all purposes:
Appendix A (30 pages): "Guard-if" Document
The present invention relates generally to the field of computer data security and content-based filtering of information. In particular, the present invention relates to the application of natural language processing (NLP) and information retrieval techniques to classification of information based on its content, and controlling the distribution or dissemination of the information based on the classification.
With the widespread use of computers, an expanding telecommunication network, and the rising popularity of the Internet, an increasing amount of information is now being stored and communicated/distributed in electronic form for both personal and business purposes. Although increased connectivity has facilitated the free flow of information, it has also created data security problems for organizations and individuals who wish to prevent access to or prevent spread of sensitive information from secure domains to the non-secure outside world. In particular, communication techniques such as electronic mail (E-Mail), electronic faxes, and the like, have made the networks of these organizations susceptible to information leakage problems whereby sensitive information is transmitted to unauthorized users by processes/entities with legitimate access to the information.
For example, a corporation may be very interested in preventing the distribution of inappropriate information such as trade secrets, hate messages, indecent materials, etc., which may expose the corporation to monetary damages or adverse legal action, or even harm the corporation's reputation. Government and military organizations may be very interested in preventing leakage of sensitive information from their secure networks to the outside world. Likewise, organizations such as hospitals, banks, and credit agencies may want to prevent the dissemination of patient and client information to unauthorized users.
Traditionally, organizations have attempted to reduce information leakage by employing security personnel who manually monitor the contents of information carrying messages which originate in a secure domain and whose destination lies outside the secure domain or of messages whose sender is a member of a secure domain but whose recipient is not a member of the secure domain. The outgoing messages are allowed to leave the boundaries of the secure domain only if the contents of the outgoing messages do not violate predefined security policies for that secure domain. While this approach is effective in controlling the spread of sensitive information, it is very human resource intensive and thus very expensive.
Currently, a number of security products are available which automate the task of controlling the dissemination of information from a secure domain. These security products are designed to monitor the contents of outgoing messages passing from secure domains and flag those messages which violate security policies. These tools are commonly referred to as "boundary controllers" since they monitor the contents of outgoing messages crossing the boundary of a secure domain to the outside world. An example of such a security product is the MINEsweeper product from Integralis (Content Technologies, Inc.).
The boundary controllers described above monitor the contents of outgoing messages based on a "keyword list" or "dirty word" list. The boundary controllers are configured to flag outgoing messages which contain one or more keywords contained in the keyword list or dirty word list. This approach is lexically based and thus can be easily circumvented by using "innocent" words in the outgoing message instead of the "dirty" words. Further, since the nature of sensitive information can change dynamically, the keyword list needs to be continually updated, which is administratively cumbersome. Additionally, since the boundary controllers use simple word matching techniques, they cannot take into account that a particular "dirty" word can be used in various different contexts, not all of which should be flagged. Consequently, conventional boundary controllers are often plagued by errors and inconsistencies and as a result cannot assure information security.
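By way of illustration only, the keyword-list approach described above may be sketched as follows. The word list, the function name flag_by_keywords, and the sample messages are hypothetical and are not taken from any particular boundary controller product. The sketch shows both weaknesses noted above: a harmless use of a listed word is flagged, while a paraphrase that avoids the listed words is not.

```python
import re

# Hypothetical dirty-word list, chosen only for this illustration.
DIRTY_WORDS = {"secret", "classified", "merger"}


def flag_by_keywords(message, keywords=DIRTY_WORDS):
    """Return True if any listed keyword occurs as a whole word in the message."""
    tokens = set(re.findall(r"[a-z0-9']+", message.lower()))
    return not tokens.isdisjoint(keywords)


# A harmless sentence is flagged because it contains a listed word.
print(flag_by_keywords("The secret to good coffee is fresh beans."))

# A paraphrase conveying sensitive information avoids every listed word.
print(flag_by_keywords("We will join with Acme next quarter; keep this quiet."))
```

The first call reports a match and the second does not, even though only the second message concerns sensitive subject matter.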
Thus, there is a need for a system and method which can provide greater information security than that offered by prior art techniques.
The present invention describes a system, method, and computer program for controlling distribution of a message from a secure domain to a destination outside the secure domain. According to an embodiment, the present invention constructs semantic models for a plurality of message categories and for outgoing messages. The semantic model of an outgoing message is then compared with the semantic models of the plurality of message categories and the outgoing message is classified based on the comparison. The present invention then uses the classification information for the message to determine if the message can be distributed outside the secure domain.
According to an embodiment, the present invention compares the semantic model of the message with the semantic models for the plurality of message categories and determines a degree of similarity between the semantic model of the message and the semantic model for each message category in the plurality of message categories. A message is classified as belonging to a message category if the degree of similarity between the semantic model of the message and the semantic model of the message category exceeds a threshold degree of similarity. The threshold degree of similarity may be user-defined.
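The threshold-based classification described above may be sketched, under stated assumptions, as follows. The summary does not specify the form of the semantic model or the similarity measure; this sketch substitutes a simple term-frequency vector and cosine similarity as stand-ins. The names build_model, similarity, classify, and CATEGORY_MODELS, the category contents, and the default threshold of 0.3 are all hypothetical.

```python
import math
import re
from collections import Counter


def build_model(text):
    """Stand-in 'semantic model': a term-frequency vector (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(message, category_models, threshold=0.3):
    """Return each category whose similarity to the message exceeds the threshold."""
    model = build_model(message)
    scores = {name: similarity(model, cat) for name, cat in category_models.items()}
    return sorted(name for name, score in scores.items() if score > threshold)


# Hypothetical category models built from a few representative terms.
CATEGORY_MODELS = {
    "financial": build_model("quarterly revenue earnings forecast budget revenue"),
    "medical": build_model("patient diagnosis treatment prescription patient"),
}

print(classify("The patient diagnosis and treatment plan are attached.", CATEGORY_MODELS))
print(classify("Lunch on Friday?", CATEGORY_MODELS))
```

In this sketch a message that exceeds the threshold for no category yields an empty list, which corresponds to a message left unclassified. Raising or lowering the threshold argument plays the role of the user-defined threshold degree of similarity.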
According to another embodiment, the present invention determines if the message can be distributed to a recipient outside the secure domain by determining if the message violates a security policy. The present invention may determine a security clearance level for the sender of the message, the recipient, and for the message category to which the message was classified. The present invention may indicate that the message violates the security policy if the security clearance level of the sender or recipient is lower than the security clearance level of the message category. In case of a security policy violation, the present invention may prevent distribution of the message to the recipient. Messages which do not violate any security policies may be forwarded to the recipient.
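The clearance comparison described above may be sketched as follows. The clearance levels, the lookup tables, the addresses, and the names Clearance, violates_policy, and dispatch are hypothetical. The sketch also assumes that a party with no recorded clearance is treated as having the lowest level, which the summary does not state.

```python
from enum import IntEnum


class Clearance(IntEnum):
    """Hypothetical ordered clearance levels; a higher value means more access."""
    PUBLIC = 0
    CONFIDENTIAL = 1
    SECRET = 2


# Hypothetical lookup tables for users and message categories.
USER_CLEARANCE = {
    "alice@corp.example": Clearance.SECRET,
    "bob@partner.example": Clearance.CONFIDENTIAL,
}
CATEGORY_CLEARANCE = {
    "medical": Clearance.CONFIDENTIAL,
    "financial": Clearance.SECRET,
}


def violates_policy(sender, recipient, category):
    """True if the sender or the recipient is cleared below the category's level."""
    required = CATEGORY_CLEARANCE[category]
    # Assumption: unknown parties default to the lowest clearance level.
    sender_level = USER_CLEARANCE.get(sender, Clearance.PUBLIC)
    recipient_level = USER_CLEARANCE.get(recipient, Clearance.PUBLIC)
    return sender_level < required or recipient_level < required


def dispatch(sender, recipient, category):
    """Block the message on a policy violation; otherwise forward it."""
    return "blocked" if violates_policy(sender, recipient, category) else "forwarded"


print(dispatch("alice@corp.example", "bob@partner.example", "medical"))
print(dispatch("alice@corp.example", "bob@partner.example", "financial"))
```

Here the first message is forwarded because both parties meet the level required for its category, and the second is blocked because the recipient's level is lower than that of the category.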
According to another embodiment of the present invention, information about an unclassified message is presented to the user via a graphical user interface to facilitate manual classification. The graphical user interface allows a user to manually classify the message. The graphical user interface may also allow the user to indicate if the message violates a security policy.
According to yet another embodiment of the present invention, manually classified messages may be forwarded to a machine learning module which compares the semantic representations of the manually classified message and the message category to which the message was manually classified. The semantic model of the message category may be updated based on the comparison.
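One possible form of such an update may be sketched as follows. The summary does not specify the update rule; this sketch assumes a term-frequency vector as the semantic model, cosine similarity as the comparison, and a simple additive blend applied only when the manually classified message would not have matched the category automatically. The names build_model, similarity, and update_category, the threshold, and the blending rate are hypothetical.

```python
import math
import re
from collections import Counter


def build_model(text):
    """Stand-in 'semantic model': a term-frequency vector (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def update_category(category_model, message_model, threshold=0.3, rate=0.5):
    """Return an updated copy of the category model.

    If the message already exceeds the threshold, the model is copied unchanged.
    Otherwise the message's term weights, scaled by rate, are added to the model.
    """
    updated = Counter(category_model)
    if similarity(message_model, category_model) > threshold:
        return updated
    for term, weight in message_model.items():
        updated[term] += rate * weight
    return updated


# A message that a user manually placed in the hypothetical 'medical' category.
medical = build_model("patient diagnosis treatment prescription")
missed = build_model("biopsy results for the oncology ward")

before = similarity(missed, medical)
medical = update_category(medical, missed)
after = similarity(missed, medical)
print(before, after)
```

After the update, the manually classified message scores higher against the category model than before, so similar messages are more likely to be classified automatically in the future.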
The invention will be better understood by reference to the following detailed description and the accompanying figures.