With the rapid growth of and advances in digital documentation services and document management systems, organizations are increasingly storing important, confidential, and secure information in the form of digital documents. Unauthorized dissemination of this information, whether accidental or deliberate, presents serious security risks to these organizations. Therefore, it is imperative for the organizations to protect such secure information and to detect and react to any attempt to disclose it beyond the perimeters of the organization.
Additionally, the organizations face the challenge of categorizing and maintaining the large corpus of digital information across potentially thousands of data stores, content management systems, end-user desktops, etc. It is therefore important to the organization to be able to store concise and lightweight versions of fingerprints corresponding to the vast amounts of image data. One solution to this challenge is to generate fingerprints from all of the digital information that the organization seeks to protect. These fingerprints tersely and securely represent the organization's secure data, and can be maintained in a database for later verification against the information that a user desires to disclose. When the user wishes to disclose any information outside of the organization, fingerprints are generated for the user's information, and these fingerprints are compared against the fingerprints stored in the fingerprint database. If the fingerprints of the user's information match fingerprints contained in the fingerprint server, suitable security actions are performed.
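The register-then-compare flow described above can be illustrated with a minimal sketch. The text does not specify how fingerprints are generated, so the sketch assumes, purely for illustration, that a fingerprint is a SHA-256 digest of each overlapping window of four lowercased words. The names `fingerprints`, `fingerprint_db`, and `matches_protected`, the window size, and the sample document are all hypothetical.

```python
import hashlib

def fingerprints(text, window=4):
    """Return a set of SHA-256 hex digests, one per overlapping window of words.

    Assumption: word-window hashing stands in for whatever fingerprinting
    scheme the organization actually uses.
    """
    words = text.lower().split()
    if len(words) < window:
        # Short texts yield a single fingerprint (or none if the text is empty).
        spans = [" ".join(words)] if words else []
    else:
        spans = [" ".join(words[i:i + window])
                 for i in range(len(words) - window + 1)]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in spans}

# Hypothetical protected document, registered ahead of time.
fingerprint_db = fingerprints(
    "the quarterly merger plan is strictly confidential until release"
)

def matches_protected(outgoing_text, db, threshold=1):
    """True if at least `threshold` fingerprints of the outgoing text are in db."""
    return len(fingerprints(outgoing_text) & db) >= threshold
```

Only the digests are stored, not the protected text itself, which is what keeps the stored representation terse. A caller at an egress point would run `matches_protected` on the outgoing content and trigger a security action when it returns `True`.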
However, the user has at his disposal myriad options to disclose the information outside of the organization's protected environment. For example, the user could copy the digital information from his computer to a removable storage medium (e.g., a floppy drive, a USB storage device, etc.), or the user could email the information from his computer through the organization's email server, or the user could print out the information by sending a print request through the organization's print server, etc.
Additionally, in many organizations, sensitive data is stored in databases, including account numbers, patient IDs, and other well-formed, or “structured”, data. The amount of this structured data can be enormous, and the ease of unwanted distribution across the egress points creates security problems for organizations.
The exact data match problem can be thought of as a massive, multi-keyword search problem. Methods for exact keyword match include Wu-Manber and Aho-Corasick. However, these methods are disadvantageous because they do not scale beyond several thousand keywords in space or time.
Full-blown databases can be employed for exact data matches, but they do not scale down to agents residing on laptops. There are also security concerns with directly duplicating all of the confidential cell data within an organization.
A more general approach can be taken where the pattern of each category of structured data is inferred and searched via regular expressions or a more complex entity extraction technique. However, without the actual values being protected, this approach would lead to many false positives.
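The false-positive problem can be seen in a small sketch. It assumes, hypothetically, that account numbers are eight-digit strings; the pattern, the sample values, and the function names are illustrative only and are not taken from any particular system.

```python
import re

# Assumption: protected account numbers are exactly eight digits.
ACCOUNT_PATTERN = re.compile(r"\b\d{8}\b")

# Hypothetical protected values. Raw values are kept here only to illustrate
# the comparison; storing them directly raises the duplication concern
# noted earlier.
protected_accounts = {"40017752", "40019981"}

def pattern_hits(text):
    """Every substring that merely looks like an account number."""
    return ACCOUNT_PATTERN.findall(text)

def exact_hits(text):
    """Only the pattern hits that are actual protected values."""
    return [m for m in pattern_hits(text) if m in protected_accounts]

message = "Order 20240115 shipped; tracking 55512345; account 40017752."
```

On this message, `pattern_hits` flags three strings (a date, a tracking number, and the account number), while `exact_hits` flags only the one real account number. Pattern inference alone cannot tell these apart.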