The present invention relates to methods for document-to-template matching for Data-Leak Prevention (DLP, also referred to as Data-Loss Prevention).
Protecting corporate intellectual property has become a major concern for many IT (information technology) departments. Organizations are concerned with protecting patents, trademarks, brands, copyrights, trade secrets, and other corporate assets. Today, most corporate information exists in electronic form, potentially accessible to almost any employee. Furthermore, e-mail has become a ubiquitous means of retaining and exchanging such information, making the control of document transmission and distribution even more imperative. Accidental (or intentional) disclosure of confidential information can result in legal damages and/or loss of competitive edge for a company.
The problems facing DLP pose a challenge with regard to how exactly to classify and identify outbound documents. The methods used today involve brute-force fingerprinting of the entire body of corporate data in order to classify each document. The problems with such methods include the following.
(1) The data needs to be stored in a central database. The process of storing and maintaining a large amount of information is time-consuming. This also creates situations in which restricted data can be exposed en masse to internal personnel.
(2) New documents that do not pass through such a fingerprinting mechanism can still be distributed without being properly classified as sensitive documents.
(3) For security reasons, some corporate documents may not be accessible to such a fingerprinting mechanism, which introduces a further security vulnerability in such methods.
Various DLP solutions in the prior art perform aspects of file and paragraph fingerprinting for preventing internal data leakage. Equivio Inc., Kensington, Md., provides an Equivio>NearDuplicates product which detects and groups near-duplicate files, mainly in order to reduce storage usage. The Equivio product relies on algorithms that look for the number of sequential word pairs.
Proofpoint Inc., Sunnyvale, Calif., provides a Digital Asset Security™ module for enabling multiple category document protection: Categories can be defined for different types of documents to secure, each with different access controls and properties. For example, one can create separate categories for internal memos, draft press releases, organizational charts, and price lists. Each category can have its own properties (such as default time after which documents expire) and document similarity-matching thresholds.
Websense Inc., San Diego, Calif., provides a PreciseID™ fingerprinting technology, using a template/boilerplate fingerprint, that improves the accuracy of detection by accounting for false similarity and screening out commonly-recurring text in similar documents, including boilerplate, disclaimers, template descriptions, forms, and contract terms. The technology employs filters to account for “templated” content in order to reduce the false positives associated with basic identification techniques, which often stumble over templated content. This technology uses document templates only to exclude content from being tagged as a data leak.
Glass et al., in US Patent Publication No. 20050060643, disclose a document similarity detection and classification system for spam detection. The system involves manual annotation of “chunks” of a document to point out the salient ones.
Aiken, in U.S. Pat. No. 6,240,409, mentions a method based on a procedure known as document fingerprinting. Fingerprinting a document involves computing hashes of selected substrings in the document. The particular set of substring hashes chosen to represent a document is the document's fingerprint. The similarity of two documents is defined as a ratio C/T, where C is the number of hashes the two documents have in common and T is the total number of hashes taken of one of the documents. Assuming a well-behaved hash function, this ratio is a good estimate of the actual percentage overlap between the two documents. However, this also assumes that a sufficient number of substring hashes are used. Various approaches have been used in determining which substrings in a document are selected for hashing and which of these substring hashes are saved as part of the document fingerprint.
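The C/T ratio described above can be illustrated with a minimal Python sketch. It is not the method of the cited patent: the function names, the use of fixed-length character substrings (k-grams), the CRC-32 hash, and the modulus-based selection rule are all assumptions made purely for illustration. The sketch also counts distinct hashes only, which is one possible reading of "total number of hashes."

```python
import zlib


def fingerprint(text, k=5, modulus=1):
    """Return a set of selected k-character substring hashes (hypothetical helper).

    Assumptions for illustration: substrings are overlapping k-grams of the
    whitespace-normalized, lower-cased text; the hash is CRC-32; a hash is
    kept only if it is divisible by `modulus` (modulus=1 keeps every hash).
    """
    normalized = " ".join(text.lower().split())
    selected = set()
    for i in range(len(normalized) - k + 1):
        h = zlib.crc32(normalized[i:i + k].encode("utf-8"))
        if h % modulus == 0:
            selected.add(h)
    return selected


def similarity(doc_a, doc_b, k=5, modulus=1):
    """Return C/T, where C is the number of hashes the two fingerprints share
    and T is the number of (distinct) hashes in the fingerprint of doc_a."""
    fp_a = fingerprint(doc_a, k, modulus)
    fp_b = fingerprint(doc_b, k, modulus)
    if not fp_a:
        return 0.0  # doc_a is shorter than k, or no hash was selected
    return len(fp_a & fp_b) / len(fp_a)
```

Because T is taken from only one of the two documents, the ratio is asymmetric: in this sketch, a short document fully contained in a longer one scores 1.0 against it, while the longer document scores lower against the short one. Raising the hypothetical `modulus` parameter keeps fewer hashes per document, which shrinks the fingerprint at the cost of a coarser estimate, reflecting the caveat above that a sufficient number of substring hashes must be used.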
It would be desirable to have methods for document-to-template matching for DLP. Such methods would, among other things, overcome the limitations of the prior art mentioned above.