1. Field
This disclosure is generally related to content detection systems. More specifically, this disclosure is related to enhancing the performance of a cut-and-paste attack detection system by establishing a non-sensitive-passage database.
2. Related Art
To safeguard a company's sensitive information, such as trade secrets and unreleased financial reports, an automated system is often installed to monitor outgoing emails from the company's corporate email accounts in order to detect cut-and-paste attacks. Such attacks occur when sensitive material is “cut” out of one document and “pasted” into another. By recognizing sensitive materials included in the outgoing emails, whether accidentally or intentionally, the cut-and-paste detection system is able to flag emails that contain sensitive materials.
For the cut-and-paste attack detection system to function properly, it needs to be trained beforehand so that it can recognize sensitive materials. To train the detection system, a system administrator, or a person in the company responsible for detecting such attacks, provides the system with a number of sensitive documents as training documents. Because there is often no indication of which parts of the training documents are sensitive or why they are sensitive, the system fingerprints (for example, by generating hash values) the training documents in their entirety, paragraph by paragraph, or sentence by sentence, and stores the resulting fingerprints.
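The training step described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular detection system: the choice of SHA-256, the paragraph-level granularity, the whitespace and case normalization, and all function names (normalize, fingerprint, fingerprint_document, build_fingerprint_store) are assumptions introduced for this example.

```python
import hashlib
import re


def normalize(passage):
    """Collapse runs of whitespace and lowercase the passage (assumed normalization)."""
    return re.sub(r"\s+", " ", passage).strip().lower()


def fingerprint(passage):
    """Fingerprint one passage by hashing its normalized text (SHA-256 is an assumption)."""
    return hashlib.sha256(normalize(passage).encode("utf-8")).hexdigest()


def fingerprint_document(document):
    """Fingerprint a document paragraph by paragraph (paragraphs split on blank lines)."""
    paragraphs = [p for p in re.split(r"\n\s*\n", document) if p.strip()]
    return {fingerprint(p) for p in paragraphs}


def build_fingerprint_store(training_documents):
    """Fingerprint every paragraph of every training document.

    Every paragraph is stored, because nothing indicates which
    parts of a training document are actually sensitive.
    """
    store = set()
    for document in training_documents:
        store |= fingerprint_document(document)
    return store
```

Note that the store makes no distinction between sensitive passages and boilerplate that happens to appear in a sensitive document; both are fingerprinted alike.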
During operation, the system compares the fingerprints of an outgoing email with the stored document fingerprints to detect sensitive materials contained in the email. Using this technique, the system can effectively detect any paragraphs, or passages of a given length, that are pasted into an outgoing email from the sensitive documents, because all passages in the sensitive documents are treated as sensitive. However, such an approach has several drawbacks. For example, boilerplate in the training documents (e.g., the company logo, the URL of the company website, and standard “legalese” stating that the company is a privately held entity) will always trigger the cut-and-paste attack detection system to flag an outgoing email. Such flagging is unnecessary and can consume a great amount of company resources, because flagged emails often require manual inspection by the system administrator to determine whether they are safe to send. In addition, without knowing which parts of the sensitive documents are sensitive, the system cannot detect other potentially sensitive documents or paragraphs. What is needed is a system that can accurately distinguish non-sensitive passages, such as boilerplate, from sensitive passages within a sensitive document, thus preventing unnecessary flagging of emails that contain only non-sensitive passages.
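The matching step and its drawbacks can be illustrated with a short sketch. Everything here is hypothetical and chosen only for illustration: the company name, the sample passages, the function names (fingerprints, is_flagged), and the use of SHA-256 over whitespace-normalized paragraphs.

```python
import hashlib
import re


def fingerprints(text):
    """Return SHA-256 fingerprints of the normalized paragraphs of a text."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        hashlib.sha256(" ".join(p.split()).lower().encode("utf-8")).hexdigest()
        for p in paragraphs
    }


def is_flagged(email_body, stored_fingerprints):
    """Flag an email if any of its paragraphs matches a stored fingerprint."""
    return bool(fingerprints(email_body) & stored_fingerprints)


# Hypothetical training document: one sensitive paragraph plus boilerplate.
BOILERPLATE = "Acme Corp is a privately held entity."
training_document = "Unreleased Q3 revenue fell 12%.\n\n" + BOILERPLATE
stored = fingerprints(training_document)

# A pasted sensitive paragraph is flagged, as intended.
leak = "FYI:\n\nUnreleased Q3 revenue fell 12%."

# An email containing only boilerplate is also flagged (unnecessary flagging).
harmless = "See you at lunch.\n\n" + BOILERPLATE

# A reworded version of the sensitive paragraph goes undetected.
paraphrase = "Revenue for Q3, not yet released, dropped by 12%."

for name, body in [("leak", leak), ("harmless", harmless), ("paraphrase", paraphrase)]:
    print(name, is_flagged(body, stored))
```

Under these assumptions, the harmless email is flagged solely because it shares the boilerplate paragraph with a sensitive training document, while the paraphrased email is not flagged because its fingerprint matches nothing in the store.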