The present invention relates generally to detection of documents that have been duplicated, perhaps without authority, and more particularly to post-facto duplication detection on a public network.
The need to analyze outgoing electronic traffic (“exit traffic analysis”) has been underemphasized by the electronic networking community. Given the ever-growing number of sensitive-data leakage incidents in enterprises, which expose hundreds of millions of people to sensitive-information theft every year, there is a need to develop new exit traffic analysis techniques for data leakage detection and prevention.
Exit traffic analysis to detect data leakage is used in two different ways: a) to prevent leakage and b) to detect leakage after it has occurred (“post-facto”). An important goal of data leakage prevention is to develop a mechanism that will prevent any unauthorized user or process from improperly “leaking” any one of a given set of pre-identified sensitive documents. An important goal of post-facto leakage detection is to develop a mechanism that will determine which sensitive data has already leaked from the enterprise and is publicly available, for example, on the Internet.
The need for post-facto leakage detection, a major focus of this invention, is based on at least two observations. First, large amounts of sensitive information are currently publicly available on the Internet, often without the knowledge of the subject or owner of the sensitive information. For example, in March 2006, Gratis Internet Company collected personal data of 7 million Americans and sold it to third parties. With so much sensitive data available in the public domain, it would be advantageous for the subjects of the sensitive information to have a means for detecting which sensitive information is available and where it may be accessed. Second, no data leakage prevention strategy is perfect, given the wide range of possible leakage channels, some of which are outside the scope of any prevention strategy. This heightens the need for post-facto leakage detection.
A common method for facilitating post-facto leakage detection is to use watermarking. Watermarking generally involves modifying a document in some way to make the document more distinguishable than it was before the watermarking. These modifications may be either visible or invisible to an observer. The watermark is then used to detect a document that has been improperly leaked.
While watermarking does help to distinguish a document, the technique has several weaknesses. First, since watermarking involves adding something to a document, this technique requires recognizing, before a leakage occurs, that a document needs to be watermarked. If the sensitivity of the information is only discovered after the leakage occurs, watermarking will not be an option for post-facto detection, because the document will have been leaked before it could be watermarked. Second, watermarking is subject to tampering. A malicious party who seeks to make pirated information indistinguishable may be able to remove an added watermark. At the root of this second weakness is the fact that a watermark is added onto an original document in some way. Since the watermark is a “separate entity” from the data comprising a document, it can be identified and removed, defeating its purpose.
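The second weakness can be illustrated with a minimal sketch. It assumes a purely hypothetical invisible watermark that appends zero-width Unicode characters to a text document; the functions `embed_watermark`, `extract_watermark`, and `strip_watermark` and the tag value are illustrative assumptions and do not describe any particular watermarking product. Because the mark is separate from the document's content, a single filtering pass removes it while leaving the visible text unchanged.

```python
# Hypothetical invisible watermark: two zero-width characters encode bits.
ZERO = "\u200b"  # zero-width space, stands for bit 0 (assumed encoding)
ONE = "\u200c"   # zero-width non-joiner, stands for bit 1 (assumed encoding)


def embed_watermark(text: str, tag: str) -> str:
    """Append the tag, encoded as invisible characters, to the text."""
    bits = "".join(f"{byte:08b}" for byte in tag.encode("utf-8"))
    mark = "".join(ONE if bit == "1" else ZERO for bit in bits)
    return text + mark


def extract_watermark(text: str) -> str:
    """Recover the tag from any zero-width characters present in the text."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


def strip_watermark(text: str) -> str:
    """What a tamperer might do: drop every character used by the mark."""
    return "".join(ch for ch in text if ch not in (ZERO, ONE))


marked = embed_watermark("Quarterly results", "ACME-42")
leaked = strip_watermark(marked)
print(repr(extract_watermark(marked)))  # tag is recoverable from the marked copy
print(repr(extract_watermark(leaked)))  # tag is gone after stripping
```

In this sketch the stripped copy reads identically to the original, yet it no longer carries anything that ties it back to its source.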
Accordingly, an improved method that goes beyond traditional watermarking strategies is needed for detecting the post-facto leakage of sensitive information into a public domain, such as the Internet. The method should be tamper-resistant, meaning that the sensitive electronic document should remain detectable even if it has been partially modified. Additionally, since leakage may occur before a watermark or other unique identifier has been added to the document, it is advantageous to have a detection mechanism that does not require any modification to the sensitive document. Furthermore, since information in the public domain may be presented statically or dynamically, the method should be versatile, in that it is able to detect the sensitive information whichever way it is being presented.
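As a rough illustration of what the first two requirements imply, and not as a description of the mechanism claimed by this invention, the following sketch derives a fingerprint from the document's own words and compares it with a candidate text found in public. The technique shown is ordinary word shingling with a containment score; the function names, the shingle size `k=3`, and the sample strings are all assumptions made for the example. Nothing is added to the sensitive document, and a partially modified copy still scores highly.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word sequences (shingles) in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(sensitive: str, candidate: str, k: int = 3) -> float:
    """Fraction of the sensitive document's shingles found in the candidate."""
    source = shingles(sensitive, k)
    if not source:
        return 0.0
    return len(source & shingles(candidate, k)) / len(source)


# Hypothetical sample texts, invented for this illustration.
original = ("the merger with Initech will close on the first of March "
            "pending board approval")
tampered = ("BREAKING: the merger with Initech will close on the first of "
            "April pending board approval, sources say")
unrelated = "weather forecast for tomorrow is sunny"

print(containment(original, tampered))   # high despite the edits
print(containment(original, unrelated))  # no overlap
```

A deployment would also need a threshold for deciding when a score counts as a leak, and a way to fetch statically and dynamically presented content; both are outside the scope of this sketch.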