Field
The field of invention relates to systems and methods for efficient and accurate detection of fingerprint information.
Description of the Related Art
Information and knowledge created and accumulated by organizations and businesses are, in many cases, their most valuable assets. Unauthorized dissemination of intellectual property, financial information and other confidential or sensitive information can significantly damage a company's reputation and competitive advantage. In addition, individuals' private information inside organizations, as well as private information of clients, customers and business partners, may include sensitive details that can be abused by users with criminal intentions.
Apart from the damage to business secrecy and reputation, regulations within the US and abroad impose substantial legal liabilities for information leakage. Regulations such as the Health Insurance Portability and Accountability Act (HIPAA), the Gramm-Leach-Bliley Act (GLBA) and the privacy-protecting laws of various states and nations imply that the information assets within organizations should be monitored and subjected to an information protection policy in order to protect clients' privacy and to mitigate the risks of potential misuse and fraud.
A file may be divided into fragments. A subset of the hashes of these fragments may then be used as “fingerprints” of the document. A file may be divided into fragments in one of two ways: 1) division and 2) phrasing. “Division” comprises dividing the file into subsequences of n consecutive items, each known as an n-gram. The n-grams may overlap (a condition known as “shingling”). N-grams may be generated by applying a “sliding window” over the text. Each “window” comprises a given number of characters or words, and a hash-value is calculated from the content of each “window”.
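The sliding-window division described above can be sketched as follows. This is a minimal illustration only: the function name `ngram_hashes`, the word-level window, the default window size of 4, and the use of a truncated SHA-1 digest as the hash-value are all assumptions made for the example and are not specified in this description.

```python
import hashlib


def ngram_hashes(text: str, n: int = 4) -> list[int]:
    """Hash every overlapping window of n words (word-level shingling).

    Hypothetical helper: the window size and the hash function are
    illustrative choices, not requirements of the described technique.
    """
    words = text.split()
    hashes = []
    # Slide the window one word at a time, so consecutive n-grams overlap.
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        digest = hashlib.sha1(window.encode("utf-8")).digest()
        # Keep the first 8 bytes of the digest as a 64-bit integer hash-value.
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes
```

A text of k words yields k - n + 1 hash-values when k is at least n, and none otherwise.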
“Phrasing” comprises dividing the content into phrases, using separators such as commas, semicolons or sentence boundaries. A hash-value is calculated from the content of each phrase. The set of hashes may thereafter be post-selected, or “diluted”, in order to reduce storage and enhance performance, by retaining only those hash-values that are divisible by a certain integer p. For example, if p=5, then, on average, one-fifth of the hashes will be selected.
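Phrasing and dilution may be illustrated by the following sketch. The names `phrase_hashes` and `dilute`, the particular separator set, and the truncated SHA-1 hash are hypothetical choices made for the example; only the rule of keeping hash-values divisible by p is taken from the description above.

```python
import hashlib
import re


def phrase_hashes(text: str, separators: str = r"[,;.!?]") -> list[int]:
    """Split text into phrases at the separators and hash each phrase.

    The separator pattern is an assumed example, not a prescribed set.
    """
    phrases = [p.strip() for p in re.split(separators, text)]
    hashes = []
    for phrase in phrases:
        if not phrase:
            continue  # skip empty fragments between adjacent separators
        digest = hashlib.sha1(phrase.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def dilute(hashes: list[int], p: int = 5) -> list[int]:
    """Keep only the hash-values that are divisible by the integer p."""
    return [h for h in hashes if h % p == 0]
```

Because the selection depends only on the hash-value itself, the same phrase is either always kept or always dropped, so two documents that share a phrase still agree on whether its hash appears in their fingerprints.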
To assess the similarity level between two texts (i.e., documents), each text is first canonized, that is, brought into a standard format used by the detection system. Canonization may include, for example, converting the textual content to lowercase Unicode letters and removing common words (also known as “stopwords”), such as “the” and “is”, as well as other “noise”. Additionally, “stemming” may be performed, which comprises reducing inflected (or sometimes derived) words to their stem, base or root form.
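A canonization step of the kind described above might look like the following sketch. The function name `canonize` and the short stopword list are assumptions made for illustration; stemming is omitted. Note that this version also discards punctuation, which suits n-gram division but would have to be applied after splitting if phrasing is used.

```python
import re

# Illustrative stopword list only; a real system would use a fuller list.
STOPWORDS = frozenset({"the", "is", "a", "an", "of", "and"})


def canonize(text: str) -> str:
    """Lowercase the text, drop punctuation and remove stopwords."""
    lowered = text.lower()
    # \w+ matches runs of Unicode letters, digits and underscores,
    # so punctuation and extra whitespace are discarded here.
    tokens = re.findall(r"\w+", lowered)
    kept = [t for t in tokens if t not in STOPWORDS]
    return " ".join(kept)
```

For instance, `canonize("The report IS final.")` returns `"report final"`.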
A similarity measure is used to compare two fingerprints of canonized texts. One similarity measure is the Jaccard similarity measure, which defines the similarity between documents A and B as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
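Where the intersection |A ∩ B| is defined by the number of hashes the fingerprints of the two documents have in common, and the union |A ∪ B| is the number of distinct hashes that appear in at least one of the two fingerprints.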
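The Jaccard similarity measure can be computed directly from two sets of hash-values, as in the sketch below. The function name `jaccard` is hypothetical, and returning 0.0 when both fingerprints are empty is an assumed convention, since the ratio is undefined in that case.

```python
from typing import Iterable


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two fingerprints (collections of hashes)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty fingerprints have similarity 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)
```

For example, the fingerprints {1, 2, 3} and {2, 3, 4} share two hashes out of four distinct ones, giving a similarity of 0.5.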
However, Applicants have recognized that for at least the reason that fingerprint size is proportional to the size of the fingerprinted content, fingerprinting large amounts of content, in a manner that will facilitate robust identification, requires an allocation of considerable memory resources. It is generally hard to maintain a large repository in the readily available Random Access Memory (RAM). The detection process may also require expensive accesses to disk storage. These memory requirements hamper performance and the problem is particularly apparent when employing fingerprint-based detection at endpoints, such as laptops and desktops.
The present embodiments contemplate novel methods and systems for efficient detection of fingerprinted information, which overcome the drawbacks and inefficiencies of the current methods described above.