With rapid advances in digital documentation capabilities and document management systems, organizations are increasingly storing important, confidential, and secure information in the form of digital documents. Unauthorized dissemination of this information, whether accidental or deliberate, presents serious security risks to these organizations. Therefore, it is imperative for the organizations to protect such secure information, and to detect and react to any disclosure of secure information (or derivatives thereof) beyond the perimeters of the organization.
Additionally, the organizations face the challenge of categorizing and maintaining the large corpus of digital information across potentially thousands of data stores, content management systems, end-user desktops, etc. It is therefore valuable to the organization to be able to identify and disregard redundant information within this vast database. At the same time, it is critical to the organization's security to be able to identify derivative forms of the secure data (e.g., changes to the sentence structure or word ordering at the sentence/paragraph level, use of comparable words in the form of synonyms/hypernyms, varied usage of punctuation, etc.) and to identify any unauthorized disclosure of even such derivative forms. Therefore, any system or method built to accomplish the task of preventing unauthorized disclosure would have to address these two conflicting challenges.
One method of detecting similar data is to examine the database at the file level. This can be done by comparing the file names, by comparing the file sizes, or by computing a checksum of the contents of each file. However, even a minor difference between two files is enough to evade such a detection method.
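The limitation of file-level comparison can be illustrated with a minimal sketch. The sketch assumes SHA-256 as the checksum; the function name `file_checksum` and the sample strings are hypothetical and chosen only for illustration.

```python
import hashlib


def file_checksum(data: bytes) -> str:
    """Return a hex checksum of the full file contents (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file contents: an exact copy and a copy with one character changed.
original = b"The quarterly results are confidential."
exact_copy = b"The quarterly results are confidential."
edited_copy = b"The quarterly results are confidential!"

# An exact duplicate produces the same checksum and is detected.
exact_match = file_checksum(original) == file_checksum(exact_copy)

# A single-character edit produces a different checksum and goes undetected.
near_match = file_checksum(original) == file_checksum(edited_copy)

print("exact copy detected:", exact_match)
print("edited copy detected:", near_match)
```

A checksum treats the file as an indivisible unit, so it can confirm exact duplicates but carries no notion of partial similarity between two files.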
Other prior art solutions teach partial text matching methods using various k-gram approaches. In such approaches, sequences of text characters of a fixed length, called k-grams, are selected from the secure text. Each k-gram is hashed into a number called a fingerprint. In order to increase storage and resource efficiency, the various prior art approaches propose different means by which the k-grams can be sampled, so as to store only a representative subset of the k-grams. However, these prior art approaches suffer from a number of disadvantages. For example, these prior systems are not robust against derivative works of the secure text. Additionally, the k-gram approaches are not suitable for use in multi-language environments (e.g., a document containing a mixture of Mandarin and English words). Also, using a character-based approach as opposed to a word-based approach does not allow for the exclusion of common or repeated words, thus resulting in overall memory and resource inefficiencies.
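The general shape of a character-based k-gram approach may be sketched as follows. This is an illustrative sketch only and does not reproduce any particular prior art system: the sampling rule (keep a fingerprint only when it is divisible by `p`), the use of a truncated SHA-1 digest as the hash, the default values of `k` and `p`, and all function names are assumptions made for the example.

```python
import hashlib


def kgrams(text: str, k: int) -> list[str]:
    """Return every contiguous run of k characters in the text."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def fingerprint(gram: str) -> int:
    """Hash one k-gram into an integer (first 8 bytes of SHA-1; an assumption)."""
    digest = hashlib.sha1(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sampled_fingerprints(text: str, k: int = 5, p: int = 4) -> set[int]:
    """Keep only fingerprints divisible by p, one possible sampling rule."""
    return {fp for fp in map(fingerprint, kgrams(text, k)) if fp % p == 0}


def overlap(secure: set[int], candidate: set[int]) -> float:
    """Fraction of the secure text's stored fingerprints found in the candidate."""
    if not secure:
        return 0.0
    return len(secure & candidate) / len(secure)


# Hypothetical secure text and a derivative with the word order changed.
secure_text = "the cat sat on the mat"
reordered_text = "on the mat the cat sat"

# p=1 disables sampling so that every k-gram is compared.
secure_fps = sampled_fingerprints(secure_text, k=5, p=1)
reordered_fps = sampled_fingerprints(reordered_text, k=5, p=1)

print("overlap with itself:", overlap(secure_fps, secure_fps))
print("overlap with reordered text:", overlap(secure_fps, reordered_fps))
```

Because each k-gram is a run of adjacent characters, any k-gram that spans a point where the words were rearranged no longer appears in the derivative text, so the overlap falls below that of an exact copy even though every word is preserved.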