In the last four years, organizations have experienced a massive increase in the number of computers, network endpoints, and smart devices that connect to the organizational network. The increase in the volume of data and the variety of data formats, combined with the growing sophistication of the methods by which data is stolen from within the organization, has become a substantial challenge for companies and their CISOs, Fraud Managers and Risk Managers. To combat this challenge, companies often employ data protection (DP) systems to identify and control access to sensitive data (SD).
DP systems currently on the market can be divided into two types. The first type of DP system uses classification techniques to scan file contents for particular strings, keywords or data structures, which are then used to classify the files as containing SD or not. However, in most cases the classification technologies are rather primitive and rely primarily on rule engines to find and protect SD. Thus, the responsibility lies with the analyst to define a sufficiently robust set of rules for identifying SD.
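By way of illustration only, a rule-based classifier of this first type might be sketched as follows. The rule names and patterns below are hypothetical assumptions chosen for the example and are not taken from any particular DP product.

```python
import re

# Hypothetical rule set written by an analyst; names and patterns are
# illustrative assumptions, not rules from any real DP system.
RULES = {
    "keyword_confidential": re.compile(r"\bconfidential\b", re.IGNORECASE),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number_like": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def classify(text):
    """Return the names of all rules that match the given text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


def contains_sd(text):
    """Treat a file as containing SD if at least one rule matches."""
    return bool(classify(text))


if __name__ == "__main__":
    print(classify("Employee SSN: 123-45-6789"))
    # Content that no rule anticipates passes through unclassified.
    print(contains_sd("S S N: 1 2 3 4 5 6 7 8 9"))
```

The second call shows the dependence on the analyst: the same data written in a layout the rules do not anticipate is not flagged.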
More advanced DP systems use statistical fingerprinting technologies or hashing to generate a digital fingerprint of each file to be scanned, and compare the fingerprint to a fingerprint database containing fingerprints for files known to contain SD. Statistical fingerprinting techniques typically calculate certain statistical features of the file's byte heap, and use these statistical features to re-identify the same file, including after it has undergone some changes. The hashing method generates a single hash value from the file's byte heap using common hashing algorithms (MD5, SHA1, SHA256, etc.), which it then uses to re-identify the file.
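The two fingerprinting approaches can be sketched, under stated assumptions, as follows. The hash fingerprint uses SHA-256 from the Python standard library. The byte-frequency histogram is only one simple example of a statistical feature and is an assumption of this sketch, since the specific features used by existing systems are not detailed here. The sample data and the `known_sd` set are hypothetical.

```python
import hashlib
from collections import Counter


def hash_fingerprint(data):
    """Single hash value computed over the whole byte content."""
    return hashlib.sha256(data).hexdigest()


def byte_histogram(data):
    """Relative frequency of each of the 256 possible byte values.

    An assumed, deliberately simple statistical feature.
    """
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]


def histogram_distance(a, b):
    """Total variation distance between two histograms, in [0, 1]."""
    return sum(abs(x - y) for x, y in zip(a, b)) / 2


# Hypothetical fingerprint database of files known to contain SD.
original = b"payroll 2023: alice 90000, bob 85000"
known_sd = {hash_fingerprint(original)}


def is_known_sd(data):
    """Exact-match lookup of the hash fingerprint in the database."""
    return hash_fingerprint(data) in known_sd


if __name__ == "__main__":
    modified = original + b"!"
    print(is_known_sd(original), is_known_sd(modified))
    print(histogram_distance(byte_histogram(original), byte_histogram(modified)))
```

In this sketch the hash lookup re-identifies only byte-identical content, whereas the histogram distance stays small after a minor change, which is the property that lets a statistical fingerprint tolerate some modification.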
However, these methods of digital fingerprinting lack a sufficient degree of accuracy and have been known to generate a relatively large number of false positives and false negatives. In addition, these methods are not well equipped to handle cases where a file's content is modified in order to avoid detection (e.g. by changing the file format, splitting the file into several smaller files, inserting data into other files, encryption, obfuscation, etc.). Significant changes to certain elements of the file will sometimes result in a new digital fingerprint, thus preventing the system from identifying the changed file as a modified version of the original file. In addition, some of these methods generate fingerprints having a byte size that increases with the size of the original file, thus requiring a large amount of storage capacity in order to store the fingerprints.
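Two of these limitations can be illustrated with a short sketch. The per-block hashing scheme and the `BLOCK_SIZE` value below are assumptions made for the example, standing in for any fingerprint whose size grows with the file; they are not a description of a specific existing system.

```python
import hashlib

BLOCK_SIZE = 64  # hypothetical block size, in bytes


def whole_file_hash(data):
    """Single SHA-256 hash over the entire content."""
    return hashlib.sha256(data).hexdigest()


def block_fingerprint(data, block_size=BLOCK_SIZE):
    """One SHA-256 digest per fixed-size block (assumed scheme)."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def fingerprint_size(data):
    """Total number of bytes needed to store the block fingerprint."""
    return sum(len(digest) for digest in block_fingerprint(data))


if __name__ == "__main__":
    original = bytes(range(256)) * 4  # 1024 bytes of sample content

    # Evasion by splitting: neither part carries the original hash.
    parts = [original[:512], original[512:]]
    print([whole_file_hash(p) == whole_file_hash(original) for p in parts])

    # Evasion by insertion: one leading byte shifts every block boundary.
    shifted = b"\xff" + original
    shared = set(block_fingerprint(original)) & set(block_fingerprint(shifted))
    print(len(shared))

    # Storage cost grows with the size of the fingerprinted file.
    print(fingerprint_size(original), fingerprint_size(original * 2))
```

Under these assumptions, splitting the file or inserting a single byte leaves no matching fingerprint, while doubling the file size doubles the storage required for its fingerprint.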