The present invention relates to data validation, for example in the context of data loss prevention, and, more particularly, to a method and device that use statistical heuristics to reduce the number of false positives.
Data is more accessible and transferable today than ever before, and the vast majority of data is sensitive at various levels. Some data is confidential simply because it is internal to an organization and was not meant to be available to the public. Leakage of data could be embarrassing or, worse, could cost the organization an industrial edge or cause a loss of accounts.
Records of personally identifiable information, sales reports and customer information are examples of very sensitive information. Furthermore, many organizations are subject to regulatory compliance liabilities.
Data Loss Prevention (DLP) systems identify, monitor, and protect data transfer through deep content inspection and analysis of transaction parameters (such as source, destination, data, and protocol). In short, DLP detects and prevents the unauthorized release of confidential information.
DLP systems commonly use several analysis methods, such as file signatures, keyword searches, pattern matching (regular expressions), and other sophisticated techniques, to identify and recognize an organization's confidential data.
Pattern matching is used to search for data that has a predefined structure, for example credit card numbers, which are commonly 16-digit numbers. Credit card numbers also have a check digit, so not every 16-digit number is a valid credit card number. Running the credit card check digit validation function (the Luhn mod-10 checksum) determines whether the 16-digit number might be a credit card number. Note that not all 16-digit numbers that pass the mod-10 validation function are valid credit card numbers. However, all valid credit card numbers pass the mod-10 validation function.
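The check digit validation described above can be sketched as follows. This is a minimal illustration of the standard Luhn mod-10 checksum, not an implementation taken from the present disclosure; the function name `luhn_valid` and the decision to ignore spaces and hyphens are assumptions made for the example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn mod-10 checksum."""
    # Assumption for this sketch: spaces and hyphens are delimiters to ignore.
    compact = number.replace(" ", "").replace("-", "")
    if not compact or not (compact.isascii() and compact.isdigit()):
        return False

    total = 0
    # Walk from the rightmost digit (the check digit) to the left,
    # doubling every second digit and subtracting 9 when the result exceeds 9.
    for position, ch in enumerate(reversed(compact)):
        value = int(ch)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value

    # The number passes when the weighted sum is a multiple of 10.
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # a well-known test number
print(luhn_valid("4111 1111 1111 1112"))  # same number with a wrong check digit
```

Passing this check only shows that a number might be a credit card number; it does not show that the number was ever issued.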
This credit card number example illustrates the fact that, when using the pattern matching method, an additional validation may be used when applicable. Such a validation method improves the accuracy of a DLP system in examining classified data.
Pattern matching only checks for certain structures of the digits, for example groups of four digits followed by a delimiter or digits that match a prefix in a list of prefixes corresponding to different credit card issuers. Data is “validatable” if, beyond the pattern matching method, a validation procedure exists that determines whether the data is correctly identified as sensitive (like the mod-10 calculation in the credit card example).
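The two stages described above, a structural pattern match followed by a validation procedure, might be combined as in the following sketch. The regular expression, the short issuer prefix list, and the names `CARD_PATTERN`, `ISSUER_PREFIXES`, `passes_mod10` and `scan` are all hypothetical choices made for illustration; a real DLP system would use its own patterns and a far more complete prefix list.

```python
import re

# Hypothetical pattern: four groups of four digits, with the same optional
# delimiter (space or hyphen) between every group.
CARD_PATTERN = re.compile(r"\b[0-9]{4}([ -]?)[0-9]{4}\1[0-9]{4}\1[0-9]{4}\b")

# Illustrative, deliberately incomplete list of issuer prefixes.
ISSUER_PREFIXES = ("4", "51", "52", "53", "54", "55")


def passes_mod10(digits: str) -> bool:
    """Luhn mod-10 check on a string that contains only ASCII digits."""
    total = 0
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def scan(text: str) -> list[tuple[str, bool]]:
    """Return (digits, validated) for every pattern match with a known prefix."""
    results = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group(0))
        if not digits.startswith(ISSUER_PREFIXES):
            continue  # structure matched, but the prefix is not in the list
        results.append((digits, passes_mod10(digits)))
    return results


sample = "Order 4111-1111-1111-1111 shipped; ref 4111 1111 1111 1112; id 1234567890123456"
for digits, validated in scan(sample):
    print(digits, validated)
```

In this sketch the pattern stage alone would report two candidates, while the validation stage keeps only the one that passes the mod-10 calculation.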
However, using patterns with a validation function may still generate false positives, thus increasing the administrative burden of managing a DLP system and decreasing the effectiveness of such a system. For instance, suppose that a DLP system searches for 20 national IDs that are 9-digit numbers with a mod-10 validation function, and that the examined text is a phonebook containing 9-digit phone numbers. Because a single check digit rejects only about nine out of every ten arbitrary numbers, it is very likely that some of the phone numbers that match the 9-digit pattern will also pass the mod-10 calculation and thus will be considered by the DLP system to be valid national ID numbers.
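The scale of this problem can be illustrated with a short count. The sketch below assumes, purely for illustration, that the mod-10 validation function is the Luhn checksum and that phone numbers behave like arbitrary 9-digit numbers; the range of numbers and the names `passes_mod10` and `count_passing` are hypothetical.

```python
def passes_mod10(digits: str) -> bool:
    """Luhn mod-10 check, used here as a stand-in for the national ID check."""
    total = 0
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def count_passing(start: int, stop: int) -> int:
    """Count the 9-digit numbers in [start, stop) that pass the checksum."""
    return sum(passes_mod10(f"{n:09d}") for n in range(start, stop))


# 10,000 consecutive 9-digit numbers standing in for a phonebook.
passing = count_passing(100_000_000, 100_010_000)
print(passing)  # 1000: exactly one number in every block of ten passes
```

For any fixed leading eight digits, exactly one final digit satisfies the checksum, so about one tenth of arbitrary 9-digit numbers would be reported as valid.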
It would be highly advantageous to have a DLP method and system that is more robust relative to false positives than known DLP methods and systems.