Data leakage prevention systems commonly classify data based on its content. Identifying individual pieces of data, such as social security numbers and credit card numbers, produces a high rate of false positives. Format validation alone can accept any nine-digit number as a social security number, when the number may actually be a product number, a document reference, or something else entirely.
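The weakness of format-only validation can be illustrated with a minimal sketch. The pattern `SSN_FORMAT`, the function `find_ssn_candidates`, and the sample text are hypothetical and are not taken from any particular data leakage prevention product; the sketch assumes a check that accepts any nine digits, optionally grouped 3-2-4 with hyphens.

```python
import re

# Hypothetical format-only check: exactly nine digits, optionally
# written as 3-2-4 groups, not embedded in a longer run of digits.
SSN_FORMAT = re.compile(r"(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)")

def find_ssn_candidates(text: str) -> list:
    """Return every substring that merely looks like a social security number."""
    return SSN_FORMAT.findall(text)

# Neither number below is described as a social security number,
# yet both satisfy the format check.
sample = "Order ref 123456789 shipped; part no. 987-65-4321 is back-ordered."
candidates = find_ssn_candidates(sample)
print(candidates)
```

Both the order reference and the part number are reported as candidates, which is the kind of false positive described above.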
Validation is available for many formats by using an algorithm, such as the Luhn formula for credit cards. The Luhn algorithm or Luhn formula, also known as the “modulus 10” or “mod 10” algorithm, is a simple checksum formula used to validate a variety of identification numbers, such as credit card numbers. It was created by IBM scientist Hans Peter Luhn and described in U.S. Pat. No. 2,950,048.
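A minimal sketch of the commonly described form of the Luhn check follows. The function name `luhn_valid` and the choice to ignore non-digit separators are assumptions made for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn (mod 10) checksum."""
    # Keep ASCII digits only, so separators such as spaces or hyphens are ignored.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit (the check digit) leftwards,
    # doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit altered
```

The checksum only detects accidental errors such as a single mistyped digit. It says nothing about whether the number was ever issued as a credit card.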
The problem with relying on certain validation methods alone is that data may pass the validation and still be incorrect. For example, a 16-digit number may pass a Luhn check without being a credit card number, yielding a false positive (also known as a “type I error”). Moreover, when individual assertions contribute to the overall assertion about a file, larger files have a greater probability of false positives. A common approach to keeping false positives from causing a file to be erroneously classified is to use a threshold value: the file as a whole is considered to contain a certain data type only when more than a threshold number of assertions of that data type have been made. Setting the threshold too high causes a false negative problem, where valid classifications are missed because the threshold was not met. With large files, incorrect individual classifications recur more often, so the probability of exceeding any fixed threshold through false positives alone grows. Setting threshold values merely obscures the problem instead of solving it.
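The effect of file size on a fixed threshold can be sketched with a simple probability model. This is an illustration under stated assumptions, not a description of any particular system: each candidate number is treated as an independent trial, and the per-candidate false positive rate `p` is set to 0.1, which is the chance that a uniformly random digit string passes a Luhn check. The names `prob_at_least` and `prob_file_misclassified` are hypothetical.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): at least k false positives in n candidates."""
    if k <= 0:
        return 1.0
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below

def prob_file_misclassified(threshold: int, n: int, p: float = 0.1) -> float:
    """Chance that false positives alone push a file over the threshold.

    The file is classified when MORE than `threshold` assertions are made,
    so threshold + 1 false positives are needed.
    """
    return prob_at_least(threshold + 1, n, p)

# Same threshold, growing number of non-card candidate numbers in the file.
for n in (10, 100, 1000):
    print(n, prob_file_misclassified(threshold=5, n=n))
```

Under these assumptions the misclassification probability rises toward 1 as the number of candidates grows, while raising the threshold to compensate makes small files that do contain card numbers easier to miss.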
Even though assigning threshold values based on the size of the file may help reduce false positives, there remains a need for a method and system for classifying data that reduce the inaccuracies that occur when categorizing and identifying data into classes or fields of data.