A modern organization typically maintains a data storage system to store and deliver records concerning significant aspects of its business. Stored records may include data on customers (or patients), contracts, deliveries, supplies, employees, manufacturing, or the like. These records are typically hosted by a computer connected to a local area network (LAN), and this computer is usually made accessible to the Internet via a firewall, router, or other packet-switching device. Although network accessibility allows information to be used more efficiently, it also poses security problems because of the highly sensitive nature of this information. In particular, because access to these records is essential to the job function of many employees in the organization, there are many points at which this information could be stolen or accidentally distributed. Theft of information represents a significant business risk in terms of both the value of the intellectual property and the legal liabilities related to regulatory compliance.
A significant part of confidential information consists of well-defined personal identifiers such as credit card numbers, social security numbers, account numbers, employee numbers, customer or patient numbers, IP addresses, driver's license numbers, license plate numbers, etc. These personal identifiers typically contain digits and letters grouped together in a well-defined format. However, for each personal identifier, the format may have multiple variations. For example, a social security number may be written as a nine-digit number or may have spaces or dashes as delimiters. A credit card number may have up to 35 variations. Apart from these variations, the format is usually very rigid, consisting of a fixed number of digit and letter combinations in a certain order.
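The social security number example above can be sketched in code. The following is a minimal illustration only, assuming three written forms (no delimiter, spaces, dashes) with a 3-2-4 digit grouping; the names `SSN_VARIATIONS` and `classify_ssn_format` are hypothetical and do not come from any particular detection system.

```python
import re

# Assumed variations of one rigid format: nine digits, grouped 3-2-4,
# written with no delimiter, with spaces, or with dashes.
SSN_VARIATIONS = {
    "plain": re.compile(r"[0-9]{9}"),
    "spaces": re.compile(r"[0-9]{3} [0-9]{2} [0-9]{4}"),
    "dashes": re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}"),
}


def classify_ssn_format(candidate: str):
    """Return the name of the variation the whole string matches, or None."""
    for name, pattern in SSN_VARIATIONS.items():
        # fullmatch requires the entire string to fit the rigid format.
        if pattern.fullmatch(candidate):
            return name
    return None
```

Each variation is listed explicitly here, which shows how the number of patterns to check grows with the number of accepted written forms.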
Existing pattern detection technologies, such as regular expression implementations, are not optimized for these rigid pattern formats and their variations. As a result, memory or CPU performance may degrade as the number of variations grows. In addition, existing pattern detection technologies are not very accurate and produce a significant number of false positives.
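As a rough illustration of the false-positive concern, consider a single loosely written regular expression meant to cover all three social security number forms. This is a sketch under assumptions: the pattern `NAIVE_SSN` and the helper `naive_matches` are hypothetical and are not taken from any existing product.

```python
import re

# Assumed naive pattern: optional space or dash between the 3-2-4 groups,
# with no check on what surrounds the match.
NAIVE_SSN = re.compile(r"[0-9]{3}[- ]?[0-9]{2}[- ]?[0-9]{4}")


def naive_matches(text: str):
    """Return every substring the naive pattern reports as an identifier."""
    return [m.group(0) for m in NAIVE_SSN.finditer(text)]
```

With this pattern, `naive_matches("Order 1234567890123")` reports the first nine digits of an unrelated thirteen-digit order number, and `naive_matches("Invoice 2023-04-1234")` reports `023-04-1234`, which is a fragment of a date-like reference rather than an identifier.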