Field
Embodiments of the present invention generally relate to network and data security technology. In particular, embodiments of the present invention relate to high-performance pattern matching for data leak detection and prevention, and to preprocessing of data to facilitate data leak prevention (DLP) pattern matching.
Description of the Related Art
Among the primary concerns of every user and organization connected over the Internet in this age of Information Technology (IT) are data security and the prevention of data leakage. Data privacy and data leak prevention (DLP) are key concerns for any organization because computing devices within a network may contain sensitive data/information that, if not protected effectively, can be transferred to anyone, anywhere across the globe, in very little time. Such sensitive data can include information relating to customers, bank account details, credit card details, social security numbers, dates of birth and the like. For an organization, such data can include sales contracts, customer lists, supplier lists, future product details, financial information, deliveries, supplies, medical records, employee details, manufacturing details, intellectual property, trade secrets and the like.
Existing systems and methods for DLP generally use pattern matching to identify sensitive data and then attempt to prevent its leakage. Because there may be thousands of such patterns for different data sets/types, identifying sensitive data by pattern matching can be time consuming and can therefore slow transactions. Generally, the data patterns to be matched against input strings are represented as regular expressions, the processing of which is computationally expensive and can lead to slow performance. Pattern matching for identification of sensitive data becomes more difficult for data types such as Social Security Numbers, credit card numbers, dates of birth, telephone numbers and vehicle registration numbers, among others, which may have standard patterns but also have complex requirements for different positions within data streams/strings, and hence require the creation and use of several regular expressions. For example, as of 2011, non-customized California vehicle registration plates use a seven-character alphanumeric serial format consisting of one digit (0-9), followed by three capital letters, followed by three digits (0-9). While a simple regular expression can be defined to identify such a pattern, other states have different serial formats, and the serial formats have changed over the years. As such, those skilled in the art will appreciate that a large number of regular expressions would be required to identify all possible serial formats used by every state over the years. Moreover, due to this complexity, such data types cannot be processed in parallel in the manner of other regular pattern matching implementations.
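By way of illustration only, the following Python sketch shows one possible regular expression for the 2011 California serial format described above, applied through a simple sequential scan. The names PATTERNS, ca_plate_2011 and scan are hypothetical and chosen for this example; they are not drawn from any particular DLP product. Only the California format is filled in. Each further state or historical format would need its own entry, and because every entry is applied in turn, the matching work grows with the number of registered patterns.

```python
import re

# Hypothetical registry of named patterns (an assumption for illustration).
# Only the California format described in the text is included:
# one digit, three capital letters, three digits.
PATTERNS = {
    "ca_plate_2011": re.compile(r"\b[0-9][A-Z]{3}[0-9]{3}\b"),
}


def scan(text):
    """Return (pattern_name, matched_text) pairs found in text.

    Every registered pattern is applied in turn, so the cost of a scan
    grows with the number of patterns in the registry.
    """
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


if __name__ == "__main__":
    print(scan("Plates 6ABC123 and 7XYZ999 were seen; ABC1234 was not."))
```

The word-boundary anchors are a design choice made for this sketch so that longer alphanumeric runs such as ABC1234 are not reported; a production system might apply different boundary rules.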
There is therefore a need for improved, high-performance pattern matching that is capable of efficiently detecting sensitive data while in use (e.g., endpoint actions), in motion (e.g., network traffic) and/or at rest (e.g., data storage) to prevent data leakage.