Many organizations collect system logs, error reports, and other data for routine business purposes. This data can be mined to create and maintain a competitive advantage. However, because such data may contain user information needed for legitimate purposes, the increased availability of, and access to, such data also exposes companies to increased risk of data misuse, intentional or otherwise.
Scrubbing of personally identifiable information (PII) is a standard technique used to remove user information from logs so that the information can be safely accessed by a wider audience, but scrubbing is costly, limited, and imperfect. Although users may be able to identify their personal information when they see it, it is much more difficult to build a scalable and cost-effective system that can do the same thing over the petabytes of data that large organizations collect on a daily basis.
Protection of some PII may be required by governmental laws or regulations. Even when protection is not required, organizations may be motivated to protect PII for various reasons, such as fostering trust with customers/users or minimizing legal risk. Regardless of the reason, protecting PII is expensive in terms of resources (e.g., processing time, storage space, and development time for PII scrubbers), which ultimately translates into a financial cost to the organization. The time required to process logs potentially containing PII may result in organization personnel having delayed access to logs containing the most current information. If a scrubber for the log is not currently available or needs to be modified, organization personnel may not have access to the logs until the situation is remedied. Depending upon priorities and resources, the lag between determining that a log potentially contains PII and developing a proper scrubber may be several months or longer.
In an attempt to minimize the time until the scrubbed logs are available, organizations may resort to over-scrubbing the logs using “brute-force” scrubbers. These brute-force scrubbers often employ imprecise or destructive techniques, such as scrubbing entire portions of the log that potentially contain PII rather than determining whether the data is actually PII, scrubbing that results in the permanent loss of at least some of the original data, and/or replacing distinct data items with a single token that covers the entire group.
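The difference between a brute-force scrubber and a more precise one can be illustrated with a minimal Python sketch. The log line, its field layout, the function names, and the simplified email pattern below are all hypothetical and chosen only for illustration; they are not drawn from any particular logging system or scrubber.

```python
import re

# Hypothetical log line and field layout, invented for illustration only.
LOG_LINE = "2024-01-05 12:00:01 GET /search?q=weather&user=alice@example.com status=200"

# Deliberately simplified email pattern; real PII detection is much harder.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def brute_force_scrub(line: str) -> str:
    """Replace the whole request component with one token,
    without checking whether it actually contains PII."""
    date, time, method, _request, status = line.split(" ", 4)
    return " ".join([date, time, method, "<SCRUBBED>", status])


def targeted_scrub(line: str) -> str:
    """Replace only the substrings that match the email pattern."""
    return EMAIL.sub("<EMAIL>", line)


if __name__ == "__main__":
    print(brute_force_scrub(LOG_LINE))
    print(targeted_scrub(LOG_LINE))
```

In this sketch the brute-force version discards the non-PII query term along with the email address, while the targeted version keeps it available for analysis.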
Over-scrubbing often increases processing time and storage requirements because protection is applied indiscriminately even if the target data is not PII. The reason for the increased storage requirements is that many of the protection techniques produce values that are significantly larger than the data being protected. Indiscriminately protecting portions of the log may easily double or triple the original log size. Another cost of over-scrubbing is the loss of valuable business intelligence originally contained in the logs, because data has either been destroyed or been transformed in a manner that limits the ability to analyze it. For example, over-scrubbing may use a single replacement for an entire component of a message (or even the entire message) instead of replacing the individual pieces that make up the larger component, making it difficult, if not impossible, to make any meaningful use of the protected data.
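The storage effect can be sketched in Python under stated assumptions. Here a SHA-256 hex digest stands in for a protection technique only because its output has a fixed, easily measured length; no specific technique is identified above, and an unsalted hash is not suggested as adequate protection. The record, its field names, and the choice of which field is PII are hypothetical, and the very short values exaggerate the growth relative to a real log.

```python
import hashlib


def protect(value: str) -> str:
    """Stand-in protection step: SHA-256 hex digest (always 64 characters)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def size(rec: dict) -> int:
    """Total number of characters across all values in the record."""
    return sum(len(v) for v in rec.values())


# Hypothetical record; only the "user" field is assumed to be PII.
record = {"ts": "12:00:01", "user": "alice", "status": "200", "ms": "35"}

# Indiscriminate protection: every field is transformed.
indiscriminate = {k: protect(v) for k, v in record.items()}

# Targeted protection: only the PII field is transformed.
targeted = {k: (protect(v) if k == "user" else v) for k, v in record.items()}

if __name__ == "__main__":
    print(size(record), size(targeted), size(indiscriminate))
```

Because each protected value is longer than the value it replaces, the record grows with every field that is protected, and the non-PII fields in the indiscriminate version can no longer be read or aggregated directly.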
It is with respect to these and other considerations that the present invention has been made. Although relatively specific problems have been discussed, it should be understood that the embodiments disclosed herein should not be limited to solving the specific problems identified in the background.