Due to recent technological advances, individuals and organizations may quickly and easily share, access, and disseminate high volumes of digital information. For many individuals and organizations, the ease with which information may be electronically disseminated is empowering. However, the ubiquity of high-speed Internet access, smart mobile devices, and portable storage devices may pose unique challenges for individuals and organizations concerned with preventing the loss and/or exposure of sensitive data. Individuals and organizations are therefore increasingly looking to data loss prevention (“DLP”) solutions to protect their sensitive data.
In order to identify sensitive structured data in unstructured text, typical DLP solutions may maintain an index of the sensitive structured data and search the index for tokens that appear in the unstructured text. In order to maintain the security of the sensitive structured data, such indexes may contain salted cryptographic hashes of the sensitive data rather than the sensitive data itself. Unfortunately, for large structured datasets, the number of values indexed can be in the billions, resulting in an index that may be dozens of gigabytes or more in size. Moreover, in order to enable efficient lookup, such indexes may be stored entirely in memory, which may limit the environments where such indexes may be deployed to dedicated servers with abundant memory. Furthermore, distributing the indexes beyond tightly controlled server environments may also present security challenges. For example, even though indexed values may be cryptographically hashed, the indexed values may be drawn from a relatively small set of possible values, such that a dictionary attack that hashes every candidate value and compares the results against the index may be quite feasible. The instant disclosure, therefore, identifies and addresses a need for improved systems and methods for searching unstructured documents for structured data.
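By way of illustration only, the conventional hashed-index lookup described above, and the dictionary attack it may be exposed to, might be sketched as follows. This is a minimal sketch under stated assumptions and not a description of any particular DLP product: the choice of SHA-256, the single shared salt, the token pattern, and the names `SALT`, `salted_hash`, `build_index`, `find_sensitive_tokens`, and `dictionary_attack` are all hypothetical, and the sample values are made up.

```python
import hashlib
import re

# Hypothetical salt. The sketch assumes one salt shared by every entry and
# shipped with the index, since a lookup must hash tokens the same way.
SALT = b"example-salt"


def salted_hash(value: str, salt: bytes = SALT) -> str:
    """Return the hex SHA-256 digest of the salt followed by the value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def build_index(sensitive_values):
    """Index the salted hashes of the values, not the values themselves."""
    return {salted_hash(v) for v in sensitive_values}


def find_sensitive_tokens(text: str, index) -> list:
    """Tokenize the text and return tokens whose salted hash is indexed."""
    tokens = re.findall(r"[\w-]+", text)  # assumed token pattern
    return [t for t in tokens if salted_hash(t) in index]


def dictionary_attack(index, candidates) -> list:
    """Recover indexed values by hashing every candidate in a small space."""
    return [c for c in candidates if salted_hash(c) in index]


# In-memory index over two made-up identifier values.
index = build_index(["123-45-6789", "987-65-4321"])

# Searching unstructured text for indexed values.
hits = find_sensitive_tokens("Please bill SSN 123-45-6789 today.", index)

# An attacker holding the index and salt enumerates 10,000 candidates.
candidates = (f"123-45-{n:04d}" for n in range(10000))
recovered = dictionary_attack(index, candidates)
```

In this sketch, `hits` holds the single matching token from the text, and `recovered` holds the one indexed value that falls inside the enumerated candidate range. The whole index lives in an in-memory set, which is what makes lookup fast and also what makes a multi-billion-entry index costly to deploy.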