The present invention generally relates to methodologies for improving the performance of searching data records for encrypted data values containing a search string.
Electronic data is ubiquitous, and applications utilizing electronic data are increasingly widespread. One common need when working with electronic data is to search or filter for data records containing a certain substring, either at the beginning, at the end, or anywhere in the string.
For example, FIG. 1 illustrates a user interface of an electronic health records (EHR) application which allows a user to search for patients having a last name starting with the string “Doe” (this search functionality may be case sensitive or case insensitive). As illustrated, the EHR application has located three patients having a last name starting with the input string.
FIG. 2 illustrates an exemplary “Patients” data table that may be used in accordance with such an EHR application. This “Patients” data table includes patient records for a plurality of patients, including the three located patients which have a last name starting with “Doe”.
FIG. 3 illustrates a simplified conventional methodology for locating the patients having a last name starting with “Doe”. In particular, FIG. 3 illustrates an SQL query which selects a unique patient ID value (PatientGUID) from the “Patients” data table for each record having a last name that matches the string “Doe%”, where the “%” represents a wildcard that is matched by any sequence of characters (including an empty string “”, such that the pattern also matches “Doe” itself).
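By way of a non-limiting illustration, a query of the kind described above can be sketched using Python's built-in sqlite3 module. Only the “Patients” table name and the PatientGUID column come from the description above; the FirstName and LastName column names, the sample rows, and the helper function name are assumptions made for this sketch. Note that the LIKE operator in SQLite is case insensitive for ASCII characters by default, which is only one of the two behaviors contemplated above.

```python
import sqlite3


def find_patients_by_last_name_prefix(conn, prefix):
    """Return PatientGUID values whose last name starts with `prefix`.

    Hypothetical helper: the LastName column name is an assumption.
    A prefix that itself contains '%' or '_' would be treated as a
    wildcard; escaping is omitted to keep the sketch short.
    """
    pattern = prefix + "%"  # e.g. "Doe" becomes "Doe%"
    rows = conn.execute(
        "SELECT PatientGUID FROM Patients "
        "WHERE LastName LIKE ? ORDER BY PatientGUID",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]


# Illustrative in-memory table with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Patients ("
    "PatientGUID TEXT PRIMARY KEY, FirstName TEXT, LastName TEXT)"
)
conn.executemany(
    "INSERT INTO Patients VALUES (?, ?, ?)",
    [
        ("guid-1", "John", "Doe"),
        ("guid-2", "Jane", "Doering"),
        ("guid-3", "Sam", "Smith"),
        ("guid-4", "Ann", "Doerr"),
    ],
)

matches = find_patients_by_last_name_prefix(conn, "Doe")
print(matches)
```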
In accordance with this simplified conventional methodology, an index can be constructed which is sorted by patient last names, as illustrated in FIG. 4. This sorted index allows a conventional search methodology, such as a binary search algorithm, to be utilized to locate records having a last name value matching the string “Doe%” (i.e. starting with “Doe”). For example, in accordance with a simple binary search algorithm, a middle record of the index could be identified, as illustrated in FIG. 5, and the last name value for this middle record could then be compared to the search string (e.g. “Doe” or “Doe%”), as illustrated in FIG. 6. Because the index is sorted by the last name value, if the last name value of the middle record does not match the comparison value, the result of the comparison indicates which half of the index might contain records matching the comparison value. This process can be repeated to quickly locate any records matching the comparison value without having to compare the last name value of every record, as illustrated in FIG. 7.
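As a minimal sketch of the binary search described above, the following Python example uses the standard bisect module to find the first index entry at or after the search prefix and then walks forward while entries still start with that prefix. The representation of the index as a sorted list of (last name, patient ID) tuples, the sample data, and the function name are assumptions made for illustration, and the comparison here is case sensitive.

```python
from bisect import bisect_left


def find_by_prefix(index, prefix):
    """Return patient IDs whose last name starts with `prefix`.

    `index` is assumed to be a list of (last_name, patient_id) tuples
    sorted in ascending order, standing in for the sorted index.
    """
    # The one-element tuple (prefix,) sorts before every
    # (prefix, patient_id) entry, so this binary search lands on the
    # first entry whose last name is greater than or equal to the prefix.
    position = bisect_left(index, (prefix,))
    matches = []
    # Matching entries are contiguous in a sorted index, so stop at
    # the first entry that no longer starts with the prefix.
    while position < len(index) and index[position][0].startswith(prefix):
        matches.append(index[position][1])
        position += 1
    return matches


# Made-up sample data, sorted by last name.
index = sorted(
    [
        ("Adams", "p1"),
        ("Doe", "p2"),
        ("Doerr", "p3"),
        ("Doering", "p4"),
        ("Smith", "p5"),
        ("Baker", "p6"),
    ]
)

result = find_by_prefix(index, "Doe")
print(result)
```

Only the entries near the matching range are examined; the binary search itself takes a number of comparisons that grows logarithmically with the size of the index.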
It will be appreciated that this is a highly simplified conventional methodology for locating records matching a comparison value, and that other methodologies can be used as well. For example, a binary search tree could be used. FIG. 8 illustrates the top of an exemplary binary search tree.
Overall, there exist a wide variety of methodologies for efficiently searching records including plain text values to locate records containing a search string without having to access and compare every single record (e.g. without having to perform an index scan).
Notably, though, security of electronic data is often very important, and electronic data is frequently encrypted. For example, with respect to electronic health record data, it is necessary to encrypt protected health information (PHI). If deterministic encryption is used, then methodologies for efficiently identifying records having a value that is identical to a search string can still be useful, as a search string can be encrypted and compared to stored encrypted values, and an index can be sorted by encrypted values to speed up searching. However, methodologies such as those noted above for efficiently identifying records containing a search string without having to access and compare every single record break down when encrypted data is utilized.
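The distinction drawn above between exact-match searching and substring searching can be illustrated with a short Python sketch. As a loudly stated assumption, a keyed HMAC-SHA256 token is used here as a stand-in for deterministic encryption: like deterministic encryption, it maps equal inputs to equal outputs, but unlike encryption it cannot be reversed. A dictionary keyed by token likewise stands in for an index sorted by encrypted values. The key, sample data, and function names are hypothetical.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical key for illustration


def token(key, value):
    """Deterministic keyed token (stand-in for deterministic encryption)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def build_equality_index(key, records):
    """Map each last-name token to the patient IDs that carry it."""
    index = {}
    for patient_id, last_name in records:
        index.setdefault(token(key, last_name), []).append(patient_id)
    return index


def find_exact(key, index, search):
    """Exact match: tokenize the search string and look it up directly."""
    return index.get(token(key, search), [])


# Made-up sample data.
records = [("p1", "Adams"), ("p2", "Doe"), ("p3", "Doering"), ("p4", "Doe")]
index = build_equality_index(KEY, records)

exact = find_exact(KEY, index, "Doe")
print(exact)

# The token for "Doering" bears no prefix relationship to the token
# for "Doe", so a "starts with" search cannot be answered from tokens.
shares_prefix = token(KEY, "Doering").startswith(token(KEY, "Doe"))
print(shares_prefix)
```

The exact-match lookup finds the two “Doe” records without touching any other record, while the record for “Doering” is not reachable from the token for “Doe”, which is the breakdown noted above.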
Returning to the previous example, FIG. 9 illustrates a “Patients_enc” data table which includes the same data as the “Patients” data table, except that the first name and last name values have been encrypted. Notably, the methodology described above of using an index sorted by last name values is not as useful when the last name values are encrypted, as it is not possible to sort the index by the unencrypted last name values without accessing and decrypting every record, as illustrated in FIG. 10. Thus, a conventional approach for searching for records having a last name value that starts with the string “Doe” would generally involve an index scan that requires accessing and decrypting every record. For example, FIG. 11 illustrates a conventional query which decrypts every record and returns PatientGUID values for those records having a last name value starting with “Doe”.
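The cost of such a scan can be sketched in Python as follows. This is an illustration under stated assumptions rather than the query of FIG. 11: the cipher below is a toy, deterministic, reversible construction built from the standard hashlib and hmac modules solely so that the example is self-contained, and it is not suitable for protecting real data. The key, sample rows, and function names are hypothetical.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical key for illustration


def _keystream(key, iv, length):
    """Derive `length` pseudo-random bytes from the key and IV (toy only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key, plaintext):
    """Toy deterministic cipher: the IV is derived from the plaintext."""
    data = plaintext.encode("utf-8")
    iv = hmac.new(key, data, hashlib.sha256).digest()[:16]
    body = bytes(a ^ b for a, b in zip(data, _keystream(key, iv, len(data))))
    return iv + body


def toy_decrypt(key, ciphertext):
    """Invert toy_encrypt by regenerating the same keystream."""
    iv, body = ciphertext[:16], ciphertext[16:]
    data = bytes(a ^ b for a, b in zip(body, _keystream(key, iv, len(body))))
    return data.decode("utf-8")


def scan_for_prefix(key, table, prefix):
    """Decrypt every row and keep those whose last name starts with prefix."""
    matches = []
    decryptions = 0
    for patient_id, enc_last_name in table:
        decryptions += 1  # every record is accessed and decrypted
        if toy_decrypt(key, enc_last_name).startswith(prefix):
            matches.append(patient_id)
    return matches, decryptions


# Made-up sample data standing in for the "Patients_enc" table.
names = [("p1", "Adams"), ("p2", "Doe"), ("p3", "Doering"),
         ("p4", "Smith"), ("p5", "Doerr")]
patients_enc = [(pid, toy_encrypt(KEY, name)) for pid, name in names]

matches, decryptions = scan_for_prefix(KEY, patients_enc, "Doe")
print(matches, decryptions)
```

In this sketch the number of decryptions always equals the number of records in the table, regardless of how few records actually match.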
It will be appreciated, however, that a search operation for encrypted records that requires accessing every record is much less efficient than the above-noted methodologies for unencrypted records, which do not require accessing every record. Although many conventional methodologies for efficiently searching for data records having a value that is identical to a search string can remain useful with encrypted data, a need remains for a methodology for improving the performance of searching for records having an encrypted value that contains a search string.
This, and other needs, are addressed by one or more aspects of the present invention.