Data masking, also called redaction, is an important data management technology that prevents unauthorized users from accessing sensitive data. Data masking may be applied to stored data at any time, applied when data elements are changed in the persistent data store, or applied to data in transit, in which case data elements are changed while being transmitted to the data consumer.
Data masking techniques may be reversible or irreversible. Reversible data masking allows recovery of the original data from its masked representation. Data element encryption is an example of a reversible data masking technique. Irreversible data masking, by contrast, transforms the original data element in such a way that its original content is wholly or partially lost. For example, one irreversible masking technique extracts a substring of a character string and replaces the remaining characters with arbitrary values.
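The irreversible substring technique described above can be sketched in Python as follows. This is a minimal illustration only: the function name `redact_keep_suffix`, its parameters, and the choice of `x` as the fill character are assumptions made for the example, not part of any particular masking product.

```python
def redact_keep_suffix(value: str, keep: int = 4, fill: str = "x") -> str:
    """Irreversibly mask every alphanumeric character except the last `keep`.

    Separators such as '-' are left in place. The masked characters are
    discarded, so the original value cannot be recovered from the result.
    (Hypothetical helper, for illustration only.)
    """
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - keep, 0)  # how many leading characters to replace
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(fill)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


masked = redact_keep_suffix("4111-1111-1111-1234")
print(masked)  # xxxx-xxxx-xxxx-1234
```

Because the first twelve digits are replaced with a constant, nothing in the output identifies which original value produced it, which is what makes the technique irreversible.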
Traditional data masking is not application friendly. When traditional data masking techniques, such as partial redacting, are applied, applications produce different results than they would with the original unmasked data elements. This is especially so when sensitive data is syntactically defined, for example as a formatted data string such as a driver's license number stored as the data element PA12345678, where the first two members of the data element represent the state of issue and are limited to a set of fifty two-letter state identifiers. In such a case, a masking that delivers the data element ZX87654321 to an application might cause errors during processing if the application expects one of the fifty state identifiers. As another example, a query on a data set in which each data element has the first 12 digits of a credit card number masked (for example, xxxx-xxxx-xxxx-1234) may produce a different result than a query on the unmasked data set, because distinct credit cards may share the same last four digits of the account number.
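Both failure modes can be reproduced with a short Python sketch. The names `VALID_STATES`, `parse_license`, and `mask_card` are hypothetical, the state set is a small illustrative subset rather than the full list of fifty, and the card numbers are made-up sample values.

```python
from typing import Tuple

# Illustrative subset of the two-letter state identifiers (assumption).
VALID_STATES = {"PA", "CA", "NY", "TX"}


def parse_license(value: str) -> Tuple[str, str]:
    """Split a license number into (state, serial), rejecting unknown states."""
    state, serial = value[:2], value[2:]
    if state not in VALID_STATES:
        raise ValueError(f"unknown state identifier: {state!r}")
    return state, serial


def mask_card(number: str) -> str:
    """Partial redaction: keep only the last four digits."""
    return "xxxx-xxxx-xxxx-" + number[-4:]


# Failure mode 1: the masked license no longer parses.
print(parse_license("PA12345678"))  # ('PA', '12345678')
try:
    parse_license("ZX87654321")
except ValueError as exc:
    print("application error:", exc)

# Failure mode 2: distinct cards collapse to the same masked value,
# so a COUNT(DISTINCT ...) style query returns a different answer.
cards = ["4111-1111-1111-1234", "5500-0000-0000-1234", "4000-0000-0000-9999"]
distinct_original = len(set(cards))
distinct_masked = len({mask_card(c) for c in cards})
print(distinct_original, distinct_masked)  # 3 2
```

The first two sample cards differ in their leading digits but share the suffix 1234, so they become indistinguishable once masked.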
Format preserving encryption technology (“FPE”) exhibits certain desirable properties, but it has difficulty handling, or is entirely incapable of handling, data elements that have specialized format transform rules, and it requires the management of sensitive cryptographic material. For example, a California license plate has a syntactically constructed format in which the first member is a digit between two and seven, the next three members are letters, and the last three members are digits between zero and nine. FPE is incapable of performing a semantically correct transformation of a complex data element such as a California license plate number because the components of the data object are independent of one another. For example, the three-letter code cannot be derived from the serial number value, and vice versa. Any attempt to adjust the three-letter code to achieve semantic correctness of the license plate number either leads to the loss of original information during decryption or requires additional information to be stored in the database, which effectively increases the size of the protected data objects in the database.
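The license plate format described above can be written as a regular expression, which makes it easy to see how a transformation that preserves only character classes can still violate a positional rule. The sketch below is not an FPE implementation: `naive_class_preserving_mask` is a deliberately simple, hypothetical stand-in (a fixed shift, with no key and no security) used only to show a digit-for-digit, letter-for-letter substitution producing an invalid plate. Uppercase ASCII letters are assumed.

```python
import re

# First member 2-7, then three letters, then three digits 0-9.
CA_PLATE = re.compile(r"[2-7][A-Z]{3}[0-9]{3}")


def is_valid_ca_plate(plate: str) -> bool:
    """Return True if the plate matches the format described in the text."""
    return CA_PLATE.fullmatch(plate) is not None


def naive_class_preserving_mask(plate: str, shift: int = 5) -> str:
    """Replace digits with digits and uppercase letters with letters.

    Hypothetical stand-in for a transformation that preserves character
    classes but knows nothing about the restricted range of position one.
    """
    out = []
    for ch in plate:
        if "0" <= ch <= "9":
            out.append(str((int(ch) + shift) % 10))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


original = "4ABC123"
masked = naive_class_preserving_mask(original)
print(masked)                       # 9FGH678
print(is_valid_ca_plate(original))  # True
print(is_valid_ca_plate(masked))    # False: leading 9 is outside 2-7
```

The masked value has the right length and the right character class at every position, yet fails validation because the first member falls outside the permitted range.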
Accordingly, improvements are needed in systems for masking data while preserving formatting in a deterministic fashion, such that each instance of an original data element, when transformed by the data masking system under the same conditions, results in the same masked data element having the same format.
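The two properties named above, determinism and format preservation, can be illustrated with a small Python sketch. This is not the system the document goes on to describe; it is an assumed, simplified construction in which a SHA-256 digest of the value and a hypothetical `context` string (standing in for "the same conditions") selects a replacement character of the same class at each position. It is irreversible, does not guarantee distinct outputs for distinct inputs, and does not enforce semantic rules such as a valid state identifier.

```python
import hashlib
import string


def deterministic_mask(value: str, context: str = "demo") -> str:
    """Map each character to one of the same class, driven by a hash.

    Same (value, context) always yields the same output; digits stay
    digits, letters stay letters of the same case, and everything else
    (for example '-') is copied through. Illustrative sketch only.
    """
    digest = hashlib.sha256(f"{context}:{value}".encode("utf-8")).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]  # one digest byte per position
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        else:
            out.append(ch)
    return "".join(out)


first = deterministic_mask("PA12345678")
second = deterministic_mask("PA12345678")
print(first == second)  # True: same input and context, same masked value
```

A masked license number produced this way keeps the shape of two letters followed by eight digits, but its first two letters are not guaranteed to form a real state identifier, which is the gap between preserving a format and preserving its semantics.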