The present invention relates generally to data masking, and more particularly to methods, apparatus and computer program products whereby data is masked at a data-provider computer and sent to an (untrusted) data-user computer.
Data masking is a process used when data which includes sensitive information needs to be copied to a less-trusted environment. The purpose of data masking is to de-sensitize the data, hiding (or “masking”) sensitive data items while the data as a whole remains useful for its intended purpose. For example, a data set may contain information such as social-security numbers, passport data, credit card numbers, health-record details, etc., which should not be leaked to untrusted parties. Typical application scenarios include sending out a data set for statistical analysis, running an application in a testing environment with a realistic workload, collecting service-quality information from customers, and transaction processing. Concerns about data privacy have grown in recent years along with a trend of moving services to third parties and into the cloud. Masking inhibits exposure of sensitive data in untrusted environments and addresses legal issues associated with moving data across borders.
For security, a masked data item should not reveal information about the original, unmasked data. However, usability requires that a masked data set preserve referential integrity. That is, when the same data item occurs multiple times in the unmasked data set, it should be mapped consistently to the same masked value.
Many data masking techniques have been proposed and are in commercial operation today. For example, masking can be performed via hashing. Here, a data item is hashed together with a long-term hash key of the data-provider. The data item is then replaced with the resulting hash value in the masked data sent to the user. Other known methods are based on substitution, shuffling, deletion (“nulling”), obfuscation, or perturbation techniques. However, such methods cannot provide the increasingly stringent guarantees required for data security. Moreover, these methods do not allow for re-keying of masked data. The relation between a given data item and its masked form never changes, and the keys used for masking cannot be changed. With hashing, for example, the hash key cannot be changed without breaking referential integrity, because a data item masked under a new key no longer maps to the masked value already held by the user. For security-critical environments, e.g., financial institutions, regular re-keying operations and periodic updates of the relationship between unmasked and masked data are required. Periodic updates also reduce the risk of exposure when data leaks gradually over time. As masked data sets are often large, performing such an update from scratch, with the data-provider re-masking the complete data set with a fresh key and re-sending it to the user, would be highly inefficient.
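By way of illustration only, the hashing-based masking described above might be sketched as follows. The sketch assumes HMAC-SHA256 as the keyed hash, which is one possible choice and is not specified in the foregoing; the function names, field names, keys and record values are likewise hypothetical. It shows that repeated data items map to the same masked value under one key, and that changing the key breaks the correspondence with previously masked data.

```python
import hashlib
import hmac


def mask_item(hash_key: bytes, item: str) -> str:
    """Replace a data item with a keyed hash of its value.

    HMAC-SHA256 is an assumed choice of keyed hash for this sketch.
    """
    return hmac.new(hash_key, item.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_records(hash_key: bytes, records: list, sensitive_fields: set) -> list:
    """Mask only the sensitive fields of each record; copy other fields as-is."""
    return [
        {
            field: mask_item(hash_key, value) if field in sensitive_fields else value
            for field, value in record.items()
        }
        for record in records
    ]


# Made-up sample data: the first and third records share the same identifier.
records = [
    {"ssn": "000-00-0001", "visit": "2020-01-03"},
    {"ssn": "000-00-0002", "visit": "2020-01-04"},
    {"ssn": "000-00-0001", "visit": "2020-02-11"},
]

# Hypothetical long-term hash keys of the data-provider.
old_key = b"hypothetical-long-term-key-1"
new_key = b"hypothetical-long-term-key-2"

masked_old = mask_records(old_key, records, {"ssn"})
masked_new = mask_records(new_key, records, {"ssn"})

# Same key: equal items give equal masked values (referential integrity).
same_under_one_key = masked_old[0]["ssn"] == masked_old[2]["ssn"]

# Different keys: the same item gives unrelated masked values, so data masked
# under new_key cannot be joined with data already masked under old_key.
linked_across_keys = masked_old[0]["ssn"] == masked_new[0]["ssn"]
```

In this sketch, the only way to move the user's data to the new key is to re-mask every record from the unmasked originals and re-send the result, which is the inefficiency noted above.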