Data masking is an essential component for privacy-preserving data sharing. Across various industries, sensitive data may be shared or disseminated beyond secure corporate boundaries. For purposes of illustration, such data may be related to customers, patients, suppliers, or vendors. Initiatives such as outsourcing and off-shoring have created opportunities for sensitive data to be accessed by unauthorized parties, thereby placing individuals' confidentiality at risk. In many cases, these unauthorized parties do not need to access true, actual, or accurate data values in order to properly conduct their job functions. Examples of sensitive data include, but are not limited to, names, addresses, network identifiers, social security numbers, medical information, and financial data. In an effort to ensure privacy, organizations and institutes around the world perform data masking so that sensitive values are not disclosed in the form of person-specific data. Some existing data masking solutions are tailored to operate in a single-machine (single-thread) environment, whereas other solutions are designed for a distributed environment. A useful feature in data masking is consistency, where multiple appearances of an original data value in a dataset are transformed to an identical masked data value.
Existing single-machine (single-thread), consistent data masking solutions cannot be ported to a distributed environment, which limits the scalability and usability of these schemes. Existing distributed solutions, on the other hand, provide consistency by supporting only the application of a very limited set of masking operations such as deterministic hashing, encoding, or encryption using predefined keys. Yet another conventional consistent data masking approach uses static dictionary-based mapping in a distributed computing environment, where a given original value is always masked using the same pre-determined masked value. However, all of these conventional approaches to the offering of distributed consistent data masking are highly detrimental to the usefulness and statistical value of the masked data. For example, masking an original value, such as the name of a street, by always using a pre-determined masked value of “abcd”, fails to retain any useful characteristics of the original data.
Conventional data masking techniques are often implemented independently in an ad hoc and subjective manner for each of a plurality of applications on a distributed computing system. Such an ad hoc data masking approach requires time-consuming iterative trial and error cycles that are not repeatable. Moreover, multiple subject matter experts using the aforementioned subjective data masking approach may independently develop and implement inconsistent data masking techniques on multiple interfacing applications. These multiple applications may work properly and effectively so long as the applications are operated independently of one another. However, when data is exchanged between multiple interfacing applications, data inconsistencies introduced by the inconsistent data masking techniques may cause operational and functional failures.
Conventional data masking approaches present difficulties in terms of properly and completely testing software applications. Due to the fact that conventional masking approaches simply replace sensitive data with non-intelligent and repetitive data, the masked data is not meaningful or useful. For example, assume that a conventional masking approach replaces all alphabetic characters with the letter “X” and all numeric characters with the number “9”, or replaces selected characters with other characters that are selected using a randomization scheme. Because the masked data is no longer meaningful or useful, some of the logical paths in the application cannot be tested (i.e., full functional testing is not possible), leaving the application vulnerable to error when true data values are introduced in production. Thus, there exists a need to overcome at least one of the preceding deficiencies and limitations of the related art.
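The trade-offs described above can be measured directly. The following is an added example, not part of the original text: it masks two partitions independently with the "X"/"9" replacement, a per-worker randomization scheme, and deterministic keyed hashing, then scores consistency, distinct values retained, and whether masked dates still parse. The records, the key, and the per-worker seeding are constructed assumptions.

```python
import hashlib
import hmac
import random
from datetime import datetime

# Constructed records (name, visit date); each inner list is one worker's partition.
PARTITIONS = [
    [("Alice Moore", "2021-03-14"), ("Bobby Stone", "1999-12-01")],
    [("Alice Moore", "2021-03-14"), ("Carol Dietz", "2005-07-30")],
]
KEY = b"example-predefined-key"  # assumption: one key provisioned to every worker

def mask_x9(value, _rng):
    """Replace every letter with X and every digit with 9."""
    return "".join("X" if c.isalpha() else "9" if c.isdigit() else c for c in value)

def mask_random(value, rng):
    """Replace letters and digits using the worker's own random stream."""
    return "".join(rng.choice("abcdefghij") if c.isalpha()
                   else rng.choice("0123456789") if c.isdigit() else c for c in value)

def mask_keyed(value, _rng):
    """Deterministic keyed hashing (HMAC-SHA256), truncated for readability."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def is_date(text):
    try:
        return bool(datetime.strptime(text, "%Y-%m-%d"))
    except ValueError:
        return False

def evaluate(mask):
    """Mask each partition independently and score the result."""
    seen = {}          # original name -> set of masked names observed
    valid_dates = 0    # masked dates an application could still parse
    for worker_id, rows in enumerate(PARTITIONS):
        rng = random.Random(worker_id)  # workers share no random state
        for name, date in rows:
            seen.setdefault(name, set()).add(mask(name, rng))
            valid_dates += is_date(mask(date, rng))
    return {
        "consistent": all(len(v) == 1 for v in seen.values()),
        "distinct_names": len(set().union(*seen.values())),  # 3 distinct originals
        "valid_dates": valid_dates,                          # out of 4 rows
    }

if __name__ == "__main__":
    for fn in (mask_x9, mask_random, mask_keyed):
        print(fn.__name__, evaluate(fn))
```

The "X"/"9" scheme is consistent only because it collapses all three names into one value, the randomized scheme maps the same name differently on each worker, and keyed hashing is consistent and distinct but produces dates that no date-handling code path can exercise.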