Organizations, especially large enterprises, invest heavily and continuously in preventing sensitive data from being leaked (e.g., from a production environment) in order to fulfill their responsibilities and requirements for protecting customer and internal data. For example, companies that receive personally identifiable information (PII) are obligated to safeguard such information pursuant to privacy laws and/or consumer protection laws. At the same time, there is often a need to extract live or production data into a non-production environment, such as UAT (User Acceptance Test) or SIT (System Integration Test), in order to perform meaningful testing on the application or system being developed.
The data to be extracted and/or tested often comes with different structures, formats, and constraints, and typically originates from a wide range of sources such as relational databases, data warehouses, big data platforms, and unstructured or semi-structured data files. Often, there is no simple way to securely and consistently mask sensitive data along the data path within and between applications and systems. Moreover, masking the sensitive data securely is not the only requirement; the masked output must satisfy further constraints. For example, the format of the masked data should be preserved, referential integrity of records should be maintained, and data validation rules should not be violated. In addition, it is often desirable to support multilingual masking for multi-byte characters (e.g., Chinese and Japanese characters). These are some examples of requirements that often must be fulfilled at the same time in order to generate meaningful test results based on data coming from multiple upstream sources at multiple intervals. The same applies to output data that might be consumed by downstream applications or systems.
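As an illustration only, the following minimal Python sketch shows how the requirements listed above can interact in a single masking function. It is not a method described in this document: the names `mask`, `_keystream`, and `KEY` are hypothetical, and the keyed-hash substitution used here is a simplified stand-in for a vetted format-preserving encryption scheme. The sketch assumes that format preservation means keeping length, separators, and character class, and that referential integrity can be approximated by making the mapping deterministic under a fixed key. Distinct inputs can collide on the same output, so it does not guarantee uniqueness constraints.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would fetch this from a secret store.
KEY = b"demo-key-not-for-production"

CJK_START, CJK_END = 0x4E00, 0x9FFF  # CJK Unified Ideographs block


def _keystream(key: bytes, value: str, n: int) -> list:
    """Derive n pseudo-random 16-bit integers from an HMAC over the whole value."""
    out = []
    counter = 0
    while len(out) < n:
        msg = value.encode("utf-8") + counter.to_bytes(4, "big")
        digest = hmac.new(key, msg, hashlib.sha256).digest()
        # Each SHA-256 digest yields sixteen 2-byte integers.
        for i in range(0, len(digest), 2):
            out.append(int.from_bytes(digest[i:i + 2], "big"))
        counter += 1
    return out[:n]


def mask(value: str, key: bytes = KEY) -> str:
    """Replace each character with one of the same class; keep everything else."""
    masked = []
    for ch, k in zip(value, _keystream(key, value, len(value))):
        if "0" <= ch <= "9":
            masked.append(chr(ord("0") + k % 10))
        elif "A" <= ch <= "Z":
            masked.append(chr(ord("A") + k % 26))
        elif "a" <= ch <= "z":
            masked.append(chr(ord("a") + k % 26))
        elif CJK_START <= ord(ch) <= CJK_END:
            # Multi-byte ideographs map to other ideographs in the same block.
            masked.append(chr(CJK_START + k % (CJK_END - CJK_START + 1)))
        else:
            # Separators, whitespace, and other scripts pass through unchanged.
            masked.append(ch)
    return "".join(masked)


if __name__ == "__main__":
    print(mask("123-45-6789"))
    print(mask("山田太郎"))
```

Because the output depends only on the key and the input value, the same customer identifier masked in two different tables yields the same masked value, so joins between the tables still line up. Characters outside the handled ranges (for example, Japanese kana) are left unmasked in this sketch and would need their own ranges.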
In prior data-masking approaches, separate masking algorithms are often applied individually on a field-by-field basis, and a different set of masking rules has to be defined for each masking pass in order to mask data securely and at the same time maintain the relationships among data elements. As a result, the process has been tedious, inefficient, and error-prone, which often leads to unintended data leakage.
In light of the various deficiencies and problems with existing data-masking methods, there is a need for improved techniques that can securely and reliably mask sensitive data without affecting its usefulness in application or system testing.