Organizations, especially large enterprises, make significant, ongoing investments in preventing sensitive data from being leaked or stolen, in order to meet their responsibility to protect customer and internal data. For example, companies that receive personally identifiable information (PII) are obligated to safeguard that information under privacy and/or consumer protection laws. At the same time, there is often a need to extract live or production data into a non-production environment, such as UAT (User Acceptance Test) or SIT (System Integration Test), in order to perform meaningful testing on the application or system being developed. More generally, sensitive data must be passed among different parts of an organization's IT infrastructure, and the transmission paths are not always fully secured.
The data to be extracted and/or tested often come with different structures, formats, and constraints, and typically originate from a wide range of sources: relational databases, data warehouses, big data platforms, and unstructured or semi-structured data files. Often, there is no simple way to mask sensitive data securely and consistently along the data path within and between applications and systems. Beyond masking the data securely, there are additional requirements on the masked output. For example, the format of the masked data should be preserved, referential integrity among records should be maintained, and data validation rules should not be violated. In addition, it is often desirable to support multilingual masking of multi-byte characters (e.g., Chinese and Japanese characters). These requirements often must be satisfied simultaneously in order to generate meaningful test results from data arriving from multiple upstream sources at multiple intervals, and the same applies to output data consumed by downstream applications or systems.
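By way of illustration, the following Python sketch shows how a deterministic, keyed masking function can address several of these requirements at once: the same input always yields the same masked output (preserving referential integrity across tables and extraction runs), the length and character classes of the input are preserved (format preservation), and multi-byte characters are handled by operating on Unicode code points. The sketch is hypothetical; the key, character-class tables, and function names are illustrative assumptions, not any particular product's method.

    import hmac
    import hashlib

    # Hypothetical key for illustration; a real deployment would use a managed secret.
    SECRET_KEY = b"example-masking-key"

    DIGITS = "0123456789"
    LATIN = "abcdefghijklmnopqrstuvwxyz"
    # One illustrative multi-byte range: CJK Unified Ideographs.
    CJK_START, CJK_END = 0x4E00, 0x9FFF

    def _keyed_index(value: str, position: int, modulus: int) -> int:
        # Derive a deterministic pseudo-random index from the whole input
        # value, the character position, and the secret key.
        digest = hmac.new(SECRET_KEY, f"{value}|{position}".encode("utf-8"),
                          hashlib.sha256).digest()
        return int.from_bytes(digest[:8], "big") % modulus

    def mask(value: str) -> str:
        # Mask while preserving format: digits map to digits, Latin letters
        # to Latin letters (case kept), CJK ideographs to CJK ideographs;
        # separators and punctuation pass through unchanged.
        out = []
        for i, ch in enumerate(value):
            if ch in DIGITS:
                out.append(DIGITS[_keyed_index(value, i, 10)])
            elif ch.lower() in LATIN:
                repl = LATIN[_keyed_index(value, i, 26)]
                out.append(repl.upper() if ch.isupper() else repl)
            elif CJK_START <= ord(ch) <= CJK_END:
                out.append(chr(CJK_START + _keyed_index(value, i, CJK_END - CJK_START + 1)))
            else:
                out.append(ch)
        return "".join(out)

    # Deterministic masking: the same customer ID masks identically wherever
    # it appears, so joins between masked tables still line up.
    assert mask("CUST-0042") == mask("CUST-0042")

Note that constraints such as check digits (e.g., a Luhn checksum on a card number) are not preserved by this sketch; satisfying validation rules on top of format preservation and consistency is part of what makes these requirements difficult to meet simultaneously.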
In prior data-masking approaches, separate masking algorithms are applied individually on a field-by-field basis, and a different set of masking rules has to be defined for each masking pass in order to mask data securely while maintaining the relationships among data elements. As a result, the process has been tedious, inefficient, and error-prone, and often leads to unintended data leakage.
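The failure mode can be seen in a few lines of hypothetical Python: when each masking pass carries its own independently defined rules (modeled here as separately seeded substitution tables), the same key value is masked differently in each extract, and the relationship between the extracts is lost.

    import random

    def masker_for_pass(seed: int):
        # Each masking pass carries its own rule configuration; here, its
        # own independently shuffled substitution table for digits.
        rng = random.Random(seed)
        table = list("0123456789")
        rng.shuffle(table)
        return lambda value: "".join(
            table[int(c)] if c in "0123456789" else c for c in value)

    mask_customers = masker_for_pass(seed=1)  # rules defined for the customers extract
    mask_orders = masker_for_pass(seed=2)     # rules defined separately for the orders extract

    # The same foreign key is masked under two independently defined rule
    # sets, so the two masked values generally differ and the join between
    # the extracts is silently broken.
    print(mask_customers("4711"), mask_orders("4711"))

Keeping every pass's rule set exactly synchronized by hand is precisely the tedious, error-prone work described above, and a rule that is accidentally weakened in one pass can expose the underlying data.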
Furthermore, prior tokenization-based data protection approaches tend to rely on large token databases, which impose significant overhead on the underlying systems and/or applications. Those prior solutions are typically costly to implement, neither scalable nor cloud-native, and unable to support multiple types of applications at the same time.
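A minimal, hypothetical sketch of vault-based tokenization makes the overhead concrete: every distinct sensitive value requires a stored mapping, so the token database grows linearly with the protected data, and every detokenization requires a lookup against it. The class and names below are illustrative only.

    import secrets

    class TokenVault:
        # Minimal vault-based tokenizer: a bidirectional lookup table that
        # must be persisted, replicated, and queried for every distinct value.

        def __init__(self):
            self._to_token = {}
            self._to_value = {}

        def tokenize(self, value: str) -> str:
            if value not in self._to_token:
                token = secrets.token_hex(8)  # opaque surrogate; collision handling omitted
                self._to_token[value] = token
                self._to_value[token] = value
            return self._to_token[value]

        def detokenize(self, token: str) -> str:
            return self._to_value[token]  # every reversal is a vault lookup

    vault = TokenVault()
    for i in range(100_000):
        vault.tokenize(f"SSN-{i:09d}")
    # One stored mapping per distinct value: the vault grows linearly with
    # the protected data set itself.
    assert len(vault._to_token) == 100_000

Because this per-value state must be shared consistently by every application that tokenizes or detokenizes, the vault becomes a central bottleneck, which is one way the overhead and scalability problems noted above manifest.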
In light of these deficiencies in existing data protection methods, there is a need for improved techniques that can protect sensitive data securely, reliably, and efficiently.