Enterprises typically rely on large databases of privacy-sensitive confidential information (financial, consumer, employee, etc.) for their key operational processes. Outside the production environment, similar databases are also needed for a number of other purposes, such as software application development, testing, training, demonstrations, data mining and research. While production environments are usually highly protected with extensive security measures (firewalls, passwords, encryption, etc.), non-production environments are often less secure. Accordingly, they are vulnerable to data theft and unnecessary disclosure of confidential data, especially in cases where companies simply use a copy of the original database without any security or privacy protection.
As such, data masking has been developed to provide desensitized (i.e. “masked”) data for use in non-production environments, so that activities such as software development and testing, employee training, demonstrations, data mining, research, and trapping potential data thieves using masked data as a “honey-pot” can be performed without the risk of exposing privacy-sensitive information. However, as highlighted below, certain data masking solutions have significant limitations.
For example, one data masking approach consists of using data generators to generate scrambled characters, such as replacing the name ‘David Smith’ with the sequence ‘kajgt 48hgaso’. This replacement sequence is certainly desensitized, but does not have the same properties as the original name (e.g. it may use a different character set, lack a capitalized first letter, differ in length, or fail to preserve hyphenation).
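By way of illustration only, the scrambled character approach can be sketched in Python as follows. The helper names `scramble` and `looks_like_name` are hypothetical and are not taken from any particular masking product; the alphabet and the length range are likewise assumptions chosen for the example. The sketch shows that the output is desensitized but fails a simple check that real names pass.

```python
import random
import string


def scramble(value: str, rng: random.Random) -> str:
    """Replace a value with random lowercase letters and digits.

    Hypothetical helper: the alphabet and length range are assumptions.
    """
    alphabet = string.ascii_lowercase + string.digits
    # The length is drawn independently of the input, so it is not preserved.
    length = rng.randint(5, 15)
    return "".join(rng.choice(alphabet) for _ in range(length))


def looks_like_name(value: str) -> bool:
    """Crude format check: every space- or hyphen-separated part
    is alphabetic and starts with a capital letter."""
    parts = value.replace("-", " ").split()
    return bool(parts) and all(p.isalpha() and p[0].isupper() for p in parts)


rng = random.Random(0)  # seeded only to make the example repeatable
masked = scramble("David Smith", rng)
original_ok = looks_like_name("David Smith")
masked_ok = looks_like_name(masked)
```

Here `original_ok` is true while `masked_ok` is false, because the scrambled value never begins with a capital letter. An application that validates the format of a name field could therefore reject such masked data.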
Another example data masking approach involves the use of finite sets (i.e. pre-determined lists of values that will be used in place of the confidential data). This approach provides more realistic results than the scrambled character generation approach discussed above, but the validity and completeness of the resultant masked database are limited by the contents of the lists. This approach is also limited in terms of localization (e.g. the customer might have a Spanish name but the finite set only contains English names).
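The finite set approach can be sketched in a similar manner. In the following Python example, the list `REPLACEMENT_NAMES` and the function `mask_from_finite_set` are hypothetical, and selecting the replacement by hashing the original value is merely one assumed way of choosing from the list. The sketch illustrates both limitations noted above: distinct originals collide once there are more of them than list entries, and a Spanish name receives an English replacement.

```python
import hashlib

# Hypothetical finite set: a small, English-only list of replacement names.
REPLACEMENT_NAMES = ("John Miller", "Mary Johnson", "Peter Brown", "Susan Clark")


def mask_from_finite_set(value: str, choices=REPLACEMENT_NAMES) -> str:
    """Pick a replacement from the fixed list.

    Assumption: the index is derived from a hash of the original value,
    so the same input always maps to the same replacement.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(choices)
    return choices[index]


originals = ["David Smith", "María García", "José Hernández",
             "Anne-Marie Dupont", "Li Wei"]
masked = [mask_from_finite_set(name) for name in originals]
```

Because the list holds four entries and five originals are masked, at least two of the originals necessarily share a replacement, and every replacement is an English name regardless of the locale of the original.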
Yet another example data masking approach arises when enterprises attempt to mask data themselves (other than by generating or using test sets, as discussed above), following a very labor-intensive process of identifying all relevant fields in the original database and manually assigning new values or basic transformations to them. However, the resultant database is extremely costly to produce and maintain.
Accordingly, there is a need to better protect confidential and other privacy-sensitive information as it is stored and used within the enterprise for such non-production purposes as application development, testing, training, demonstrations, research and data mining, while maintaining the structure and form of the protected information.