Data masking is an essential requirement in areas where sensitive customer information must be protected from unauthorized access. In data masking, sensitive customer data is replaced with fictitious values before being shared for testing activities. At the same time, the masked output should maintain the data variations, data distribution, data privacy, look and feel of the original data, data integrity, and data consistency required for flawless data testing. Data may also contain phonetically similar words, which sound the same but are spelt differently. For example, multiple variations of a person's name are often observed in such data because the data is entered from multiple sources within an enterprise.
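To illustrate how spelling variants can be recognised as phonetically similar, the following is a minimal sketch that assumes American Soundex as the phonetic key. Soundex is used here only because it is short; it is not the metaphone encoding referred to in the discussion of prior art, and the function names `soundex` and `group_by_sound` are hypothetical.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, y, h and w have no digit.
_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name: str) -> str:
    """Return the 4-character American Soundex key of a name (assumed key function)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated digit is kept after it
    return (result + "000")[:4]  # pad with zeros, then truncate to 4 characters


def group_by_sound(names):
    """Group spellings that share the same phonetic key."""
    groups = defaultdict(list)
    for n in names:
        groups[soundex(n)].append(n)
    return dict(groups)


print(soundex("Stephen"), soundex("Steven"))  # S315 S315
print(group_by_sound(["Stephen", "Steven", "Smith", "Smyth"]))
```

Under this assumption, "Stephen" and "Steven" fall into one group and "Smith" and "Smyth" into another. Soundex keeps the first letter, so variants such as "Catherine" and "Katherine" would not be grouped; a metaphone-style key is generally more tolerant of such cases.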
A majority of existing solutions rely on masking variations in input data with altogether different names. As a result, the masked output is unique, random, or consistent. Some of the prior art literature vaguely describes a masking system that masks phonetically similar data by replacing all variants of the input data with the same masked value; the variance of the original production data is thereby removed, which changes the data distribution after masking. However, prior art literature has never considered the variants of input data as part of a single group of phonetically similar values. In addition, prior art literature has never explored masked output that maintains the look and feel of the original data, including data variation, by distributing the dataset values according to their phonetic properties.
In addition, the prior art literature requires maintaining a list of words and their phonetic equivalents; thus, for any new data, the mapping has to be added to the list before the data is processed. However, prior art literature has never explored eliminating the need to maintain a map of the original data and its metaphone, wherein the metaphones are generated at runtime, thus removing the possibility of backward traceability to the original data. Prior art literature also restricts existing data masking systems to being executed only on file and voice data. However, prior art literature has never explored extending the same to different data sources, such as RDBMS databases, mainframe files, common files, log files, PDF, DOC, DOCX, etc. Prior art literature is also silent on combining metaphone generation with masking of phonetically similar words to maintain the integrity and consistency of data enterprise-wide.
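The idea of generating the phonetic key at runtime, without a stored map from original to masked values, can be sketched as follows. This is only an illustration under stated assumptions, not the method of any particular system: Soundex stands in for the metaphone, and `SECRET`, `MASK_GROUPS`, `_digest`, and `mask_name` are hypothetical names. A keyed hash of the phonetic key selects a group of fictitious names, and a keyed hash of the original spelling selects a variant inside that group.

```python
import hashlib
import hmac

SECRET = b"demo-key"  # hypothetical masking key; would be managed securely in practice

# Hypothetical replacement pool: each inner list holds similar-sounding fictitious names.
MASK_GROUPS = [
    ["Jon", "John", "Jhon"],
    ["Sara", "Sarah", "Sarra"],
    ["Alan", "Allan", "Allen"],
    ["Karl", "Carl", "Karel"],
]

_CODES = {ch: str(d)
          for d, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
          for ch in letters}


def soundex(name: str) -> str:
    """Phonetic key computed at runtime (Soundex assumed in place of a metaphone)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result, prev = letters[0].upper(), _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def _digest(text: str) -> int:
    """Keyed one-way hash of the text, reduced to an integer."""
    mac = hmac.new(SECRET, text.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(mac[:8], "big")


def mask_name(name: str) -> str:
    """Mask a name deterministically; nothing about the original is stored."""
    key = soundex(name)
    # All spellings with the same phonetic key land in the same masked group.
    group = MASK_GROUPS[_digest("group:" + key) % len(MASK_GROUPS)]
    # The exact spelling picks the variant, so input variation can survive masking.
    return group[_digest("variant:" + name.strip().lower()) % len(group)]


for original in ["Stephen", "Steven", "Stephen"]:
    print(original, "->", mask_name(original))
```

In this sketch the same input always yields the same masked value, and phonetically similar inputs are drawn from one masked group. Two different spellings may still collide on the same variant, because the variant index is only a hash modulo the group size; a fuller design would need a larger pool and a collision policy.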
Prior art literature has illustrated various data masking tools and techniques; however, generating a group of phonetically similar masked data, wherein the masked output maintains the look and feel of the original data, including data variation, is still considered one of the biggest challenges of the technical domain.