Many companies maintain databases that include customer or employee information. The information may comprise names, addresses, phone numbers, social security numbers, company names, salaries, and purchase histories. For example, an internet sales company may have a customer database which includes the names, phone numbers, payment methods, and purchase history of customers. In another example, a payroll department may have salary information regarding its employees. Due to the sensitive nature of some of this information, such as payment methods, social security numbers, and salaries, access is typically restricted to a relatively small group within the company.
As is common with software applications, problems may arise that require troubleshooting by computer programmers. When problems occur with software applications that operate on a database having sensitive information, programmers may need to access the sensitive database to troubleshoot the problem. This may lead to sensitive information being viewed by people who do not normally have access to the information. In the payroll example, disclosure of salary information may cause internal problems in the company regarding salary discrepancies. In the internet sales example, disclosure of payment methods and other personal information such as social security numbers may lead to identity theft. However, to efficiently troubleshoot the malfunctioning software application, programmers need to access the actual data, and, in particular, the actual data distribution (geographic distribution, name distribution, and so on).
It is known in the art to obfuscate databases through random data substitution, thereby generating a test database. However, random data substitution does not reproduce the data distributions found in actual databases. A method and system are needed to obfuscate at least portions of databases to produce test databases with data distributions that mirror distributions found in actual databases.
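The difference between the two kinds of substitution can be illustrated with a minimal sketch. It is an illustration only, not the method or system referred to above: the function names `random_substitution` and `shuffle_substitution`, the city values, and the 70/25/5 split are all hypothetical, and permuting a column across records is merely one simple technique assumed here because it keeps a column's value frequencies intact.

```python
import random
from collections import Counter


def random_substitution(values, pool, rng):
    """Replace every value with a uniformly random draw from `pool`.

    The original frequencies are discarded, so the result tends toward
    a uniform distribution over the pool.
    """
    return [rng.choice(pool) for _ in values]


def shuffle_substitution(values, rng):
    """Permute the values across records (hypothetical alternative).

    Each record receives a value taken from some other position in the
    column, so the column's frequency counts are unchanged.
    """
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed so repeated runs match

# Hypothetical skewed "city" column: most customers live in one city.
cities = ["Chicago"] * 70 + ["Austin"] * 25 + ["Boise"] * 5
pool = ["Chicago", "Austin", "Boise"]

randomized = random_substitution(cities, pool, rng)
shuffled = shuffle_substitution(cities, rng)

print("original  :", Counter(cities))
print("randomized:", Counter(randomized))
print("shuffled  :", Counter(shuffled))
```

In this sketch the shuffled column always has the same counts as the original, while the randomly substituted column generally does not. Permutation alone is a weak form of obfuscation, since rare values still appear in the output; it is used here only to make the distribution argument concrete.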