The problem of securing organizational databases so that legitimate users can access data needed for decision making, while limiting disclosure so that confidential or sensitive information about a single record cannot be inferred, has received considerable attention in the statistical literature.
Data are numbers, characters, images, or other outputs from devices that convert physical quantities into symbols. Data can be stored on media, as in databases; numbers can also be converted into graphs and charts, which in turn can be stored on media or printed. Most decision making, in business or other disciplines, requires useful data. A database that contains sensitive or confidential information is therefore secured: by encryption using a public key; by encryption using a changing public key, in which case the data is held secure while the public key is changed; or by having the operating system restrict access to it.
A database (DB) application must protect the confidentiality of sensitive data and must also provide reasonably accurate aggregates that can be used for decision making. One approach to achieving this goal is to use a statistical database system (SDB). An SDB allows users to access aggregates for subsets of records; the database administrator (DBA) sets a minimum threshold rule on the size of the subset for which aggregates can be accessed. As an example, in an SDB, if a query returns 89 or fewer records, then no information is provided to the user for that query.
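The minimum threshold rule described above can be sketched as follows. This is a minimal illustration only, not the interface of any particular SDB product: the function name `restricted_mean`, the record layout, and the constant `MIN_QUERY_SET_SIZE` are hypothetical, and the threshold of 90 is chosen simply to match the 89-record example.

```python
from statistics import mean

# Hypothetical threshold: queries matching 89 or fewer records are refused.
MIN_QUERY_SET_SIZE = 90


def restricted_mean(records, predicate, field, min_size=MIN_QUERY_SET_SIZE):
    """Return the mean of `field` over the records matching `predicate`,
    or None when the matching subset is smaller than `min_size`."""
    subset = [r[field] for r in records if predicate(r)]
    if len(subset) < min_size:
        return None  # query refused: subset too small to release an aggregate
    return mean(subset)


# Illustrative data: 100 records in department "A", 50 in department "B".
records = [
    {"dept": "A" if i < 100 else "B", "salary": 50_000 + i}
    for i in range(150)
]

large = restricted_mean(records, lambda r: r["dept"] == "A", "salary")
small = restricted_mean(records, lambda r: r["dept"] == "B", "salary")
```

Here the query over department "A" matches 100 records and is answered, while the query over department "B" matches only 50 records and is refused.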
A database that obfuscates data is conventionally known as a secret database. A secret database is ideally efficient (stores the data in an efficient manner with minimal overhead), provides a query language (e.g., SQL) interface, is repeatable, i.e., it returns identical results for identical queries, and protects the confidentiality of individual records.
A secret database may be implemented in a parallel fashion, for example as a parallel set of query pre-filters and post-filters. These filters may be implemented as distributed hardware components, so that the obfuscation can be built to handle very large databases and to run queries against the database in a distributed way.
There are a number of known techniques to obfuscate data. In controlled rounding, the cell entries of a two-way table are rounded in such a way that the rounded arrays are forced to be additive along rows and columns and to the grand total (Cox and Ernst, 1982; Cox, 1987). In random rounding, cell values are rounded up or down in a random fashion; the rows (or columns) may not add up to the corresponding marginal totals. Salazar-Gonzales and Schoch (2004) developed a controlled rounding procedure for two-way tables based upon the integer linear programming algorithm. Gonzales and Cox (2005) developed software for protecting tabular data in two dimensions; this software uses the linear programming algorithm and implements several techniques for protection of tabular data: complementary cell suppression, minimum-distance controlled rounding, unbiased controlled rounding, subtotals constrained controlled rounding, and controlled tabular adjustment.
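Random rounding, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited author: it assumes non-negative integer cell counts, a rounding base of 5, and the common unbiased variant in which a cell is rounded up with probability equal to its remainder divided by the base. The function names `random_round` and `round_table` are hypothetical.

```python
import random


def random_round(value, base, rng):
    """Randomly round a non-negative integer to an adjacent multiple of `base`.

    Assumed unbiased variant: round up with probability remainder/base,
    so the expected rounded value equals the original value.
    """
    lower = (value // base) * base
    remainder = value - lower
    if remainder == 0:
        return value  # already a multiple of the base; left unchanged
    return lower + base if rng.random() < remainder / base else lower


def round_table(table, base=5, seed=None):
    """Apply random rounding independently to every cell of a two-way table."""
    rng = random.Random(seed)
    return [[random_round(v, base, rng) for v in row] for row in table]


table = [[12, 7, 20], [3, 15, 9]]
rounded = round_table(table, base=5, seed=0)

# Because each cell is rounded independently, the sum of the rounded cells in
# a row need not equal the rounded version of the original row total.
rounded_row_sums = [sum(row) for row in rounded]
original_row_sums = [sum(row) for row in table]
```

Controlled rounding differs in that the rounded cells are chosen jointly, typically by solving an optimization problem, so that rows, columns, and the grand total remain additive.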
Cox (1980) considered the problem of statistical disclosure control for aggregates or tabulation cells and discussed cell suppression methodology, under which all cells containing sensitive information are suppressed from publication. Duncan and Lambert (1986) used Bayesian predictive posterior distributions for the assessment of disclosure of individual information, given aggregate data. Duncan and Mukherjee (2000) considered combining query restriction and data obfuscation to thwart attacks by data snoopers. Chowdhury et al. (1999) developed two new matrix operators for confidentiality protection.
Franconi and Stander (2002) proposed methods for obfuscating business microdata based upon the method of multiple linear regression (MLR); their method consists of fitting an MLR equation to one variable based upon the values of the other variables in the database, using all of the records in the database. There are three potential problems with this approach: (1) the fitted MLR equations may provide poor prediction, which in turn will lead to very poor resolution in the obscured values; (2) the amount of obfuscation cannot be controlled; and (3) since databases are typically very large, this method may not be computationally efficient.
Muralidhar et al. (1999) developed a method for obfuscating multivariate data by adding random noise to the data values; their method preserves the relationships among the variables, and the user is given access to the perturbed data. One potential problem with this approach is that a query may not be repeatable, i.e., identical queries may produce different outputs, in which case one can get very close to the true values by running identical queries a large number of times and averaging the results.
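The weakness of non-repeatable queries can be illustrated with a short simulation. This is a sketch under simplifying assumptions, not a model of the Muralidhar et al. method: it assumes each query returns the true value plus fresh, independent, zero-mean Gaussian noise, and the names `noisy_query` and `averaging_attack`, the salary figure, and the noise level are all hypothetical.

```python
import random
from statistics import mean


def noisy_query(true_value, noise_sd, rng):
    """Answer a query with fresh zero-mean Gaussian noise on every call."""
    return true_value + rng.gauss(0.0, noise_sd)


def averaging_attack(true_value, noise_sd, n_queries, seed=0):
    """Repeat the identical query n_queries times and average the answers."""
    rng = random.Random(seed)
    return mean(noisy_query(true_value, noise_sd, rng) for _ in range(n_queries))


true_salary = 72_000.0   # confidential value (illustrative)
noise_sd = 5_000.0       # standard deviation of the added noise (illustrative)

# A single answer is typically off by thousands; the average of 10,000
# answers has a standard error of noise_sd / sqrt(10_000) = 50.
estimate = averaging_attack(true_salary, noise_sd, n_queries=10_000)
error = abs(estimate - true_salary)
```

Because the standard error of the average shrinks in proportion to the square root of the number of queries, a snooper who can repeat a query freely can make the estimate as accurate as desired. A repeatable query, which returns the same perturbed answer every time, removes this avenue.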
Thus, it is desirable to provide a system and method for obfuscating data so that a data request or query is repeatable, and so that users are allowed access to the data while the disclosure of confidential information about an individual is limited.