Protecting confidential numerical data from disclosure is an important aspect of security. Such data was once the purview of data gathering and disseminating organizations, such as the U.S. Census Bureau. However, with recent advances in information technology, more organizations are storing extensive amounts of data for analysis using sophisticated tools and techniques. While such analyses are of tremendous value to the organizations and individuals making use of the data, there is also a risk that the analyses may result in disclosure of confidential information. Consequently, the need to protect confidential data from disclosure, while still allowing dissemination of at least a portion thereof, has grown.
Disclosure to unauthorized users can be prevented by passwords, firewalls, and the like. However, authorized users must be provided access to the data as part of their authorization, so that they can make use thereof. There remains a risk that authorized users will use that access for illegitimate purposes. Such users are often referred to as “snoopers” or “data spies.” It is almost impossible to identify a user a priori as a snooper. The challenge is then to provide users with the access to data required to perform legitimate tasks, while still preventing access to confidential information. Restricting access in this way is a far more difficult problem than the relatively straightforward task of preventing access by unauthorized users.
A variety of disclosure limitation techniques are known, and can be broadly classified as masking techniques and query restriction techniques. Masking techniques modify the original data. Users are provided either complete or restricted access to the masked data, and no access to the original data. The performance of data masking methods is evaluated based on the extent to which they satisfy the needs of the legitimate user while preventing disclosure of confidential information to snoopers. Disclosure may occur when the identity of an individual, the exact value of a confidential attribute, or both are disclosed as the result of a query. Disclosure may also occur when sufficient information is provided to allow a user to infer the identity of an individual, the exact value of a confidential attribute, or both with a greater degree of accuracy than would be possible without access to the data. In the strictest sense, disclosure may be said to have occurred if providing access to data allows the snooper to gain any knowledge regarding confidential information. Accordingly, an optimal disclosure limitation technique must provide legitimate users with unrestricted access to accurate data, while at the same time providing the user with no additional knowledge regarding any portion of the data deemed confidential.
Data masking techniques are known in the art. Three of the most widely used conventional procedures (perturbation, imputation, and the post-randomization method, or PRAM) rely on denying the user access to the “true” values of confidential attributes. These techniques either modify the true values (perturbation and PRAM) or provide simulated or synthetic data in place of the true values (imputation). These methods are generally effective for their intended purpose. However, acceptance by the user is a significant concern. Because the data provided to the user has been altered from its original form, the user may be reluctant to accept the data, and to trust any results or analyses derived therefrom.
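By way of illustration only, the idea of perturbation can be sketched in Python as the addition of random noise to each confidential value. The function name `perturb`, the noise standard deviation, and the sample income values are assumptions made for this sketch. Additive Gaussian noise is only one simple form of perturbation and is not presented as any particular conventional procedure.

```python
import random

def perturb(values, noise_sd, seed=0):
    """Return a masked copy of `values` with zero-mean Gaussian noise added.

    Hypothetical helper for illustration: `noise_sd` controls how far each
    released value may drift from its true value.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [v + rng.gauss(0.0, noise_sd) for v in values]

# Hypothetical confidential attribute (e.g., income); not real data.
incomes = [42000.0, 58500.0, 61000.0, 75250.0, 120000.0]
masked = perturb(incomes, noise_sd=5000.0)

# The user sees only `masked`; none of the true values is released.
for true_value, released in zip(incomes, masked):
    print(f"true={true_value:>10.2f}  released={released:>10.2f}")
```

Every released value differs from the true value, which illustrates why a user may hesitate to trust analyses performed on data masked in this manner.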
A fourth method of data masking, data swapping, provides the advantage that users are allowed access to the original, true values of the confidential attributes. Masking is achieved by exchanging the values of attributes between different records, whereby a given value of a confidential attribute does not necessarily belong to the record with which it is associated after swapping. The user is more easily able to understand the process, and acceptance of the data may be higher. Unfortunately, simple data swapping is primarily based on the concept of data tables, and does not directly address the issue of continuous, numerical confidential attributes. For such attributes, existing data swapping methods are primarily heuristic procedures. Data utility is poor, since all relationships between the variables are modified. Further, disclosure risk is high.
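The effect of simple swapping on data utility can be illustrated with a minimal Python sketch. The sketch assumes the simplest possible swap, a random permutation of one confidential attribute across all records, and uses hypothetical names (`swap_confidential`, `pearson`) and synthetic `age` and `income` values. It is not a reproduction of any specific existing swapping procedure.

```python
import random

def swap_confidential(records, attr, seed=0):
    """Return copies of `records` with the values of `attr` randomly
    exchanged between records (a plain random permutation)."""
    rng = random.Random(seed)
    values = [r[attr] for r in records]
    rng.shuffle(values)
    return [{**r, attr: v} for r, v in zip(records, values)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Synthetic records: `age` is non-confidential, `income` is confidential
# and strongly related to age by construction.
data_rng = random.Random(1)
records = [{"age": a, "income": 1000.0 * a + data_rng.gauss(0, 2000)}
           for a in range(20, 70)]
masked = swap_confidential(records, "income")

ages = [r["age"] for r in records]
r_before = pearson(ages, [r["income"] for r in records])
r_after = pearson(ages, [r["income"] for r in masked])

# The set of income values is unchanged, but its link to age is not.
print(f"correlation before swap: {r_before:.3f}")
print(f"correlation after swap:  {r_after:.3f}")
```

In this sketch the masked data contains exactly the original income values, so the univariate distribution is preserved, while the relationship between age and income is largely lost.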
Accordingly, a need is identified for an improved method for data masking. The method should minimize disclosure risk, while maximizing user comfort with the data accessed. The method should produce masked data having the same characteristics as the original data, including the same univariate characteristics, the same relationships between confidential variables, and the same relationships between non-confidential variables. Access to the confidential variables should provide the user with no additional information, and should minimize the risk of actual or inferential disclosure.