In recent years, in various services, privacy information related to individuals has been stored in information processing devices. Such privacy information includes, for example, purchase information or medical treatment information of individuals. For example, a medical receipt (from the German “Rezept”), which is a bill for claiming medical care fees, is a data set composed of records with attributes on a patient and his or her treatment (for example, year of birth, gender, disease name, and drug name), and is stored in an information processing device.
From the viewpoint of privacy protection, it is not desirable that such privacy information be disclosed or used with its original contents unchanged.
Attributes which characterize an individual and which, in combination, may make it possible to identify that individual, such as year of birth and gender, are referred to as “quasi-identifiers”. Attributes which an individual does not want others to know, such as disease name and drug name, are referred to as “sensitive attributes” (sensitive information: Sensitive Attribute (SA) or Sensitive Value (SV)).
An attribute which takes a single value, such as year of birth or gender, is referred to as a “single-valued attribute”.
An attribute which may take a single value or a plurality of values (a set of values), such as disease name or drug name, is referred to as a “set-valued attribute”.
A data set including privacy information is information whose secondary use is significantly beneficial, provided that there is no concern of privacy invasion. Secondary use means that privacy information is provided to a third party other than the service provider who generates and stores the privacy information, and that the third party uses the provided information. Alternatively, secondary use means that a service provider provides privacy information to a third party and outsources work, such as analysis, to the third party.
Secondary use of privacy information promotes analysis and research of the privacy information and makes it possible to enhance services by using the analysis and research results. Furthermore, secondary use of privacy information makes it possible for a third party to enjoy the significant benefit which the privacy information holds.
For example, suppose that a pharmaceutical company is such a third party. A pharmaceutical company is able to analyze co-occurrence relations or correlations of pharmaceuticals based on treatment information. However, it is difficult for a pharmaceutical company to obtain treatment information. If a pharmaceutical company were able to obtain treatment information, it would be able to know how pharmaceuticals are used and to further analyze the effectiveness of those pharmaceuticals.
However, active secondary use of data sets including privacy information has not been carried out, due to concern about privacy invasion.
For example, it is assumed that a data set composed of records, each including a user identifier (user ID) which uniquely identifies a service user and one or more pieces of sensitive information, is stored in an information processing device of a service provider. If the user identifiers and the sensitive information are provided to a third party, the third party is able, by using a user identifier, to identify the service user to whom the sensitive information relates. Therefore, a problem of privacy invasion may occur.
Consider a case in which, in a data set composed of a plurality of records, one or more quasi-identifiers are given to each record. In this case, there is a possibility that the individual to whom the data relate can be identified based on a combination of the quasi-identifiers. In other words, when an individual can be identified based on a combination of quasi-identifiers even in a data set from which user identifiers have been removed, privacy invasion may occur.
As a technology for converting a data set including privacy information into a form in which privacy is protected while usefulness is maintained, anonymization is known.
In relation to anonymization, “k-anonymity”, which is one of the most well-known anonymity indices, has been proposed (for example, refer to NPL 1). A technique that makes an anonymization target data set satisfy k-anonymity is referred to as “k-anonymization”. The k-anonymization converts target quasi-identifiers so that at least k records having the same quasi-identifier values exist in the anonymization target data set. As conversion processes, for example, “generalization” and “suppression” are known. Generalization is a process of converting original information into abstracted information. Suppression is a process of removing original information.
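To make the two conversion processes concrete, the following Python sketch checks k-anonymity over a set of quasi-identifiers and applies generalization and suppression; the attribute names and the decade-level generalization hierarchy are illustrative assumptions, not taken from the cited literature.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Check whether every combination of quasi-identifier values
    appears in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

def generalize_year(record):
    """Generalization: abstract an exact birth year to its decade."""
    r = dict(record)
    r["birth_year"] = (r["birth_year"] // 10) * 10
    return r

def suppress(record, attr):
    """Suppression: remove (mask) an attribute value entirely."""
    r = dict(record)
    r[attr] = "*"
    return r

records = [
    {"birth_year": 1981, "gender": "F"},
    {"birth_year": 1984, "gender": "F"},
    {"birth_year": 1983, "gender": "M"},
    {"birth_year": 1988, "gender": "M"},
]

# The raw quasi-identifier tuples are unique, so even k = 2 fails.
assert not is_k_anonymous(records, ["birth_year", "gender"], 2)

# Generalizing years to decades and suppressing gender makes all four
# quasi-identifier tuples identical, satisfying k = 2 (indeed k = 4).
anonymized = [suppress(generalize_year(r), "gender") for r in records]
assert is_k_anonymous(anonymized, ["birth_year", "gender"], 2)
```

In practice the choice of which attributes to generalize or suppress, and to what degree, is what the cited techniques optimize against information loss.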
As a related technology which uses the k-anonymization technique, a technology has been proposed which encrypts and stores data received from a user terminal, converts the decrypted data so as to satisfy k-anonymity, and transmits the converted data to a server of a service provider (for example, refer to PLT 1).
As another technology using the k-anonymization technique, a method which uses sets of records including similar attribute values (hereinafter referred to as “clusters”) has been proposed (for example, refer to PLT 2 and NPL 2). This method successively generates clusters of records with similar attribute values and, for the records included in each cluster, generates common attribute values by generalization or suppression.
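A minimal sketch of such a cluster-based approach might look as follows; the greedy ordering-based clustering and the range generalization shown here are simplifying assumptions, not the exact algorithms of PLT 2 or NPL 2.

```python
def cluster_anonymize(records, key, k):
    """Cluster-based k-anonymization (greedy sketch): sort records by a
    numeric quasi-identifier, cut the sorted list into clusters of at
    least k records, and replace each member's value with a common
    generalized value (here, the cluster's value range)."""
    ordered = sorted(records, key=lambda r: r[key])
    clusters, i = [], 0
    while i < len(ordered):
        # Take k records, or everything left if fewer than 2k remain,
        # so that no cluster ends up smaller than k.
        j = len(ordered) if len(ordered) - i < 2 * k else i + k
        clusters.append(ordered[i:j])
        i = j
    out = []
    for cluster in clusters:
        common = f"{cluster[0][key]}-{cluster[-1][key]}"  # generalized value
        out.extend({**r, key: common} for r in cluster)
    return out

records = [{"birth_year": y} for y in [1981, 1983, 1990, 1994, 1997]]
anonymized = cluster_anonymize(records, "birth_year", 2)
# Every generalized value is now shared by at least k = 2 records.
```

Because each cluster is generalized only to its own common value, clustering tends to lose less information than generalizing the whole data set uniformly.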
The related technologies disclosed in the above-described PLT 1, PLT 2, and NPL 2 apply k-anonymization to single-valued attributes.
However, there are cases in which, in a data set composed of a plurality of records, an individual can be identified based on a combination of the pieces of sensitive information given to the respective records. That is, an individual may be identifiable based on a combination of sensitive information even in a data set in which the user identifiers have been removed and the quasi-identifiers have been anonymized. Thus, privacy invasion can also occur through a combination of sensitive information. As described above, a sensitive attribute may become a cause of individual identification, as with a quasi-identifier. Therefore, it is also necessary to handle sensitive attributes in a manner similar to quasi-identifiers.
However, if all sensitive information is removed from a data set, information loss is caused, and as a result the benefit of the data set including privacy information is lost. For example, when a data set of treatment information from which all sensitive information has been removed is used, it is difficult to carry out an analysis of the correlation and co-occurrence between one disease and another.
Thus, anonymization technologies for set-valued attributes representing such sensitive information have been proposed (for example, refer to NPLs 3 to 6).
For example, a related technology described in NPL 3 carries out “local generalization” of items so that the number of records associated with any combination of items (attribute values) included in the sensitive information becomes k or greater. Local generalization here is a method which adjusts the degree of generalization required for k-anonymization for each individual record, and can thereby reduce the degree of generalization (information loss). This related technology requires a taxonomy for generalization. Further, this related technology has a problem in that unevenness in generalization is generated, such that a certain attribute value is generalized to different values depending on the record, which makes aggregation difficult.
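The per-record nature of local generalization, and the unevenness it can produce, can be illustrated with the following single-pass sketch; the two-level taxonomy, the attribute names, and the simplified per-record criterion are hypothetical, and a real method would iterate until every group reaches size k.

```python
from collections import Counter

# Hypothetical two-level taxonomy: item -> parent category.
TAXONOMY = {"A": "nervous system", "B": "nervous system"}

def local_generalize(records, attr, k):
    """Local generalization (single-pass sketch): the decision to
    generalize is made per record. Only records whose item set is
    shared by fewer than k records have their items lifted to the
    taxonomy parents, so the same original item may be generalized
    in one record yet left intact in another."""
    counts = Counter(frozenset(r[attr]) for r in records)
    out = []
    for r in records:
        if counts[frozenset(r[attr])] >= k:
            out.append(r)                          # already k-anonymous
        else:
            lifted = {TAXONOMY.get(v, v) for v in r[attr]}
            out.append({**r, attr: lifted})        # generalize this record only
    return out

records = [{"disease": {"A"}}, {"disease": {"A"}}, {"disease": {"B"}}]
result = local_generalize(records, "disease", 2)
# "A" survives unchanged in two records, while the lone "B" is lifted to
# "nervous system": the per-record unevenness described above.
```

The example also shows why aggregation becomes difficult: after anonymization, occurrences of the same original category appear under different labels depending on the record.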
A related technology described in NPL 4 carries out “global generalization (global recoding)” of items so that the number of records associated with any combination of items included in the sensitive information becomes k or greater. Global generalization here is a method which determines to what value a certain attribute value is generalized by considering k-anonymity and the information loss of the data set as a whole. For example, it is assumed that the taxonomy illustrated in FIG. 14 exists for the values taken by a set-valued attribute called disease name. When, in order to satisfy the desired anonymity, a value “A” included in the disease name attribute of a record is required to be generalized to “nervous system”, this related technology generalizes all disease names “A” in the data set to “nervous system”. As a result, this related technology has a problem in that the information loss of attribute values becomes excessively large.
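The uniform, data-set-wide effect of global recoding can be sketched as follows; the taxonomy is a hypothetical one in the spirit of FIG. 14, with placeholder disease names.

```python
# Hypothetical taxonomy in the spirit of FIG. 14: each disease name
# maps to its parent category; the names "A" to "D" are placeholders.
TAXONOMY = {
    "A": "nervous system",
    "B": "nervous system",
    "C": "circulatory system",
    "D": "circulatory system",
}

def global_generalize(records, attr, values_to_lift):
    """Global generalization (global recoding): once a value is chosen
    for generalization, every occurrence of it in the whole data set is
    replaced by its taxonomy parent. The result is uniform, but the
    information loss can be large, as noted above."""
    out = []
    for r in records:
        items = {TAXONOMY[v] if v in values_to_lift else v
                 for v in r[attr]}
        out.append({**r, attr: items})
    return out

records = [
    {"diseases": {"A", "C"}},
    {"diseases": {"A", "D"}},
    {"diseases": {"B"}},
]

# Lifting "A" affects every record containing "A", not just one record.
result = global_generalize(records, "diseases", {"A"})
```

The contrast with local generalization is that the decision here is made once for the whole data set, so aggregation stays consistent at the cost of generalizing occurrences that, individually, did not need it.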
A related technology described in NPL 5 carries out “global suppression” of items so that the number of records associated with any combination of items included in the sensitive information becomes k or greater. Global suppression here is a method which determines whether or not a certain attribute value is to be removed by considering k-anonymity and the information loss of the data set as a whole. This related technology removes an attribute value determined to be removed so that it no longer exists anywhere in the data set. Thus, this related technology has a problem in that the number of removed items is likely to increase.
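A minimal sketch of global suppression follows; the criterion used here (an item is removed everywhere if it appears in fewer than k records) is a simplification of the item-combination criterion used in the cited technique, and the attribute names are hypothetical.

```python
from collections import Counter

def global_suppress(records, attr, k):
    """Global suppression (sketch): an item that appears in fewer than
    k records is judged too identifying and is removed from every
    record in the data set."""
    counts = Counter(item for r in records for item in r[attr])
    keep = {item for item, c in counts.items() if c >= k}
    return [{**r, attr: r[attr] & keep} for r in records]

records = [
    {"drugs": {"X", "Y"}},
    {"drugs": {"X", "Z"}},
    {"drugs": {"X"}},
]

# With k = 2, the rare items "Y" and "Z" are removed everywhere,
# while the common item "X" survives.
suppressed = global_suppress(records, "drugs", 2)
```

As the example suggests, rare items (which are common in medical data) are exactly the ones removed, which is why the number of removed items tends to grow.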
A related technology described in NPL 6 carries out global generalization and removal of items so that the number of records associated with any combination of items included in the sensitive information becomes k or greater. This related technology requires a taxonomy for generalization. This related technology achieves smaller information loss than the technologies described in NPLs 4 and 5, and does not generate the unevenness in generalization caused by the technology described in NPL 3.