Individual data, such as a table that stores one individual's information in each row (record), is sometimes converted so that it retains as much information as possible while the individuals' privacy is protected, and is then put to secondary use, for example in market analysis. One technique for protecting privacy during conversion of the individual data is k-anonymization, which anonymizes the information.
Patent Literature 1: Japanese Laid-open Patent Publication No. 2008-33411.
Patent Literature 2: U.S. Pat. No. 7,269,578.
Non-patent Literature 1: Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data, Vol. 1, March 2007.
However, the above-described conventional technology has a problem in that it gives no consideration to protection of the presence, that is, to making it difficult to determine whether a specific person's information is included in the data; the privacy protection is therefore insufficient.
For example, consider a case where a company X collects individual data from customers and sells the collected individual data to a company Y, and the company Y uses the information obtained by analyzing the individual data for market analysis or the like.
The individual data includes, for example, personal information such as an address, hobby, or disease, as well as identification data (ID) for identifying a customer.
When the company X sells the collected individual data to the company Y, the company X, out of consideration for its customers, would rather not provide a different company, such as the company Y, with data that would infringe the customers' privacy. Meanwhile, the company Y does not need information on individual customers of the company X but would like to acquire statistical information that is as accurate as possible.
One of the conversion methods that satisfy the above needs is to compute statistics on an attribute basis. For example, the method converts the attribute “hobby” in the individual data into a frequency distribution, such as hobby={soccer: 5, petanque: 2}. By referring to the frequency distribution, the company Y may understand the overall trend of hobbies, although it does not know who has which hobby.
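The statistics on an attribute basis described above can be sketched as follows. This is a minimal illustration, not the conversion actually used by any party: the table contents, the name `RECORDS`, and the helper `attribute_statistics` are assumptions introduced here, chosen only so that the hobby distribution matches the example {soccer: 5, petanque: 2}.

```python
from collections import Counter

# Hypothetical individual data; the values are illustrative only and are
# not taken from the figures.
RECORDS = [
    {"id": "P1", "address": "city A", "hobby": "soccer"},
    {"id": "P2", "address": "city A", "hobby": "soccer"},
    {"id": "P3", "address": "city A", "hobby": "soccer"},
    {"id": "P4", "address": "city B", "hobby": "petanque"},
    {"id": "P5", "address": "city C", "hobby": "petanque"},
    {"id": "P6", "address": "city C", "hobby": "soccer"},
    {"id": "P7", "address": "city B", "hobby": "soccer"},
]


def attribute_statistics(rows, attribute):
    """Return the frequency distribution of a single attribute."""
    return dict(Counter(row[attribute] for row in rows))


hobby_stats = attribute_statistics(RECORDS, "hobby")
address_stats = attribute_statistics(RECORDS, "address")
print(hobby_stats)    # {'soccer': 5, 'petanque': 2}
print(address_stats)  # {'city A': 3, 'city B': 2, 'city C': 2}
```

Each distribution is computed independently, so the output no longer records which address goes with which hobby.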
However, the statistics on an attribute basis do not preserve the relationship between attributes. For example, in a case where the company Y would like to analyze the correlation between “address” and “hobby”, it is difficult to analyze that correlation even if the company Y acquires the statistical data on “address” and on “hobby”, because the two sets of statistics are separate. Therefore, the company X conducts a conversion that makes it possible to analyze the relationship between attributes.
One of the conversion methods that allow analysis of the relationship between attributes is removal of personal identification. FIG. 15 is an explanatory diagram that illustrates an example of removing personal identification from individual data. As illustrated in FIG. 15, during removal of personal identification, an attribute that identifies the individual in each row of the individual data (“ID” in the illustrated example) is deleted. The company Y may analyze the relationship between attributes by using the individual data from which personal identification has been removed.
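Removal of personal identification amounts to dropping the identifier column from every row. The following is a minimal sketch under assumed data: the rows and the helper name `remove_identifier` are hypothetical and do not reproduce FIG. 15.

```python
# Hypothetical rows; only the column names follow the description.
RECORDS = [
    {"id": "P3", "address": "city A", "hobby": "soccer", "disease": "cold"},
    {"id": "P4", "address": "city B", "hobby": "petanque", "disease": "progeria"},
    {"id": "P5", "address": "city C", "hobby": "petanque", "disease": "allotriophagy"},
]


def remove_identifier(rows, identifier="id"):
    """Return copies of the rows without the identifying attribute."""
    return [
        {key: value for key, value in row.items() if key != identifier}
        for row in rows
    ]


deidentified = remove_identifier(RECORDS)
for row in deidentified:
    print(row)
```

The remaining columns stay together in each row, which is why the relationship between attributes can still be analyzed.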
However, in terms of privacy protection, removal of personal identification alone is sometimes insufficient. For example, consider the following situation.
The person with ID=P4 does not want other people (the company Y) to know his/her disease.
The company Y knows the person with ID=P4 and knows that only that person likes “petanque” in a city B.
In the above situation, the removal of personal identification illustrated in FIG. 15 does not provide sufficient privacy protection. This is because the company Y may identify the fourth row (the only row with (address, hobby)=(city B, petanque)) as the data on the person with ID=P4, and it may determine that the “disease” of that person is “progeria”. This is exactly the information that the person with ID=P4 does not want to be known. Therefore, the company X conducts the conversion on the assumption that the company Y has detailed information about fewer than k (>1) people.
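The re-identification described above can be sketched as a simple match of background knowledge against the table from which the ID has been removed. The table below is hypothetical (only the fourth and fifth rows use values mentioned in the text), and `link_rows` is a name introduced here for illustration.

```python
# Hypothetical table after removal of personal identification.
DEIDENTIFIED = [
    {"address": "city A", "hobby": "soccer", "disease": "cold"},
    {"address": "city A", "hobby": "soccer", "disease": "flu"},
    {"address": "city A", "hobby": "soccer", "disease": "asthma"},
    {"address": "city B", "hobby": "petanque", "disease": "progeria"},
    {"address": "city C", "hobby": "petanque", "disease": "allotriophagy"},
    {"address": "city C", "hobby": "soccer", "disease": "gout"},
    {"address": "city B", "hobby": "soccer", "disease": "cold"},
]


def link_rows(rows, background):
    """Return every row consistent with the attacker's background knowledge."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in background.items())
    ]


# What the company Y is assumed to know about the person with ID=P4.
matches = link_rows(DEIDENTIFIED, {"address": "city B", "hobby": "petanque"})
if len(matches) == 1:
    # A unique match discloses the sensitive value of that person.
    print("disease disclosed:", matches[0]["disease"])
```

Because exactly one row matches, the sensitive value is disclosed even though the ID column is absent.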
One of the conversion methods that allow analysis of the relationship between attributes under the assumption that the recipient has detailed information on fewer than k (>1) people is k-anonymization for achieving l-diversity (hereinafter, k-anonymization).
With the k-anonymization, a quasi-identifier (QI), a sensitive attribute (SA), and the like are received as inputs, and the rows with similar QI values are grouped (converted into the same value, for example) so that the SA values in each group have diversity. The SA is a column of data that the individual (the provider of the row information) does not want to be known without good reason, and the QI is a set of one or more columns of data that may be easily known to others. Diversity is a property that is determined from the frequency distribution of the SA values, for example, the property that the frequency distribution is only slightly biased.
FIG. 16 is an explanatory diagram that illustrates k-anonymization. Specifically, in the example of FIG. 16, the individual data illustrated in FIG. 15 is anonymized with QI={address, hobby}, SA=disease, the diversity condition “there are two or more types of values”, and k=2. As illustrated in FIG. 16, “address” is generalized so that the rows are divided horizontally into two groups, each having the same QI value. Furthermore, the SA values in both groups satisfy diversity; that is, the frequency distribution of the SA value in each group has two or more types of values.
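The conditions in this example can be checked mechanically. The sketch below is one simple reading of them, stated as an assumption: every group of rows sharing a QI value must contain at least k rows and at least two distinct SA values. The tables, the generalized labels “region 1” and “region 2”, and the function names are hypothetical and do not reproduce FIG. 16.

```python
from collections import defaultdict


def group_by_qi(rows, qi):
    """Group rows by the tuple of their quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[attr] for attr in qi)].append(row)
    return groups


def satisfies(rows, qi, sa, k=2, min_distinct=2):
    """Check group size >= k and >= min_distinct SA values in every QI group."""
    return all(
        len(group) >= k and len({row[sa] for row in group}) >= min_distinct
        for group in group_by_qi(rows, qi).values()
    )


# Hypothetical table before generalization of "address".
RAW = [
    {"address": "city A", "hobby": "soccer", "disease": "cold"},
    {"address": "city A", "hobby": "soccer", "disease": "flu"},
    {"address": "city B", "hobby": "petanque", "disease": "progeria"},
    {"address": "city C", "hobby": "petanque", "disease": "allotriophagy"},
]
# Hypothetical generalization: addresses are replaced by coarser labels.
ANONYMIZED = [
    {"address": "region 1", "hobby": "soccer", "disease": "cold"},
    {"address": "region 1", "hobby": "soccer", "disease": "flu"},
    {"address": "region 2", "hobby": "petanque", "disease": "progeria"},
    {"address": "region 2", "hobby": "petanque", "disease": "allotriophagy"},
]

QI = ("address", "hobby")
print(satisfies(RAW, QI, "disease"))         # False
print(satisfies(ANONYMIZED, QI, "disease"))  # True
```

The raw table fails because (city B, petanque) and (city C, petanque) each form a group of one row; after generalization both rows fall into the same group with two different diseases.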
Therefore, with the k-anonymization in this example, the relationship between attributes may be analyzed while consideration is given to privacy protection. For example, even when the company Y receives the k-anonymized individual data, it is difficult for the company Y to determine which of rows {4, 5} is the row of the person with ID=P4, and it is therefore difficult to determine which of {progeria, allotriophagy} is that person's disease. As described above, k-anonymization is a conversion method that allows analysis of the relationship between attributes while giving consideration to privacy protection.
However, because k-anonymization uses an algorithm that does not consider protection of the presence, the privacy protection is insufficient in some cases. Here, protection of the presence means making it difficult to determine whether a specific person's information (row) is included in the table. For example, if the company Y knows a person who lives in a government-designated city and who likes petanque, it may determine that the person is not included in the example of FIG. 16. Thus, in the example of FIG. 16, protection of the presence is not provided.
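The lack of presence protection can be sketched as a membership test against the anonymized table. In this hypothetical illustration, each generalized address is modeled as the set of concrete addresses it covers; the city names, the table, and the function `may_be_present` are assumptions introduced here and do not reproduce FIG. 16.

```python
# Hypothetical anonymized table: "address" holds the set of concrete
# addresses covered by the generalized value.
ANONYMIZED = [
    {"address": frozenset({"city A", "city B", "city C"}), "hobby": "soccer"},
    {"address": frozenset({"city A", "city B", "city C"}), "hobby": "soccer"},
    {"address": frozenset({"city B", "city C"}), "hobby": "petanque"},
    {"address": frozenset({"city B", "city C"}), "hobby": "petanque"},
]


def may_be_present(rows, address, hobby):
    """Return True if at least one row is consistent with the known person."""
    return any(
        address in row["address"] and row["hobby"] == hobby
        for row in rows
    )


# A petanque player from city B is consistent with two rows, so the
# recipient cannot rule the person out.
print(may_be_present(ANONYMIZED, "city B", "petanque"))  # True
# No petanque row covers city D, so the recipient learns with certainty
# that this person is absent from the table.
print(may_be_present(ANONYMIZED, "city D", "petanque"))  # False
```

When the test returns False, absence from the table is certain, and that certainty is itself information about the person.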
For example, suppose that the table of FIG. 16 contains data on “residents who receive medical care and earn 5 million yen or less a year”. If it is determined that a person known to the company Y is not included in the table, it may be inferred that the person earns more than 5 million yen a year. Therefore, if the presence is not protected, the privacy protection can be said to be insufficient.