The data owned by data owners such as companies and individuals tends to grow in volume and complexity. In many cases, a data owner holds a large amount of data but lacks the analysis skills or the analysis system needed to handle it. Here, analysis skills refer to expert knowledge of statistics or analysis tools, and an analysis system refers to an analysis tool and a distributed system capable of analyzing a large amount of data at high speed.
Accordingly, a data owner who wants to analyze a large amount of data and use it effectively is likely to outsource the analysis to a data analyst who has the analysis skills or the analysis system.
On the other hand, because the data to be analyzed may include personal information, providing it to a data analyst without safeguards is undesirable from the standpoint of privacy protection.
An anonymization technique is one way of providing data to a data analyst while protecting privacy. The term is generic and covers techniques that convert part of the data to be analyzed so that an individual cannot be identified from it.
There are usually no particular problems with the anonymization technique; however, based on the present inventor's analysis, the technique can be improved in terms of evaluating the data addition degree, that is, how much data can be added while maintaining anonymity, when anonymizing original data.
First, based on the inventor's analysis, the data owner has two requirements.
The first requirement is to prevent leakage of private information by keeping the anonymized data provided to a data analyst to the minimum possible.
The second requirement is to improve accuracy of analysis results.
Generally, the amount of anonymized data is smaller than the amount of original data. In addition, if the amount of anonymized data provided to the data analyst is small, the accuracy of the analysis results is lowered. Accordingly, the first and second requirements conflict with each other.
Indicators that quantitatively evaluate the degree to which the first or second requirement is satisfied have been proposed. For the first requirement, an indicator that quantitatively evaluates the degree of anonymization has been proposed. This indicator is referred to as “anonymization degree”. K-anonymity is a known example of an anonymization degree. K-anonymity is satisfied when every combination of values of the attributes to be anonymized that appears in the data is shared by k or more records. In general, the larger the value of k, the smaller the amount of private information leaked.
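The definition of k-anonymity above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited document: the function name `k_anonymity`, the record layout, and the sample rows are all assumptions made for this example and are not taken from any figure.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    values for the attributes to be anonymized (0 for an empty table)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values(), default=0)

# Hypothetical anonymized rows (illustrative only).
rows = [
    {"age": "25-29", "gender": "female", "address": "Kanagawa", "disease": "cold"},
    {"age": "25-29", "gender": "female", "address": "Kanagawa", "disease": "influenza"},
    {"age": "30-34", "gender": "male", "address": "Kanagawa", "disease": "rubella"},
    {"age": "30-34", "gender": "male", "address": "Kanagawa", "disease": "rubella"},
]

k = k_anonymity(rows, ["age", "gender", "address"])
print(k)
```

The table as a whole is only as anonymous as its smallest group, which is why the sketch takes the minimum group size.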
For the second requirement, an indicator to evaluate the degree of data loss by anonymization has been suggested. This indicator is referred to as “data loss degree”. An example of data loss degree is described in a non-patent document [SPEC] (URL: http://www.meti.go.jp/policy/it_policy/daikoukai/igvp/cp2_jp/common/personal/2009_infra_A_1_External_specifications.pdf), Section 3.10.3.
In order to satisfy both of the first and second requirements, it is necessary to maximize the anonymization degree, and to minimize the data loss degree to the greatest extent possible.
In order to satisfy the second requirement, a larger amount of data is preferable. Accordingly, it is preferable to add more data later. Patent document 1 (Jpn. Pat. Appln. KOKAI Publication No. 2010-97336) discloses a system that evaluates, when data is added, the degree to which a decrease in the anonymization degree is avoided, and issues a warning if the degree falls below the standard determined by the data owner.
In this system, to satisfy the first requirement, it is preferable not to provide the added data to the data analyst when the anonymization degree falls below the standard. The prior art therefore has the inconvenience that only data that does not decrease the anonymization degree can be added, so the accuracy of the analysis results cannot be improved by increasing the amount of data.
This inconvenience will be explained by using specific data as shown in FIG. 15 to FIG. 21.
Data (original data) D that the data owner originally owns is stored in a table containing four items: age, gender, address, and disease, as shown in FIG. 15. It is assumed that the probability distribution of diseases by age, gender, and address is analyzed by using the data. The items to be anonymized are age, gender, and address. That is, it is assumed that an individual cannot be identified from the disease alone.
To prevent leakage of private information (first requirement), the original data D is anonymized to obtain anonymized data DA1 as shown in FIG. 16, so that the anonymization degree satisfies k=2. It is apparent that k=2 since FIG. 16 shows that two or more rows are present for each of the combinations of age, gender, and address, which are items to be anonymized in the anonymized data DA1. When deriving the anonymized data DA1 from the original data D, two anonymization techniques are used.
One of the techniques is age grouping (in 5-year increments). In this technique, each age in the original data D is replaced with the 5-year interval that includes the actual age, as shown in the age column of the anonymized data DA1. For example, the age “25” at the top of the table of the original data D is replaced with “25-29”, which includes the actual age “25”.
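The age grouping described above can be sketched as follows. The function name `group_age` and the assumption that intervals are aligned to multiples of the interval width (so that 25 falls in “25-29”) are choices made for this illustration.

```python
def group_age(age, width=5):
    """Replace an exact age with the width-year interval that contains it.
    Intervals are assumed to start at multiples of `width`."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

print(group_age(25))
print(group_age(29))
print(group_age(30))
```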
Another technique is generalization of address (to prefectures). The address has a hierarchical structure, as shown in FIG. 17. In this technique, a value in municipality level La3, which is the lowest level and the level used in the original data D, is replaced with the corresponding value in prefecture level La2, which is the second lowest level. For example, the address “Yokohama city” at the top of the table of the original data D is replaced with “Kanagawa”, which is one level higher than “Yokohama city”.
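Generalization along a hierarchy can be sketched with a child-to-parent map. Only the pair “Yokohama city” to “Kanagawa” comes from the text; the other entries, the top-level value “Japan”, and the names `PARENT` and `generalize` are assumptions for illustration and may not match FIG. 17.

```python
# Child -> parent map for the address hierarchy.
PARENT = {
    "Yokohama city": "Kanagawa",  # from the text
    "Kawasaki city": "Kanagawa",  # assumed entry
    "Kanagawa": "Japan",          # assumed top level
}

def generalize(value, levels=1, parent=PARENT):
    """Move `value` up the hierarchy by `levels` steps.
    A value with no parent (the root) is returned unchanged."""
    for _ in range(levels):
        value = parent.get(value, value)
    return value

print(generalize("Yokohama city"))
print(generalize("Yokohama city", levels=2))
```

The same function would serve for any other attribute with a hierarchy, given a different parent map.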
Analyzing the anonymized data DA1, the following results can be obtained:
A person applicable to an attribute combination of “25-29, female, Kanagawa”, has a 50% probability of having a cold, and a 50% probability of having influenza.
A person applicable to an attribute combination of “30-34, male, Kanagawa”, has a 100% probability of having rubella.
The attribute combination of “25-29, female, Kanagawa”, indicates a record of a person whose age is 25-29, whose gender is female, and whose address is Kanagawa.
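The kind of analysis described above, a disease distribution per attribute combination, can be sketched as follows. The rows are hypothetical and were chosen only so that the output agrees with the percentages stated in the text; they are not the rows of FIG. 16. The function name `disease_distribution` is likewise an assumption.

```python
from collections import Counter, defaultdict

def disease_distribution(records, quasi_identifiers, target="disease"):
    """For each attribute combination, return the relative frequency
    of each value of `target` among the records in that combination."""
    counts = defaultdict(Counter)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        counts[key][r[target]] += 1
    return {
        key: {value: n / sum(c.values()) for value, n in c.items()}
        for key, c in counts.items()
    }

# Hypothetical anonymized rows (illustrative only).
rows = [
    {"age": "25-29", "gender": "female", "address": "Kanagawa", "disease": "cold"},
    {"age": "25-29", "gender": "female", "address": "Kanagawa", "disease": "influenza"},
    {"age": "30-34", "gender": "male", "address": "Kanagawa", "disease": "rubella"},
    {"age": "30-34", "gender": "male", "address": "Kanagawa", "disease": "rubella"},
]

dist = disease_distribution(rows, ["age", "gender", "address"])
for key, probs in dist.items():
    print(key, probs)
```

With only two records per combination, a single additional record would shift these frequencies substantially, which illustrates why a small population gives low accuracy.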
However, the analysis results of the anonymized data DA1 may have low accuracy because the population for each attribute combination is only two or three people. Suppose that new data D′ is added, as shown in FIG. 18, in order to increase the accuracy of the analysis results (second requirement).
As above, to prevent leakage of private information (first requirement), the new data D′ is anonymized to obtain anonymized data DA3, as shown in FIG. 19, so that the anonymization degree satisfies k=2. When deriving the anonymized data DA3 from the new data D′, two anonymization techniques are used.
One of the techniques is age grouping (in 5-year increments), as above. However, in the anonymized data DA3, age is grouped into 5-year intervals starting from age 23 instead of age 25.
Another technique is generalization of gender (to any). The gender has a hierarchical structure, as shown in FIG. 20. In this technique, a value in male/female level Ls2, which is the lowest level and the level used in the new data D′, is replaced with a value in undefined (any) level Ls1, which is one level higher than Ls2.
Analyzing the anonymized data DA3 of the new data D′, the following results can be obtained:
A person applicable to an attribute combination, “23-27, any, Kanagawa”, has a 50% probability of having a fracture, and a 50% probability of having influenza.
A person applicable to an attribute combination, “28-32, any, Kanagawa”, has a 67% probability of having rubella, and a 33% probability of having a cold.
However, the analysis results of the anonymized data DA3 are based only on the new data D′, not on the combination of the new data D′ and the original data D. This is because the new data D′ and the original data D were anonymized with different techniques, which makes the two data sets difficult to integrate. Therefore, the population is not increased even by adding the new data D′, and the accuracy of the analysis results is not improved (the second requirement is not satisfied).
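The integration difficulty can be illustrated with a small sketch that anonymizes the same hypothetical people under two schemes: one modeled on DA1 (5-year intervals aligned to age 25, gender kept) and one modeled on DA3 (5-year intervals aligned to age 23, gender generalized to “any”). The function names, the sample people, and the simplification that addresses are already at prefecture level are assumptions for illustration.

```python
def group_age(age, start, width=5):
    """Replace an age with the width-year interval containing it,
    where intervals are aligned so that one of them begins at `start`."""
    lower = start + ((age - start) // width) * width
    return f"{lower}-{lower + width - 1}"

def scheme_a(row):
    # Modeled on DA1: intervals 25-29, 30-34, ...; gender kept as is.
    return (group_age(row["age"], start=25), row["gender"], row["address"])

def scheme_b(row):
    # Modeled on DA3: intervals 23-27, 28-32, ...; gender generalized to "any".
    return (group_age(row["age"], start=23), "any", row["address"])

# Hypothetical people (illustrative only).
people = [
    {"age": 25, "gender": "female", "address": "Kanagawa"},
    {"age": 27, "gender": "female", "address": "Kanagawa"},
    {"age": 31, "gender": "male", "address": "Kanagawa"},
]

keys_a = {scheme_a(p) for p in people}
keys_b = {scheme_b(p) for p in people}
shared = keys_a & keys_b
print(sorted(keys_a))
print(sorted(keys_b))
print(shared)
```

Because no attribute combination is common to the two schemes, records anonymized one way never fall into a group produced the other way, so merging the tables does not enlarge any group.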
In contrast, suppose that the new data D′ is anonymized by the same technique as used for the anonymized data DA1, to obtain anonymized data DA1′, as shown in FIG. 21. If the five rows of the anonymized data DA1′ enclosed by broken line DL1 are integrated with the anonymized data DA1, the anonymization degree after integration remains k=2, since the same attribute combinations as in those five rows are already present in the anonymized data DA1.
However, if the other ten rows of the anonymized data DA1′ are integrated with the anonymized data DA1, the anonymization degree after integration falls to k=1, since the same attribute combinations are not present in the anonymized data DA1. This decrease in anonymization degree is not preferable in view of the first requirement.
That is, to maintain the anonymization degree after integration, only the five rows enclosed by the broken line DL1 in the anonymized data DA1′ can be added. This means that only 5 rows out of the 15 rows in the anonymized data DA1′ (i.e., one third of the entire data) can be added, which is not preferable for improving the accuracy of the analysis results (second requirement) by adding more data.
Based on the inventor's analysis, whether or not more data can be added while maintaining the anonymization degree is considered to depend on the anonymization technique applied to the original data. In the aforementioned example, when the new data D′ is anonymized by the same technique as used for the original data (anonymized data DA1), only five rows can be added while maintaining the anonymization degree. However, if other techniques are used, the number of rows that can be added while maintaining the anonymization degree after integration may exceed five. Accordingly, there is a need to evaluate the data addition degree, with the anonymization degree maintained, when anonymizing the original data.
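The selection described above can be sketched as follows, using the criterion stated in the text: a new row can be added only if its attribute combination is already present in the anonymized original data. The rows are hypothetical (including the “Tokyo” value) and much smaller than FIG. 21, and `addable_fraction` is a name invented for this illustration; it is not necessarily the data addition degree defined by the embodiments.

```python
from collections import Counter

QI = ("age", "gender", "address")  # items to be anonymized

def key(row):
    return tuple(row[q] for q in QI)

def min_group_size(rows):
    """Anonymization degree k: size of the smallest attribute-combination group."""
    return min(Counter(key(r) for r in rows).values(), default=0)

def addable_rows(anonymized_original, anonymized_new):
    """Rows of the new data whose attribute combination already exists
    in the anonymized original data (the criterion used in the text)."""
    existing = {key(r) for r in anonymized_original}
    return [r for r in anonymized_new if key(r) in existing]

# Hypothetical rows (illustrative only).
original = [
    {"age": "25-29", "gender": "female", "address": "Kanagawa"},
    {"age": "25-29", "gender": "female", "address": "Kanagawa"},
    {"age": "30-34", "gender": "male", "address": "Kanagawa"},
    {"age": "30-34", "gender": "male", "address": "Kanagawa"},
]
new = [
    {"age": "25-29", "gender": "female", "address": "Kanagawa"},  # combination exists
    {"age": "35-39", "gender": "male", "address": "Kanagawa"},    # new combination
    {"age": "20-24", "gender": "female", "address": "Tokyo"},     # new combination
]

addable = addable_rows(original, new)
addable_fraction = len(addable) / len(new)
k_selected = min_group_size(original + addable)  # add only the matching rows
k_all = min_group_size(original + new)           # add every new row

print(len(addable), addable_fraction, k_selected, k_all)
```

In this toy data one of three new rows can be added with k kept at 2, while adding all rows lowers k to 1, mirroring the one-third ratio of the example in the text.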
The problem to be solved by the embodiments is to provide an anonymization indicator computation system that is capable of evaluating the data addition degree to maintain the anonymization degree when anonymizing the original data.