Big data refers to data that includes not only the structured data used by conventional enterprises or public institutions but also unstructured and semi-structured data not utilized so far, such as e-commerce data, metadata, web log data, radio frequency identification (RFID) data, sensor network data, social network data, Internet text and documents, and Internet search indexing data. Such data is referred to as big data in the sense that common software tools and computer systems cannot handle such a huge volume of data.
Although such big data may be insignificant by itself, it can be useful for generating new data and for making judgments or predictions in various fields through machine learning on patterns and the like.
Recently, due to the strengthening of personal information protection laws, trading or sharing such big data requires either deleting information that can be used to identify individuals from the data or acquiring the consent of those individuals. However, it is not easy to check whether a large amount of big data includes information that can be used to identify individuals, and it is impossible to obtain the consent of all the individuals. Therefore, various techniques for such purposes are emerging.
As an example of related prior art, a technique is disclosed in Korean Patent Registration No. 1861520. According to this technique, a face-concealing method is provided which includes: a detection step of detecting a facial region of a person in an input image to be transformed; a first concealing step of transforming the detected facial region into a distorted first image that does not have a facial shape of the person, so that the person in the input image is prevented from being identified; and a second concealing step of generating a second image having a predetermined facial shape based on the first image and transforming the first image into the second image in the input image, where the second image is generated to have a facial shape different from that of the facial region detected in the detection step.
However, according to conventional techniques, including the technique described above, it is first determined whether identification information such as faces or text is included in the data, and at least one portion corresponding to the identification information is then masked or blurred. As a result, machine learning cannot utilize such information due to the damage to the original data, and in some cases the data contains unexpected identification information that cannot be concealed, e.g., anonymized. In particular, a conventional security camera performs an anonymizing process by blurring all pixels that change between frames in a video image. When the anonymizing process is performed in this manner, critical information such as the facial expression of an anonymized face becomes different from the information contained in the original video image, and personal identification information missed during face detection may remain in the original video image. Also, the blurred video image may be reverted to the original image using one of the conventional video deblurring techniques.
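The frame-difference blurring attributed above to conventional security cameras can be illustrated with a minimal sketch. The function names, the box-blur kernel, the threshold value, and the toy 8 x 8 frames are all assumptions made for illustration and are not taken from any particular camera product or from the cited patent. The sketch also shows the weakness noted above: a pixel that does not change between frames is left untouched, so identification information located there would remain.

```python
import numpy as np

def box_blur(frame, k=5):
    """Blur a 2-D grayscale frame with a k x k mean filter (k odd, edge-padded)."""
    pad = k // 2
    padded = np.pad(frame.astype(np.float64), pad, mode="edge")
    h, w = frame.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum the k*k shifted copies of the padded frame, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def anonymize_changed_pixels(prev_frame, frame, threshold=10.0, k=5):
    """Replace only the pixels that changed between two frames with blurred values."""
    diff = np.abs(frame.astype(np.float64) - prev_frame.astype(np.float64))
    changed = diff > threshold
    result = frame.astype(np.float64).copy()
    result[changed] = box_blur(frame, k)[changed]
    return result, changed

# Toy frames (hypothetical data): one static pixel and one moving 2 x 2 patch.
prev_frame = np.zeros((8, 8))
prev_frame[0, 0] = 50.0           # static detail, present in both frames
frame = prev_frame.copy()
frame[3:5, 3:5] = 200.0           # region that changed between frames

result, changed = anonymize_changed_pixels(prev_frame, frame)
print("changed pixels:", int(changed.sum()))
print("static pixel kept as-is:", result[0, 0] == frame[0, 0])
```

In this toy case only the four changed pixels are blurred, while the static pixel keeps its original value, which mirrors the concern that information missed by the change detection survives the anonymizing process.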
Accordingly, the inventors of the present disclosure propose a method for generating obfuscated data such that the obfuscated data is different from the original data, while an output result of inputting the original data into a machine learning model and an output result of inputting the obfuscated data into the machine learning model are the same as or similar to each other.
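The stated goal, i.e., an input that differs from the original while the model output stays the same or similar, can be made concrete with a toy sketch. The sketch below is not the method of the present disclosure; it assumes, purely for illustration, a linear model y = W x, for which any perturbation lying in the null space of W changes the input without changing the output. The function name, the matrix W, and the noise scale are hypothetical.

```python
import numpy as np

def obfuscate_for_linear_model(W, x, scale=5.0, seed=0):
    """Perturb x along the null space of W so that W @ x is (numerically) unchanged."""
    rng = np.random.default_rng(seed)
    # Projector onto the null space of W: W @ (P @ v) is approximately zero for any v.
    P = np.eye(W.shape[1]) - np.linalg.pinv(W) @ W
    delta = P @ rng.normal(scale=scale, size=x.shape)
    return x + delta

# Hypothetical stand-in for a trained model: 10 input features, 3 outputs.
rng = np.random.default_rng(42)
W = rng.normal(size=(3, 10))
x = rng.normal(size=10)           # "original data"

x_obf = obfuscate_for_linear_model(W, x)
input_gap = np.linalg.norm(x_obf - x)            # how different the inputs are
output_gap = np.linalg.norm(W @ x_obf - W @ x)   # how different the outputs are
print("input gap:", input_gap)
print("output gap:", output_gap)
```

For a nonlinear model no such closed-form perturbation is generally available, so the two quantities computed here (input gap and output gap) would instead serve as criteria that an obfuscation procedure tries to satisfy: a large input gap together with a small output gap.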