Nowadays, various kinds of information are shared through networks such as the Internet, an intranet, and a LAN, and are thus getting more usable and accessible. A server for managing contents and the like to be provided is used to manage information and provide the information to information users on the Internet or the like. The server accepts an access from a client device connected to the server through a network, and executes processing such as provision of requested contents, user registration, or registration/update of personal information.
Conceivable servers connected to the network include a mail server enabling the sending/receiving of emails using SMTP; a web server implementing a Common Gateway Interface (CGI) and the like for providing web services using HTTP; an FTP server; and a database server managing various kinds of data and providing the data in response to an access request. Every time these servers execute processing, they accumulate information on the users accessing them, authentication results, data contents sent for the processing, execution results, and the like. The information thus accumulated differs depending on the type of server, but mainly includes a source IP address, a source domain name, an access time stamp, an accessed file name, a link source URL, a web browser name and an OS name of a visitor, the time spent for the processing, the number of received bytes, the number of transmitted bytes, a service status code, and the like. An information processing apparatus such as a server accumulates such information through its operations and records it in a file, a database, or the like; this record is simply referred to as a log hereafter.
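As an illustration of how the fields listed above may appear in practice, the following minimal sketch parses one line in the widely used Apache "combined" access log layout. The sample line, the function name parse_line, and the field names are assumptions made for this illustration and are not taken from FIG. 10.

```python
import re

# Pattern for the Apache "combined" access log layout:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

# Hypothetical sample line; the address and names are placeholders.
sample = (
    '192.0.2.10 - alice [10/Oct/2000:13:55:36 -0700] '
    '"GET /private/report.html HTTP/1.0" 200 2326 '
    '"http://example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

if __name__ == "__main__":
    for name, value in parse_line(sample).items():
        print(f"{name}: {value}")
```

Even this single line exposes a source address, a user ID, a directory path, and a link source URL, which is why the passages that follow treat raw logs as sensitive.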
As described above, logs created by a server contain a great deal of high-value information. Hence, through log analysis, the logs can be used, for example, to examine a history of malicious attacks on the server such as distributed denial-of-service (DDoS) attacks, or a history of unauthorized access to the server, or to carry out market analysis by statistical analysis of information on accesses to the server.
Besides, with respect to the illegal accesses to servers that have frequently occurred in recent years, logs may also be usable to survey time-sequenced changes and target transitions of attackers on the network more comprehensively by cross-analyzing the logs obtained at plural organizations. However, since a log may include basic network information and personal information as described above, there is a risk of data leakage through disclosure of logs to an external analysis vendor for log analysis, or through disclosure of logs across multiple domains even if the domains are reliable.
FIG. 10 shows an example of an access log 1000 of a web server implemented using Apache 2.0 and a transaction log 1100 of an FTP server. In FIG. 10, network information, private information, and port information are replaced with asterisks “*” to conceal them. As shown in FIG. 10, a log may include server backbone information such as a fixed IP address of a server, a port number being used, and a hierarchical directory structure, and may also include private information such as a user ID and extremely highly confidential information such as a password. However, since a large variety of information can be recorded in a log, the location within a log of a string containing highly confidential information differs depending on the content of the log.
For example, disclosure of the raw logs of FIG. 10 to an external party poses a risk to a company or organization because it means disclosure of its network information, server information, personal information, and the like. In addition, if the logs are leaked to malicious attackers, there are risks that high value-added information accumulated by the company may be destroyed or stolen by hacking, and that the company may be targeted by denial-of-service (DoS) attacks and the like.
Hence, by providing a raw log to an external analysis vendor, a company or organization using a server can obtain useful information but, in return, faces high risks of confidential information leakage, private information leakage, information leakage by unauthorized access to the server, and the like. For these reasons, even if disclosure of a log to a third party aims to analyze a history of accesses to a server and to reflect the analysis result in functions of the server, the disclosure still faces a high hurdle that a nondisclosure agreement alone cannot clear, which impedes flexible log analysis. Further, if highly confidential information can be found in log information, the highly confidential information may be collectively replaced with asterisks or the like. In such a case, however, the log sometimes loses information indicating the identity of the accessing person or the identity of the accessed data. Thus, it is preferable to conceal log information in such a way that the attributes of the original data as well as the identity of the original data are kept recognizable.
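The difference between uniform asterisk replacement and a concealment that keeps attributes and identity recognizable can be sketched as follows. This is only an illustration under stated assumptions and is not the method of any cited document: the keyed-hash approach, the key SECRET_KEY, the attribute label passed as kind, and the function names mask and pseudonymize are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the log owner; a real key would be managed securely.
SECRET_KEY = b"example-key-not-for-production"

def mask(value):
    """Replace every character with an asterisk; attribute and identity are lost."""
    return "*" * len(value)

def pseudonymize(value, kind):
    """Replace a value with a token that keeps its attribute label (kind) and
    maps equal inputs to equal tokens, so occurrences can still be correlated."""
    message = f"{kind}:{value}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"<{kind}:{digest[:8]}>"

if __name__ == "__main__":
    for user in ("alice", "carol", "alice"):
        print(mask(user), pseudonymize(user, "USER"))
```

With masking, "alice" and "carol" both become five asterisks and can no longer be told apart; with the keyed token, repeated accesses by the same user remain linkable while the user ID itself is not shown.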
Methods of judging a confidentiality level of a log have been heretofore known. For example, Japanese Patent Application Publication No. 2009-116680 (Patent Literature 1) aims to provide a technique for easily and precisely detecting a data type of input/output data of a computer, such as the presence/absence of confidentiality, to contribute to proper management of the data. The technique described in Patent Literature 1 is for judging the data type precisely by machine learning and includes: reading means for reading the input/output data; data contents acquiring means for acquiring a character string included in the input/output data; feature extracting means for extracting, as a feature, the character string or a given character group included in the character string; and data type judging means for judging a data type of the feature by referring to data type learned results stored in an external storage device and obtained by machine learning using training data whose data types are previously known.
The method described in Patent Literature 1 enables judgment of confidentiality of information in a log. However, since training data is used for the judgment, it is not possible to judge the confidentiality of information not included in the training data, leaving a risk of confidential information leakage. Besides, a technique of detecting confidential words based on regular expressions and a word list is not a sufficient solution because it is limited by the huge amount of effort needed for data construction, by omission of words, and the like in registering types of regular expressions and registering words in the word list. It is also conceivable to define a perfect schema for a log in advance and anonymize confidential information in accordance with the schema; but it is not realistic to create a variety of perfect schemata for the variety of logs to be created. Further, no matter how many words or schemata are added, there remain numerous uncommon names that are not covered. Furthermore, it is also necessary to deal with a log containing wrongly inputted information such as a mistyped user ID/password and data inputted in a wrong field.
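The limitation of detection based on regular expressions and a word list can be seen in a minimal sketch such as the following. The pattern IPV4, the list KNOWN_USERS, the function find_confidential, and the sample lines are assumptions made for this illustration only.

```python
import re

# One registered pattern (IPv4-like strings) and one registered word list.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
KNOWN_USERS = {"alice", "bob"}  # hypothetical registered user IDs

def find_confidential(line):
    """Return the substrings flagged by the registered pattern or word list."""
    hits = set(IPV4.findall(line))
    for token in re.findall(r"[A-Za-z0-9_]+", line):
        if token.lower() in KNOWN_USERS:
            hits.add(token)
    return hits

if __name__ == "__main__":
    # A registered address and a registered user ID are both flagged.
    print(find_confidential("192.0.2.10 USER alice"))
    # An unregistered user ID and a password typed in the clear are missed.
    print(find_confidential("USER xqz_tanaka PASS hunter2"))
```

The second line contains a user ID and a password, yet nothing is flagged because neither string was registered in advance, which is the omission problem described above.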