As the Internet has expanded, cyberattacks such as distributed denial of service (DDoS) attacks and spam mail transmission have increased rapidly. Most of these attacks are carried out by malicious software called malware. Attackers infect the terminals and servers of ordinary users with malware and operate the malware to illegally control those terminals and servers with the intention of collecting information or launching additional attacks. These attacks have recently become an issue of public concern. For this reason, measures against cyberattacks, mainly against malware infection, are urgently required.
As measures against cyberattacks, measures taken on terminals and measures taken on networks have been studied. The measures taken on terminals include an approach using antivirus software and an approach using a host-based intrusion detection system (IDS) or a host-based intrusion prevention system (IPS). These measures are executed by software installed in a terminal.
Meanwhile, the measures taken on networks include an approach using a network-based IDS or a network-based IPS and an approach using a firewall or a web application firewall (WAF). In these measures, an inspection device is disposed at a connection point on a network. Recently, security information and event management (SIEM) services and the like, which analyze logs of terminals or devices to discover traces of attacks, have been provided as well. In both the measures taken on terminals and the measures taken on networks, the measures are planned based on feature information regarding known attacks prepared in advance.
Additionally, both the measures taken on terminals and the measures taken on networks rely on collecting information on communication relating to attacks. For example, a decoy system called a honeypot is used to collect communication peers and communication content in malware infection attacks and other cyberattacks, while a malware analysis system called a sandbox is used to actually execute malware and thereby collect the communication peers and communication content of the malware. An anti-spam mail system or an anti-DDoS system is likewise used to collect communication peers and communication content in communication determined to be an attack. The feature information is then extracted from this information on communication relating to attacks. In many cases, the extraction is performed automatically using known techniques such as machine learning.
In an approach for automatically extracting the feature information, the information on communication relating to attacks is summarized by categorizing it into items set in advance, for example, the date and time, the Internet protocol (IP) address of a communication peer, the port number used in communication, and the number of times of communication and the amount of communication during a predetermined period. It is common to input the observed value itself for items such as the date and time or the port number, whereas a statistic such as an average, a standard deviation, or a variance is input in some cases for the number of times of communication or the amount of communication. Once the categorization has been completed and the summary values have been calculated, a statistical outlier, for example, is searched for. When an outlier is discovered, the communication relating to that value is determined to be an attack and, at the same time, the outlier in the relevant item is set as a rule for detecting attacks. Additionally, the value in the relevant item is identified as feature information observed in attacks.
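The summarize-then-search procedure described above can be illustrated with a minimal sketch. The following Python code is not taken from any cited technique; it assumes that each record is a dictionary with a hypothetical "peer_ip" field, that the summarized item is the number of times of communication per peer, and that an outlier is defined by a z-score threshold (one possible criterion among many).

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_by_peer(records):
    """Count how many times each peer IP address appears in the records."""
    # "peer_ip" is a hypothetical field name chosen for this sketch.
    return Counter(record["peer_ip"] for record in records)


def find_outliers(counts, threshold=2.0):
    """Return the peers whose count lies more than `threshold` population
    standard deviations above the mean count (assumed outlier criterion)."""
    values = list(counts.values())
    if not values:
        return {}
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all peers behave identically, so nothing stands out
        return {}
    return {ip: n for ip, n in counts.items() if (n - mu) / sigma > threshold}
```

Each peer returned by find_outliers would correspond to a candidate detection rule for the item "number of times of communication". Note that such a sketch inherits the weakness discussed later in this section: if harmless communication is mixed into the records, it can be flagged just as readily as an attack.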
Furthermore, for a discovered attack, the IP address of the communication peer, for example, can be added to a blacklist and set as feature information so that communication with this IP address is determined to be an attack. In some cases, a uniform resource locator (URL) of the communication peer is used to create the blacklist, in which case the URL is sometimes registered in the blacklist as a regular expression.
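As an illustration of the blacklist described above, the following Python sketch holds exact IP addresses together with URL entries expressed as regular expressions. The class name, method names, and example addresses are hypothetical and are introduced only for this sketch.

```python
import re


class Blacklist:
    """Hypothetical blacklist holding peer IP addresses and URL patterns."""

    def __init__(self):
        self.ips = set()
        self.url_patterns = []

    def add_ip(self, ip):
        """Register the IP address of a peer observed in an attack."""
        self.ips.add(ip)

    def add_url_pattern(self, pattern):
        """Register a URL as a regular expression."""
        self.url_patterns.append(re.compile(pattern))

    def is_attack(self, peer_ip=None, url=None):
        """Return True if the peer IP or the URL matches a blacklist entry."""
        if peer_ip is not None and peer_ip in self.ips:
            return True
        if url is not None:
            return any(p.search(url) for p in self.url_patterns)
        return False
```

For example, after add_url_pattern(r"^http://malicious\.example/[a-z0-9]+\.php$") is called, a URL such as http://malicious.example/gate1.php is treated as an attack, whereas a URL on an unrelated host is not. A regular expression lets one entry cover a family of similar URLs, at the cost of a higher risk of matching harmless URLs when the pattern is too broad.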
Usually, when traffic logs and alerts are collected from different types of devices and software to extract information on communication peers and communication content, the description methods for the respective items differ in some cases depending on the type of device or software. Recently, techniques that convert log information expressed in different description methods into a uniform description method and summarize it have also become widespread in the form of security information and event management (SIEM) products.
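The conversion to a uniform description method can be sketched as follows. Both input formats below are invented for illustration and do not correspond to any particular product: a firewall line of the form "<epoch seconds> <source IP> <destination IP>:<port>" and a proxy entry given as a dictionary with an ISO 8601 timestamp. The output field names are likewise assumptions.

```python
from datetime import datetime, timezone


def normalize_firewall(line):
    """Convert a hypothetical firewall log line into the uniform record."""
    ts, src, dst = line.split()
    peer_ip, peer_port = dst.rsplit(":", 1)
    return {
        "time": datetime.fromtimestamp(int(ts), tz=timezone.utc),
        "src_ip": src,
        "peer_ip": peer_ip,
        "peer_port": int(peer_port),
    }


def normalize_proxy(entry):
    """Convert a hypothetical proxy log entry into the uniform record."""
    return {
        "time": datetime.fromisoformat(entry["timestamp"]),
        "src_ip": entry["client"],
        "peer_ip": entry["server_ip"],
        "peer_port": int(entry["server_port"]),
    }
```

Once both sources are expressed with the same items and the same value types, records that describe the same communication compare equal and can be summarized together regardless of which device produced them.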
Non Patent Literature 1: R. Perdisci, W. Lee, and N. Feamster, “Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces,” NSDI, p. 26, April 2010.
Non Patent Literature 2: Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten, and I. Osipkov, “Spamming Botnets: Signatures and Characteristics,” Proceedings of the ACM SIGCOMM 2008 conference on Data communication—SIGCOMM '08, vol. 38, no. 4, p. 171, August 2008.
In the aforementioned conventional technique, however, there is a problem in that extracting accurate feature information of attacks is costly.
Specifically, when harmless communication is mixed into the collected information on communication relating to attacks, there is a risk that feature information of the harmless communication is mistakenly extracted and that a rule formed from this information is incorrectly adopted as a rule for identifying malicious traffic logs.
For example, malware often accesses a legitimate web site or the like for the purpose of hindering analysis or confirming the connection to the Internet. For this reason, normal access to a legitimate web site may be mixed into the communication peers and communication content of the malware collected using the sandbox.
Approaches for carefully examining the content of the communication relating to attacks have been studied, including an approach that collects information on the Internet to check the reputation of communication peers and an approach that replays the collected communication content through antivirus software, an IDS, an IPS, a WAF, or the like to inspect whether it is determined to be an attack. Even with these approaches, however, detection omission or misdetection can still occur. It is therefore difficult to automatically and correctly extract communication information on attacks from the information on communication relating to attacks. Detection omission can be tolerated in some cases, because it merely means that an attack that cannot be discovered by other means remains undiscovered. Misdetection, however, must be suppressed to the greatest extent possible in order to avoid the operation cost of, for example, the actions and investigation required after a detection.
At present, accordingly, in order to identify a rule for discovering an attack and to extract the feature information of the attack, an analyst is required to analyze the content manually in most cases. As a result, time and human costs are needed to extract the feature information of attacks, and in recent years, as attacks have become increasingly diverse, these costs have become a major bottleneck for security vendors and service providers.
The disclosed technique has been made in consideration of the aforementioned situation and an object thereof is to extract accurate feature information of attacks at low cost.