Conventionally, malicious website discrimination techniques are used to identify, among the websites accessed by user terminals, the malicious websites that cause malware infection. These techniques are divided into dynamic analysis and static analysis. Dynamic analysis discriminates malicious websites by executing content, such as HTML (HyperText Markup Language) or JavaScript (registered trademark), acquired from a server such as a web server, and detecting attacks on vulnerabilities from the behavior observed on the host at that time. In contrast, static analysis only acquires content from a server, performs code analysis on that content, and uses information about the characteristics of a program, domain, or URL (Uniform Resource Locator) to discriminate malicious websites in accordance with their differences from the characteristics observed in benign websites.
The following describes conventional technology for discriminating malicious websites by static analysis. Such techniques include those that make a discrimination by using the characteristics of content acquired from a single page, and those that make a discrimination by extracting characteristics from the accesses to multiple pages that occur when a website redirects. For the former, many techniques have been proposed that make a discrimination by analyzing the JavaScript (registered trademark) in the content. Among them is a technique (non patent literature 1) in which a feature vector is constructed from, for example, the number of lines or characters in a script or the number of times a keyword appears in the script, and machine learning is used to make the discrimination. Another technique (non patent literature 2) extracts the processing in for statements or while statements and generates a feature vector to make the discrimination. Yet another technique (non patent literature 3) generates a sequence made up of the node types of an abstract syntax tree generated from JavaScript (registered trademark), and makes the discrimination in accordance with the degree of similarity between sequences.
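As an illustration of the kind of feature vector described for non patent literature 1, the following is a minimal sketch that counts lines, characters, and keyword occurrences in a script. The keyword list and the function name `script_features` are assumptions made for this example; the cited literature may use different keywords and additional features, and the machine learning step is omitted.

```python
import re
from typing import List

# Hypothetical keyword list; which keywords are actually counted is not stated here.
KEYWORDS = ("eval", "unescape", "fromCharCode", "document.write")


def script_features(script: str) -> List[int]:
    """Return [line count, character count, one occurrence count per keyword]."""
    features = [len(script.splitlines()), len(script)]
    for keyword in KEYWORDS:
        # Count the keyword only when it is not part of a longer identifier.
        pattern = r"(?<![\w$])" + re.escape(keyword) + r"(?![\w$])"
        features.append(len(re.findall(pattern, script)))
    return features


if __name__ == "__main__":
    sample = "var s = unescape('%61');\neval(s);\neval(s);"
    print(script_features(sample))
```

A vector of this form could then be passed to any classifier trained on scripts labeled benign or malicious.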
Other techniques use HTML, URLs, and host information in addition to JavaScript (registered trademark). For example, the techniques that use HTML include one (non patent literature 4) that generates a decision tree for determining whether content is malicious in accordance with the number of times an iframe or script tag appears, the size of a tag, or the like. The techniques that use URL and host information include one (non patent literature 5) that determines whether a URL is malicious by using keywords included in the URL, DNS (Domain Name System) query results, or geographical information associated with an IP (Internet Protocol) address. There is also a technique (non patent literature 6) that constructs classifiers, each of which makes a malicious determination based on one type of information (JavaScript (registered trademark), HTML, URL, or host), and combines the determinations of the classifiers to determine whether a website is malicious. In addition, a technique has been proposed (non patent literature 7) that extracts the difference between the content acquired by accessing the same page at different times and makes a malicious determination on the basis of that difference.
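The HTML-based features mentioned for non patent literature 4 can be sketched roughly as follows. This example counts iframe and script tags and, as one assumed reading of "the size of a tag", counts iframes whose declared width or height is 0 or 1. The names `TagCounter` and `html_features` are hypothetical, and the decision tree itself is not shown.

```python
from html.parser import HTMLParser
from typing import List


class TagCounter(HTMLParser):
    """Collect simple tag statistics from one HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = {"iframe": 0, "script": 0}
        self.tiny_iframes = 0  # assumption: width or height declared as 0 or 1

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag in self.counts:
            self.counts[tag] += 1
        if tag == "iframe":
            attributes = dict(attrs)
            if attributes.get("width") in ("0", "1") or attributes.get("height") in ("0", "1"):
                self.tiny_iframes += 1


def html_features(html: str) -> List[int]:
    """Return [iframe count, script count, tiny iframe count]."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts["iframe"], parser.counts["script"], parser.tiny_iframes]


if __name__ == "__main__":
    page = (
        '<html><body><iframe src="http://example.test/" width="0" height="0"></iframe>'
        '<script>var a = 1;</script><script src="a.js"></script></body></html>'
    )
    print(html_features(page))
```

Only the standard library parser is used here, so the content is inspected without being executed, in keeping with static analysis.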
In addition to the above, techniques that make a discrimination on the basis of the multiple accesses to websites triggered by redirections use various methods to discriminate malicious websites. For example, the technique of Matsunaka, et al. (non patent literature 8) detects redirections triggered by attacks on vulnerabilities on the grounds of, for example, the HTTP (Hypertext Transfer Protocol) header at the time an executable file is downloaded, or content acquired before the download that contains no information indicating a download. The technique of Stringhini, et al. (non patent literature 9) generates groups of accesses that share the same final page from the time-series sequences of pages accessed by each user during automated redirection, and then generates a feature vector, such as the number of IP addresses or redirections, from each group of accesses to discriminate a malicious website. The technique of Rafique, et al. (non patent literature 10) extracts the part needed for downloading malware from the sequence of pages accessed during redirection by individually accessing the pages in the sequence, and generates a signature to discriminate a malicious website.
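The grouping step described for non patent literature 9 can be illustrated with a small sketch. It assumes each redirect chain is given as a list of (URL, IP address) pairs in visit order, groups the chains by their final page, and derives a few per-group values. The data layout, the function name `group_features`, and the choice of three features are assumptions for this example and are not taken from the cited literature.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Chain = List[Tuple[str, str]]  # assumed layout: (url, ip) pairs in visit order


def group_features(chains: List[Chain]) -> Dict[str, List[int]]:
    """Map each final URL to [chains in group, distinct IPs, maximum redirections]."""
    groups: Dict[str, List[Chain]] = defaultdict(list)
    for chain in chains:
        if chain:  # skip empty sequences
            groups[chain[-1][0]].append(chain)

    features = {}
    for final_url, members in groups.items():
        distinct_ips = {ip for chain in members for _, ip in chain}
        # A chain of n pages contains n - 1 redirections.
        redirections = [len(chain) - 1 for chain in members]
        features[final_url] = [len(members), len(distinct_ips), max(redirections)]
    return features


if __name__ == "__main__":
    observed = [
        [("http://a.test/", "192.0.2.1"), ("http://b.test/", "192.0.2.2"),
         ("http://final.test/", "192.0.2.9")],
        [("http://c.test/", "192.0.2.3"), ("http://final.test/", "192.0.2.9")],
        [("http://d.test/", "198.51.100.1")],
    ]
    for url, vector in group_features(observed).items():
        print(url, vector)
```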