When users browse web pages on the Internet, malicious websites such as phishing websites, Trojan-planted websites, and fraudulent websites threaten their information security.
At present, methods for detecting malicious web pages based on web page text content can achieve satisfactory results. However, to bypass the detection engines of security-software vendors, hackers no longer design malicious websites that contain much web page text. Instead, they process malicious web pages by using encryption algorithms and web page virtualization technology, and add dependent web page jumps. A dependent web page jump means that, within a complete web page request, a downstream web page depends on related information of an upstream web page, for example, the Referer header or a cookie. Consequently, the web page results obtained by the detection engines lack text content characteristics, and the detection capability decreases sharply.
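By way of illustration only, a dependent web page jump can be modeled in a few lines of Python. The sketch below is not taken from any real malicious site: the URLs, the cookie name `session`, and all function names are hypothetical, and the "server" is simulated in memory rather than contacted over a network. It shows that a downstream page requested without the upstream state returns an empty page, whereas the same page requested with the upstream Referer and cookie returns its content.

```python
from typing import Dict, Tuple

# Hypothetical URLs and cookie values, chosen only for illustration.
UPSTREAM_URL = "http://example.test/landing"
DOWNSTREAM_URL = "http://example.test/payload"


def upstream_page() -> Tuple[str, Dict[str, str]]:
    """Return the upstream page body and the cookies it sets."""
    body = f'<a href="{DOWNSTREAM_URL}">continue</a>'
    return body, {"session": "abc123"}


def downstream_page(headers: Dict[str, str]) -> str:
    """Serve the real content only when the request carries upstream state."""
    has_referer = headers.get("Referer") == UPSTREAM_URL
    has_cookie = "session=abc123" in headers.get("Cookie", "")
    if has_referer and has_cookie:
        return "<html>real page content</html>"
    # Out-of-context requests receive a page with no text content.
    return "<html></html>"


def crawl_in_isolation() -> str:
    """Request the downstream URL directly, with no Referer or cookie."""
    return downstream_page({})


def crawl_following_chain() -> str:
    """Visit the upstream page first and carry its state downstream."""
    _, cookies = upstream_page()
    cookie_header = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return downstream_page({"Referer": UPSTREAM_URL, "Cookie": cookie_header})
```

Under these assumptions, a crawler that fetches each URL independently sees only the empty page, which is one way the text content characteristic can go missing.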
In the prior art, web page contents are generally retrieved by using static crawlers. The principle of static crawlers is similar to that of Wget. The name Wget derives from “World Wide Web” and “get”. Wget is a free tool for automatically downloading files from a network. It supports downloading via three of the most common protocols in the Transmission Control Protocol/Internet Protocol (TCP/IP) suite, namely, Hypertext Transfer Protocol (HTTP), Hypertext Transfer Protocol Secure (HTTPS), and File Transfer Protocol (FTP), and can use HTTP proxies.
Wget downloads web page contents, including Hypertext Markup Language (HTML), Cascading Style Sheets (CSS), JavaScript, and Flash files, for analysis by a detection engine. The detection engine has to rely on certain fixed components in web pages in order to protect against malicious web pages. However, learning these fixed components requires manual summarization and prior knowledge, and is both time-consuming and labor-intensive. The detection effectiveness is also unsatisfactory.
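The limitation of statically downloaded content can be sketched as follows. This is a minimal illustration using Python's standard `html.parser` module, not the implementation of any actual detection engine. The class name, function name, and sample pages are hypothetical. Because a static crawler does not execute scripts, a page whose text is written by JavaScript at load time yields no visible text for analysis.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text outside <script> and <style>, as a static analyzer might."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not script or style source.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a static, non-rendering extractor can see."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the body text exists only after the script runs,
# because it is stored as a Base64-encoded string.
OBFUSCATED_PAGE = """<html><head><title></title></head><body>
<script>document.write(atob("V2luIGEgcHJpemU="));</script>
</body></html>"""

# Hypothetical page carrying the same text directly in the markup.
PLAIN_PAGE = "<html><body><p>Win a prize</p></body></html>"
```

In this sketch, `visible_text(PLAIN_PAGE)` returns the page text, while `visible_text(OBFUSCATED_PAGE)` returns an empty string, so a text-based detection engine has nothing to examine unless the page is rendered first.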
Some security-software vendors with strong research and development capabilities have tried using active crawlers. An open-source browser kernel (a layout engine such as WebKit or Gecko) is wrapped so that a crawler can render a web page. The rendered web page content is then exported for analysis by the detection engine.
However, the above-mentioned detection solutions cannot address newly emerged malicious websites.