Computer users, in addition to being exposed to malware attached to e-mail messages, are now facing malicious software threats from the World Wide Web. Even careful users who only visit trusted Web sites may fall victim if these sites are compromised by a malicious hacker. A hacker can hijack a legitimate site and inject malicious URLs into the Web pages of that site that either open automatically or, when selected, redirect innocent users to a site containing malware. The computer user would then unknowingly be exposed to malware which might be downloaded to the user's computer. Typically a so-called “landing page” Web page is the target of these hackers, but any Web page is susceptible to this type of attack.
FIG. 1 illustrates a prior art environment in which a legitimate Web site is hijacked. Legitimate host Web sites are located on the World Wide Web and hosted upon server computers 10, 20 and 30 that are connected to the Internet. Malicious hackers are able to compromise these host sites and inject a malicious URL into one of the pages of these host sites as indicated by Trojan horse symbols 11, 21 and 31 (symbolizing malware known as a Trojan which might later infect a customer's computer via the malicious URL). There may be any number of malicious URLs injected into a page or pages of a legitimate host site, and a given host site may be compromised by more than one hacker.
A malicious URL is termed malicious because it has been inserted into the legitimate host site by a hacker and links to a malicious Web site, such as a malicious site hosted on computer 40 or 50. Accordingly, Web pages 60 and 70 of legitimate sites located on computers 20 and 30 have been compromised and now include at least one malicious URL 62 or 72. When a customer uses computer 80 to view the legitimate site on computer 20 or 30, the Web browser at computer 80 will automatically follow the malicious URL link 62 or 72, in which case malicious content from computer 40 or 50 may be downloaded 90 onto the user's computer 80 without user consent. Such a situation is to be avoided.
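As a minimal sketch of why such an injected link needs no user action, the following Python lists the URLs in a page that a browser would fetch automatically and that point to a host other than the one serving the page. The choice of tags (iframe and script), the class and function names, and the example hosts are assumptions made for illustration only; they are not taken from FIG. 1, and a URL on a different host is not necessarily malicious.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AutoLoadCollector(HTMLParser):
    """Collects off-site URLs that a browser loads without a click."""

    # Assumption: only these tag/attribute pairs are treated as auto-loaded.
    AUTO_TAGS = {"iframe": "src", "script": "src"}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlsplit(page_url).hostname
        self.external = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.AUTO_TAGS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if not value:
            return
        # Resolve relative references against the page's own URL.
        absolute = urljoin(self.page_url, value)
        if urlsplit(absolute).hostname != self.page_host:
            self.external.append(absolute)


def external_auto_loads(page_url, html):
    """Return auto-loaded URLs whose host differs from the page's host."""
    collector = AutoLoadCollector(page_url)
    collector.feed(html)
    collector.close()
    return collector.external


# Hypothetical compromised page: one local script, one injected hidden frame.
sample_page = (
    '<html><body><script src="/js/app.js"></script>'
    '<iframe src="http://evil.example/x" width="0" height="0"></iframe>'
    "</body></html>"
)
found = external_auto_loads("http://shop.example/index.html", sample_page)
```

Here only the hidden frame is reported, because the local script resolves to the same host as the page itself.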
Currently, the Google Safe Browsing project provides a diagnostic page for various Web sites that records linking information reflecting the security of that site. For example, for a given Web site, the diagnostic page lists the number of pages at the site that result in malicious software being downloaded and installed onto a user computer without user consent, lists the type of malware and how many new processes are surreptitiously started on the target machine, and mentions whether the site is an intermediary in the further distribution of malware and whether or not the site has hosted malware recently. For example, a diagnostic page for the site “yahoo.com” reports that in a 90-day period in 2009 twenty pages of the site resulted in malicious software being downloaded, and that the malicious software included over 200 different malware programs. Unfortunately, this project does not propose any solutions to this problem.
One proposed solution is to crawl all Web sites in order to find malicious sites and uncover the various attack vectors located at these sites. Aside from the problem of the sheer number of sites that must be crawled, many malicious sites are sophisticated enough (or the hacker is) to detect whether a crawler is from an antivirus company and may be able to take evasive action to avoid detection. Further, most malicious sites change their domains frequently, so the problem becomes a moving target that cannot be hit.
Another current approach is to reference a blacklist that lists which URLs are suspected of being malicious and then screen these out. A similar approach uses a “Simple RegExp” match to identify an unknown malicious site. The problem with these approaches is that they cannot identify new threats located at newly formed malicious Web sites. Further, it can be challenging to detect a malicious URL in a Web page because there is often not enough direct information to make a positive identification, and it can be difficult to retrieve the contents of pages pointed to by these malicious URLs.
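The blacklist and regular-expression screening described above might be sketched in Python as follows. The host names, the pattern, and the function name is_blocked are hypothetical and chosen only for illustration; no actual blacklist format or “Simple RegExp” rule is specified here.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blacklist of hosts already known to be malicious.
BLACKLISTED_HOSTS = {"malware.example", "bad.example"}

# Hypothetical pattern: executable-looking targets or an "/exploit/" path.
SUSPICIOUS_PATTERN = re.compile(r"\.(exe|scr)(\?|$)|/exploit/", re.IGNORECASE)


def is_blocked(url: str) -> bool:
    """Return True if the URL is on the blacklist or matches the pattern."""
    host = (urlsplit(url).hostname or "").lower()
    if host in BLACKLISTED_HOSTS:
        return True
    return SUSPICIOUS_PATTERN.search(url) is not None


known_bad = is_blocked("http://malware.example/index.html")
pattern_hit = is_blocked("http://unlisted.example/payload.exe")
# A newly formed malicious site that is neither listed nor pattern-matching.
new_threat = is_blocked("http://fresh-threat.example/page.html")
```

In this sketch the first two URLs are screened out, while the third passes unnoticed, which is the limitation noted above: a site that is not yet listed and does not match any known pattern cannot be identified.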
Accordingly, a new approach is desired to be able to detect malicious URLs in legitimate Web sites in order to prevent malware from being downloaded to a user computer.