Electronic content, such as web pages, search results, and other types of documents, often includes links to other documents, web pages, and the like. A link is a connection from one page to another that can be selected from a first page to cause the other page to appear in a web browser application or the like. The links on a page can be defined by the author of the page, or can be added to the page automatically, e.g., by an advertising system that adds advertisements to the page. Pages can be generated by an automated system such as a search engine, which identifies web pages that are available via a network such as the Internet, and adds links to those pages based upon the addresses that are found by the search engine. Attracting users to a web site can be desirable for a number of reasons. Illegitimate or malicious web sites can attempt to gather private information such as email addresses and passwords from users, e.g., through phishing attacks. There are also more benign reasons to attract users to a web site, such as to increase the site's traffic, the number of times advertisements have been viewed on the site, and so on.
Web link spoofing attacks have been developed to attract users to web sites that the users do not intend to visit. These spoofing attacks deceptively present an illegitimate web link that appears legitimate. For example, suppose that a web site named Good Web Site has a legitimate web link good.com. The link good.com looks similar to the link g00d.com, in which each letter o is replaced by the number 0. The two links look particularly similar if they are displayed in uppercase, i.e., G00D.COM and GOOD.COM. As another example, the letter l in a legitimate link can be changed to a number 1 to create an illegitimate link that is visually similar to the legitimate link.
URLs can contain characters from numerous international languages. There are a number of characters in different languages that look alike. Characters that look alike are referred to as homographs. URL spoofing attacks that take advantage of the visual similarities between different characters that can be from different languages are thus referred to as Internationalized Domain Name (IDN) homograph attacks. For example, the English letter c (pronounced cee) looks similar to the Russian letter c (pronounced ess). A URL that includes an English c, such as chase.com, can be spoofed by a URL that uses a Russian c in place of the English c, and looks very similar, such as chase.com. A user can be lured to an illegitimate version of the chase.com web site that is registered to the spoofed chase.com domain name by presenting a hyperlink having a URL that refers to the spoofed chase.com. Users are unlikely to see the difference between the legitimate and spoofed domain names, and thus unlikely to be aware that they are accessing an illegitimate web site, particularly if the illegitimate web site's appearance is similar to that of the legitimate chase.com site.
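As an illustration of the homograph problem described above, the following minimal Python sketch flags a host name whose labels mix letters from more than one script. The function names scripts_in_label and looks_mixed_script are hypothetical, and taking the first word of each Unicode character name as its script is a simplifying assumption rather than a complete script classification.

```python
import unicodedata


def scripts_in_label(label):
    """Return a rough set of script names for the letters in one host label."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits and hyphens carry no script information here
        # Unicode names begin with the script, e.g. "CYRILLIC SMALL LETTER ES".
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def looks_mixed_script(hostname):
    """True if any dot-separated label mixes letters from several scripts."""
    return any(len(scripts_in_label(label)) > 1 for label in hostname.split("."))


legitimate = "chase.com"
spoofed = "\u0441hase.com"  # first letter is the Cyrillic letter es (U+0441)

print(looks_mixed_script(legitimate))  # all letters are Latin
print(looks_mixed_script(spoofed))     # Cyrillic and Latin in one label
```

A check of this kind does not catch a spoofed name written entirely in one non-Latin script, so it can only be one signal among several.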
Because of such visual similarity between different characters, users can be lured into clicking on or selecting the illegitimate link when they intend to access the legitimate web site. When the user follows the web link to the illegitimate g00d.com web site, the illegitimate web site is loaded and displayed, and the spoofing attack has succeeded. The illegitimate site can, for example, display information or advertisements, attempt to convince the user to perform a transaction, request information from the user, attempt to install malware or spyware on the user's computer, and perform other malicious or potentially damaging operations. Spoofed web links can lead users to phishing attacks, in which an illegitimate site is designed to mimic sites that contain important user information and convince the user to log in, thereby providing an attacker with their user name and password.
A web link ordinarily has two parts: link text and a reference to a target web page, such as a Uniform Resource Locator (URL) that identifies the target web page. The link text is displayed on a web page to visually represent the link. The link text can be clicked on or selected to cause a web browser to load the target web page referred to by the URL. The link text can also be referred to as anchor text, a link label, or a link title.
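The two parts of a link described above can be made concrete with a short Python sketch that reads an HTML fragment and returns each link as a pair of link text and URL. It uses the standard library html.parser module; the class name LinkExtractor and the function extract_links are hypothetical names chosen for this illustration, and nested or malformed anchors are not handled.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (link_text, href) pairs from anchor elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # completed (link_text, href) pairs
        self._href = None  # href of the anchor currently open, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


page = '<p>Visit <a href="http://www.good.com">Good Web Site</a> today.</p>'
print(extract_links(page))
```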
For example, in the Good Web Site example, a link to the site can have link text such as “Good Web Site” or “www.good.com”, and a link URL such as www.good.com. In one aspect, a legitimate link URL correctly references the web site described or implied by the link's link text, such as www.good.com. An illegitimate link has a URL, such as g00d.com, that references a web site different from that described or implied by the link text. The link text is not necessarily the same as the link URL, and can be a description or name of the web page referred to by the link instead of a textual copy of the link URL. However, illegitimate links often set the link text to a URL, at least in part because users are more likely to trust and follow a link that is displayed as a legitimate-looking URL, as opposed to a link displayed as a word or phrase. Therefore, illegitimate links can set the link text to a legitimate URL and set the link URL to an illegitimate URL in an attempt to lure users into following what appears to be the legitimate URL. Alternatively, illegitimate links can set the link text to an illegitimate URL, e.g., g00d.com, that looks similar to the legitimate URL, such as good.com, and again set the link URL to the illegitimate URL, g00d.com, so that a comparison of the characters that represent the link text to the characters that represent the link URL will indicate that both are the same, and such a comparison will not identify the link as illegitimate.
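The limitation of a character-by-character comparison can be sketched in Python as follows. This is only an illustration under stated assumptions, not a method taken from this document: the LOOKALIKES table is a deliberately tiny, hypothetical mapping covering the 0/o and 1/l substitutions mentioned earlier, and the set of known legitimate hosts is assumed to be supplied by the caller.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny table of look-alike substitutions.
LOOKALIKES = str.maketrans({"0": "o", "1": "l"})


def host_of(text):
    """Extract a lowercase host from a URL or a bare host such as 'good.com'."""
    if "://" not in text:
        text = "//" + text  # lets urlsplit treat a bare host as a network location
    return (urlsplit(text).hostname or "").lower()


def skeleton(host):
    """Fold look-alike characters together so similar hosts compare equal."""
    return host.translate(LOOKALIKES)


def naive_check(link_text, link_url):
    """Character comparison of link text and link URL hosts."""
    return host_of(link_text) == host_of(link_url)


def resembles_known_site(link_url, known_hosts):
    """Return the known host that link_url imitates, or None."""
    host = host_of(link_url)
    if host in known_hosts:
        return None  # exact match with a known legitimate host
    for known in known_hosts:
        if skeleton(host) == skeleton(known):
            return known
    return None


# The naive comparison passes because text and URL are both the spoofed name.
print(naive_check("g00d.com", "http://g00d.com"))
# Comparing against an assumed list of known hosts exposes the imitation.
print(resembles_known_site("http://g00d.com", {"good.com"}))
```

The sketch shows why agreement between link text and link URL is not evidence of legitimacy: both can carry the same spoofed name, and only a comparison against something outside the link reveals the substitution.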
Although the link text is ordinarily displayed on web pages to represent the link, the user is able to view the link URL itself, e.g., by placing a cursor or mouse pointer over the link text, or in the browser's address bar when the user actually opens the illegitimate page. Thus, the user can attempt to visually verify that the URL is legitimate by placing the mouse pointer over the link text prior to clicking on the link, and checking the URL that is displayed. The user can also attempt to visually verify the URL by clicking on the link, allowing the target page to begin loading, and checking the URL that is displayed in the browser's address bar. In either situation, if the URL appears to be illegitimate, e.g., because it references a web site or contains text that does not appear to be related to the link text, then the user can decide to ignore the link or the loaded target page. However, if the URL appears to be legitimate, then the user is likely to follow the illegitimate link or read the illegitimate loaded target page. It would be desirable, therefore, to protect users against web link spoofing, so that users do not unintentionally access illegitimate web pages.
Existing techniques for blocking web link spoofing attacks include filtering based on heuristics, and blocking sites that appear on lists of known unsafe pages. The heuristics can be used to identify suspicious messages and to require additional effort, e.g., a confirmation input, by users. Both the filtering and the site blocking lists can fail against modern attacks. The filtering technique can fail for a number of reasons, such as a relatively high false-positive rate that leads users to disable the features or ignore warnings, even for content that is actually an attack. Further, the attacker can design messages to avoid detection, e.g., by indirectly determining the heuristics used by the filter, or by directly testing their messages against their own copy of the filtering software. For example, an attacker could send their message to themselves, and change the message until it passes through the filtering software. The site blocking lists fail because of the delay between the start of the attack and detection of the attack. A potentially large number of victims can be attacked before the attack is detected. Neither of these techniques works against attacks that are specifically targeted against a small number of victims. Targeted attacks involve tailoring messages to bypass filtering, and the small volume of attacks reduces the likelihood of the phishing site being detected at all, let alone early enough to detect all attacks.