E-commerce Web sites and fiduciary institutions such as banks and credit card companies face an increasing problem posed by phishing. Phishing may be generally defined as creating fake copies of legitimate Web sites and then using various ruses to try to direct unwary users to the fraudulent sites to gather identity-related information for use in subsequent fraudulent transactions. Typically, a phishing site replicates unique and easily recognizable portions of a legitimate site, such as its trademarks or logos, or familiar text instructions, to delude the user into thinking he or she is on the legitimate site. Often page structure, images and text are copied directly from the legitimate site to the phishing site, so that portions of the phishing site are identical to the legitimate site. To thwart phishing, site owners constantly warn their customers not to give out identity-related information, but such warnings are insufficiently heeded in the face of clever phishing techniques.
There are two principal phishing techniques in vogue. In one phishing technique, the user is lured to the phishing site by means of a phony email message, purporting to be from the legitimate site owner, requesting the user to access a site whose link appears in the email and to enter information—such as the user's user id and password—to prevent some imminent undesired consequence, such as having the account closed. Attempts to counter this phishing technique generally are aimed at the email message used as the lure, by adopting enhanced security arrangements.
In another phishing technique, the lure is not email but the ubiquitous use of public search sites (e.g., Google or Yahoo) to find items of interest to the user. In this technique, the phishing site mimics a site that can be expected to be the target of public search requests, and relies on the similarity of the site to a genuine site and the searcher's inability to distinguish legitimate from fraudulent sites in a list of sites found in a search report. For example, during periods following natural disasters, many relief agencies solicit funds, and sites set up to accept donations will be located through Web searches using general search terms such as “tsunami relief efforts” or “Darfur relief efforts”. Legitimate sites are accessible to phishers, and accordingly they are able to “borrow” a substantial amount of content, such as photos of destruction, letters of appeal and gratitude, and other content for use on a phishing site. The phishing site takes advantage of the popularity of the event and the relative anonymity and/or obscurity of the relief agencies to lure unsuspecting users to the phony sites, which then request information, usually credit card information, to be subsequently used in fraudulent transactions.
A related phishing technique, also dependent on searches but this time on a flawed search input, sets up sites whose URLs are one keystroke error away from a legitimate site's URL, such as www.banklfamerica.com, so that a mistyped address takes the user to a phishing site.
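The notion of a URL that is "one keystroke error away" from a legitimate one can be illustrated with a standard edit-distance check. The following is only a minimal sketch, assuming that a single inserted, deleted, or substituted character counts as one keystroke error; the function names are hypothetical and are not taken from any of the cited references.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def is_one_keystroke_away(candidate: str, legitimate: str) -> bool:
    """Assumed definition: exactly one character edit separates the two."""
    return edit_distance(candidate.lower(), legitimate.lower()) == 1


if __name__ == "__main__":
    # The example from the text differs from the genuine URL in one letter.
    print(is_one_keystroke_away("www.banklfamerica.com",
                                "www.bankofamerica.com"))  # True
```

A check of this kind only flags look-alike addresses; it says nothing about whether the page behind the address copies content from the legitimate site.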
In each of these phishing techniques, while the paths urging the user toward the phishing site may differ, the attempt is to lure the user to a fake Web site that mimics substantial portions of the legitimate Web site but contains a “hook”—a request for confidential identity information that, when supplied, can be used to complete fraudulent transactions.
Legitimate owners of fraudulently copied Web sites may lose business or donations. In addition, companies with which users have a fiduciary connection, such as a bank, a credit card company, or an e-commerce site (e.g., Amazon.com), may have to bear all or some of the costs if the customer's account is defrauded. Credit card issuers often absorb the costs of fraudulent card use and may be required by law to limit the card user's liability. Users, even if reimbursed for direct account losses, may suffer temporary loss of credit, impairment to their credit ratings and an enormously difficult and time-consuming job of getting the affair resolved and records corrected.
Some approaches to thwart access to phishing sites have been adopted by browser program suppliers. Some examples include the deactivation of links in received emails, or alerting users that various sites they are accessing “might be” phishing sites if they have any characteristics the browsers may choose to associate with phishing sites. However, the present state of the art is such that phishing filters in browsers typically produce so many false positives or warnings that they frequently are seen by users as an annoying interference, and users choose to “continue” to access the sites despite the warnings.
Accordingly, it would be advantageous to enable the Web to be effectively and quickly searched to locate phishing Web sites having a structural similarity to a known site so that they can be countered before they are able to inflict significant harm. There is a further need to provide practical and economical methods arranged and configured to enable such detection.
There has been considerable work done in the prior art on structural comparison of Web sites, primarily in the context of operating search engines to detect the presence of mirrored Web sites and to disregard them so as to reduce the ongoing crawling work that a spider has to do in maintaining a search index, and to reduce redundancy in responses to a client's search query.
For example, U.S. Pat. No. 6,286,006 to Bharat et al. detects mirrored host pairs using information about a large set of pages, including URLs. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored Web sites and not recrawl them or return redundant information in response to a search request. In the disclosure of this patent, a search engine looks at the URLs of pages' hosts to determine whether the hosts are potentially mirrored.
In another example, U.S. Pat. No. 6,658,423 to Pugh et al. discloses duplicate and near-duplicate detection techniques for operating a search engine which assign a number of fingerprints to a given document by extracting parts from the document, assigning the extracted parts to one or more of a predetermined number of lists, and generating a fingerprint from each of the populated lists. Two documents are considered to be near-duplicates if any one of their fingerprints matches.
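The general shape of such a multi-fingerprint comparison can be sketched as follows. This is a loose illustration of the description above and not the method claimed by Pugh et al.: treating whitespace-separated words as the extracted parts, assigning each part to exactly one list by a stable hash, using four lists, and using SHA-256 for the fingerprints are all assumptions made for the example, as are the function names.

```python
import hashlib


def _stable_hash(text: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprints(document: str, num_lists: int = 4) -> dict[int, str]:
    """Map list index -> fingerprint, for populated lists only."""
    lists: list[list[str]] = [[] for _ in range(num_lists)]
    # Assumption: the extracted "parts" are lowercase words, and each
    # word goes to exactly one list chosen by its hash.
    for word in document.lower().split():
        lists[_stable_hash(word) % num_lists].append(word)
    return {
        i: hashlib.sha256(" ".join(parts).encode("utf-8")).hexdigest()
        for i, parts in enumerate(lists)
        if parts  # skip empty lists so they never produce a match
    }


def near_duplicates(doc_a: str, doc_b: str, num_lists: int = 4) -> bool:
    """True if any fingerprint of doc_a equals the fingerprint that
    doc_b produced for the same list."""
    fa = fingerprints(doc_a, num_lists)
    fb = fingerprints(doc_b, num_lists)
    return any(fa[i] == fb[i] for i in fa.keys() & fb.keys())
```

Under these assumptions, replacing a single word changes at most two fingerprints (the list the old word left and the list the new word joined), so a lightly edited copy can still share a fingerprint with the original. The sketch also makes visible that both documents must be available in full before any fingerprint can be computed.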
These previous techniques are adapted to find mirrored Web sites, which either are identical to hosts or are “near-duplicate” copies with insignificant content differences from the host. Pugh et al. additionally claim to be able to detect copyright infringements. However, these techniques would not be practical solutions for locating phishing sites. First, they involve the work of completely crawling the Web (a process which is neither economical nor quick) to look for near-replicas of specific pages or portions of a Web site, and their purpose is essentially to remove those replicas from future consideration. To detect phishing, by contrast, it is desirable to be able to quickly find all instances in which selected portions of one known Web site (or a few known Web sites) are present elsewhere in the Web.
The detection techniques of U.S. Pat. Nos. 6,286,006 and 6,658,423 are also not appropriate for detecting phishing sites because they require starting with a complete copy of the URLs or contents of all the sites on the Web before looking for duplications. Pugh et al. explicitly require the presence of Web documents in toto before the fingerprints used to detect duplication can be assigned. Because phishing Web sites are an extremely small and short-lived fraction of all Web sites, these prior art techniques (designed for the very different purpose of countering the adverse effects on Web searching of many legitimate forms of Web redundancy) are neither sufficiently focused on the desired result nor sufficiently fast to be useful in detecting phishing sites.
Another form of structural comparison is disclosed in Sergey Brin, James Davis and Hector Garcia-Molina, “Copy Detection Mechanisms for Digital Documents,” Proceedings of the ACM SIGMOD Annual Conference, San Jose 1995 (May 1995) incorporated herein by reference. An available version of the paper can be found, for example, at http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1995-43&format=pdf&compression=&name=1995-43.pdf. This paper discloses a method which determines whether an identified document is a copy of a specific preidentified copyrighted article. As described in the paper “the service will detect not just exact copies, but also documents that overlap in significant ways.” However, the method requires that the document to be tested for legitimacy be identified to start with, and thus would not be of use in finding a “phishing” web site whose location and existence are unknown.
Accordingly, there remains a need for a method for detecting phishing sites that is effective, that is efficient in the sense that it does not require massive computational capacity, and that at the same time is quick and simple, so that legitimate Web site owners can be made aware of phishing sites promptly and without great cost.