The invention relates to methods and systems for classifying electronic documents, and in particular to systems and methods for detecting fraudulent webpages.
Internet fraud, especially in the form of phishing and identity theft, has been posing an increasing threat to internet users worldwide. Sensitive identity information and credit card details obtained fraudulently by international criminal networks operating on the internet are used to fund various online transactions, and/or are further sold to third parties. Besides direct financial damage to individuals, internet fraud also causes a range of unwanted side effects, such as increased security costs for companies, higher retail prices and banking fees, declining stock values, lower wages and decreased tax revenue.
In an exemplary phishing attempt, a fake website (also termed a clone) may pose as a genuine webpage belonging to an online retailer or a financial institution, asking the user to enter some personal information (e.g., username, password) and/or financial information (e.g., credit card number, account number, security code). Once the information is submitted by the unsuspecting user, it may be harvested by the fake website. Additionally, the user may be directed to another webpage which may install malicious software on the user's computer. The malicious software (e.g., viruses, Trojans) may continue to steal personal information by recording the keys pressed by the user while visiting certain webpages, and may transform the user's computer into a platform for launching other phishing or spam attacks.
Software running on an Internet user's computer system may be used to identify fraudulent web documents and to warn the user of a possible phishing threat. Several approaches have been proposed for identifying a clone webpage. These strategies include matching the webpage's internet address to lists of known phishing or trusted addresses (techniques termed black- and white-listing, respectively). Phishers often change the locations of their websites frequently, which limits the effectiveness of blacklisting.
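By way of illustration only, address matching against black- and white-lists may be sketched as follows. The list contents, the function name classify_address, and the choice to match on the host name alone are assumptions made for this sketch and are not taken from any particular system.

```python
from urllib.parse import urlsplit

# Hypothetical example lists; a deployed system would load maintained lists instead.
BLACKLIST = {"phish.example.net"}
WHITELIST = {"www.example-bank.com"}


def classify_address(url: str) -> str:
    """Return 'trusted', 'phishing', or 'unknown' based on the URL's host name."""
    # urlsplit().hostname is lower-cased by the standard library; it is None
    # when the URL carries no network location.
    host = urlsplit(url).hostname or ""
    if host in WHITELIST:
        return "trusted"
    if host in BLACKLIST:
        return "phishing"
    # An address absent from both lists cannot be decided by list matching.
    return "unknown"
```

In this sketch, a clone moved to a new host name falls into the "unknown" case, which illustrates why frequent relocation limits the effectiveness of blacklisting.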
In the article “Detection of Phishing Webpages based on Visual Similarity,” WWW 2005, May 10-14, 2005, Chiba, Japan, published by the Association for Computing Machinery (ACM), Wenyin et al. describe an approach to detecting phishing websites based on visual similarity. The approach can be used to search for suspicious webpages which are visually similar to true webpages. The approach uses three metrics: block level similarity, layout similarity, and overall style similarity. A webpage is first decomposed into a set of salient blocks. The block level similarity is defined as the weighted average of the similarities of all pairs of matched blocks. The layout similarity is defined as the ratio of the weighted number of matched blocks to the total number of blocks in the true webpage. The overall style similarity is calculated based on the histogram of the style feature. The normalized correlation coefficient of the two webpages' histograms is the overall style similarity.
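The three metrics summarized above may be sketched as follows. This is a minimal illustration and not the implementation of Wenyin et al.: the function names, the representation of matched blocks as (weight, similarity) pairs, the reading of "weighted number of matched blocks" as a sum of weights, and the mean-centered form of the normalized correlation coefficient are all assumptions made for this sketch. Block decomposition and block matching are not shown.

```python
import math
from typing import Sequence, Tuple


def block_level_similarity(matched: Sequence[Tuple[float, float]]) -> float:
    """Weighted average of similarities over matched block pairs.

    Each item is an assumed (weight, similarity) pair for one matched pair.
    """
    total_weight = sum(w for w, _ in matched)
    if total_weight == 0:
        return 0.0
    return sum(w * s for w, s in matched) / total_weight


def layout_similarity(matched_weights: Sequence[float],
                      total_blocks_in_true_page: int) -> float:
    """Weighted number of matched blocks over the block count of the true page."""
    if total_blocks_in_true_page == 0:
        return 0.0
    return sum(matched_weights) / total_blocks_in_true_page


def style_similarity(hist_a: Sequence[float], hist_b: Sequence[float]) -> float:
    """Normalized correlation coefficient of two equal-length style histograms.

    Assumption: the mean-centered (Pearson-style) form is used.
    """
    if len(hist_a) != len(hist_b) or not hist_a:
        raise ValueError("histograms must be non-empty and of equal length")
    mean_a = sum(hist_a) / len(hist_a)
    mean_b = sum(hist_b) / len(hist_b)
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(hist_a, hist_b))
    den = math.sqrt(sum((a - mean_a) ** 2 for a in hist_a)
                    * sum((b - mean_b) ** 2 for b in hist_b))
    # A flat histogram has zero variance; treat the correlation as 0 in that case.
    return num / den if den else 0.0
```

For example, under these assumptions two matched pairs with weights 1 and 3 and similarities 0.5 and 1.0 give a block level similarity of 0.875, and two matched blocks of unit weight against a true webpage of four blocks give a layout similarity of 0.5.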