As the use of Internet technology has grown, so has the number of fraudulent and malicious computer-based attacks. Phishing, an attempt to trick an Internet user into disclosing his or her personal information through a fraudulent webpage, is an example of such a malicious attack. The information typically targeted by a phishing attacker includes a user's login information and personal data (e.g., login identification (ID) and password, credit card details, and personal account information).
A fraudulent webpage is a look-alike of a well-known site. Anybody can host a site on the Internet, or buy a domain name or an SSL (Secure Sockets Layer) certificate, with little or no verification. Though there are rules against fraudulent uses of these privileges, criminals continue to abuse the Internet. Taking remedial action against these scammers requires considerable time, effort, and resources. This problem is compounded by the fact that many of these sites are hosted across international boundaries, where the legal systems are different or ineffective.
Phishing attackers use various approaches to direct traffic to the fraudulent webpage. For example, they may distribute legitimate-looking emails by the millions (“spam”). Some use social engineering techniques. Though savvy users can easily identify such emails, a portion of novice users are victimized.
The most common technique used by phishing attackers is to capture user traffic. Here the phishing attackers team up with cybercriminals and release malware that tampers with DNS (Domain Name System) discovery, or with the browser settings and data, of unsuspecting users in order to redirect their traffic to the fraudulent site. Because new software is released continually, new vulnerabilities are found on a regular basis, and there is always a lag before the vulnerable systems are patched.
Typosquatting, or uniform resource locator (URL) hijacking, is another technique phishing attackers often use. Scammers host their site on domains that are spelled like the original and then rely on a typing error by an unsuspecting user to score a hit. In addition, by manipulating search engine rankings, a phishing site may get indexed on search engines and have traffic sent to it. The attacker can also use Internet advertisements that forge the logos or other service marks or trademarks of legitimate businesses and whose target link points to the phishing host. Often, the advertisements use social engineering techniques such as offering cash back or a similar incentive to the user for clicking the link.
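By way of illustration only, the look-alike spelling on which typosquatting relies can be quantified with an edit distance between a candidate domain and a list of protected domains. The following is a minimal sketch, not a method taken from the approaches described here: the Levenshtein distance, the `PROTECTED` list, the function names, and the `max_distance` threshold are all assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


# Hypothetical list of legitimate domains to protect (assumption for the example).
PROTECTED = ["example-bank.com", "shop-example.com"]


def likely_typosquat(domain, protected=PROTECTED, max_distance=2):
    """Return the protected domain that `domain` closely imitates, or None.

    A distance of 0 means the domain is the protected domain itself,
    so only distances between 1 and max_distance are reported.
    """
    domain = domain.lower()
    for legit in protected:
        if 0 < edit_distance(domain, legit) <= max_distance:
            return legit
    return None
```

A small threshold keeps false positives low but misses longer manipulations, so in practice the threshold would need tuning against real domain data.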
Financial institutions and e-commerce websites are the usual targets for these phishing attackers. From these websites, the criminals can quickly transfer funds or purchase goods using stolen accounts before being traced. Hence, it is in the best interest of these legitimate businesses to quickly find the phishing hosts and take them down.
Various solutions are available to protect against phishing. For example, constantly scanning and filtering spam emails may reduce the chance of phishing attacks. Keeping the filtering rules up to date is also important for the filtering to remain effective.
Other techniques focus on identifying phishing webpages by looking into different parameters such as the URL and the webpage content. The heuristic-based models generally first identify the characteristics of the URL by using a single heuristic or a set of heuristics based on URL similarity calculation, domain name probability evaluation, the number of external links on the webpage, the IP (Internet Protocol) address, and the port number. Metadata about the webpage, specifically ranking, registration information, and category, is also obtained by querying third-party services such as WHOIS and Google PageRank.
The phishing webpages are then identified by comparing the heuristic results and metadata information of the suspected webpage with those of the legitimate webpage. Most of the methods in this category are based on a blacklist populated with already identified phishing webpages.
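As a rough illustration of the heuristic and blacklist checks described above, the sketch below derives two of the named URL characteristics (an IP address used as the host, and a non-default port) and consults a blacklist. The function names, the returned labels, and the example blacklist are assumptions made for the example; the metadata lookups (WHOIS, ranking) are deliberately left out.

```python
import ipaddress
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def url_heuristics(url: str) -> dict:
    """Extract simple URL characteristics used by heuristic detectors."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    try:
        ipaddress.ip_address(host)  # raises ValueError for ordinary host names
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    port = parts.port  # None when the URL does not state a port
    nonstandard_port = port is not None and port != DEFAULT_PORTS.get(parts.scheme)
    return {"host": host, "host_is_ip": host_is_ip, "nonstandard_port": nonstandard_port}


def classify_url(url: str, blacklist: set) -> str:
    """Return 'blacklisted', 'suspicious', or 'unknown' (labels are illustrative)."""
    features = url_heuristics(url)
    if features["host"] in blacklist:
        return "blacklisted"
    if features["host_is_ip"] or features["nonstandard_port"]:
        return "suspicious"
    return "unknown"


# Hypothetical blacklist of previously identified phishing hosts.
blacklist = {"phish.example.net"}
print(classify_url("http://192.0.2.7:8080/login", blacklist))
```

A blacklist lookup of this kind can only flag hosts that have already been reported, which is why it is combined here with heuristics that do not depend on prior sightings.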
As an alternative heuristic method, whitelist-based methods prevent access to explicit phishing sites and warn of suspected phishing by using URL similarity checks. The whitelist is built from the webpages to which the user has previously submitted personal information. When the user accesses a webpage, its URL and IP information is passed to a module called the Access Enforcement Facility (AEF), which checks whether the webpage is a phishing site. The AEF module produces its result by simply comparing the URL and IP of the webpage with those of the trusted webpages already stored in the whitelist.
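The comparison performed by a whitelist check of this kind might be sketched as follows. This is not the AEF implementation, whose internals are not described here: the whitelist layout (host mapped to a set of trusted IP addresses), the use of `difflib.SequenceMatcher` as the URL similarity measure, the 0.8 threshold, and the returned labels are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def check_access(url: str, ip: str, whitelist: dict, threshold: float = 0.8) -> str:
    """Compare the host and IP of a visited page with the trusted entries.

    whitelist maps a trusted host name to the set of IP addresses
    previously seen for it (hypothetical layout).
    """
    host = (urlsplit(url).hostname or "").lower()
    if host in whitelist:
        # Known host: trust it only if the IP also matches a stored one.
        return "trusted" if ip in whitelist[host] else "suspicious"
    for trusted_host in whitelist:
        # Unknown host that closely resembles a trusted one is a warning sign.
        if SequenceMatcher(None, host, trusted_host).ratio() >= threshold:
            return "suspicious"
    return "unknown"


# Hypothetical whitelist built from sites where the user has logged in before.
whitelist = {"example-bank.com": {"192.0.2.10"}}
print(check_access("https://examp1e-bank.com/login", "203.0.113.5", whitelist))
```

A known host served from an unexpected IP address is reported as suspicious rather than blocked outright, since legitimate sites also change addresses.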
The risk-assessment-based methods quantitatively analyze the security risk of a webpage and evaluate that risk against phishing attacks. The security risk is calculated through a number of steps: definition of the risk elements; assignment of weights among the risk elements; measurement of the risk grade; calculation of the total security index; calculation of the maximum security index; and calculation of the webpage security risk index.
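The steps listed above do not fix a particular formula, so the following sketch shows only one plausible way they could fit together. It assumes that the total security index is a weighted sum of per-element grades, that the maximum security index is the same sum with every element at the top grade, and that the risk index is one minus their ratio. The element names, weights, and the grade scale of 0 to 5 are hypothetical.

```python
def security_risk_index(grades: dict, weights: dict, max_grade: float = 5.0) -> float:
    """Return a risk index in [0, 1]; higher means riskier (assumed formula).

    grades  -- measured security grade per risk element, 0..max_grade
    weights -- relative weight per risk element
    """
    total_index = sum(weights[name] * grades[name] for name in weights)
    max_index = sum(weight * max_grade for weight in weights.values())
    if max_index == 0:
        raise ValueError("at least one risk element needs a non-zero weight")
    return 1.0 - total_index / max_index


# Hypothetical risk elements, weights, and grades.
weights = {"certificate": 2.0, "domain_age": 1.0, "host_is_name": 1.0}
grades = {"certificate": 5.0, "domain_age": 0.0, "host_is_name": 0.0}
print(security_risk_index(grades, weights))
```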
The other broad category of phishing detectors is the content-based methods. The webpage content-based methods primarily extract a variety of page elements (such as keywords, description, copyright information, frame URLs, image URLs, and the corresponding texts of these links) as features, and use different machine learning tools (e.g., classifiers) to detect phishing attacks. These methods primarily use heterogeneous feature classifiers to classify the features and then, using suitable (ensemble) learning methods, train the phishing detector to identify phishing attacks.
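For illustration, the feature-extraction stage of such a content-based method could look like the sketch below, which collects several of the page elements named above using only the Python standard library. The class name, the feature dictionary layout, and the sample page are assumptions; copyright extraction and the classifier training stage are omitted.

```python
from html.parser import HTMLParser


class PageFeatureExtractor(HTMLParser):
    """Collect keywords, description, frame URLs, image URLs, and link texts."""

    def __init__(self):
        super().__init__()
        self.features = {"keywords": "", "description": "",
                         "frame_urls": [], "image_urls": [], "link_texts": []}
        self._in_link = False
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute values may be None when no value is given
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("keywords", "description"):
            self.features[name] = attrs.get("content") or ""
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.features["frame_urls"].append(attrs["src"])
        elif tag == "img" and attrs.get("src"):
            self.features["image_urls"].append(attrs["src"])
        elif tag == "a":
            self._in_link = True
            self._link_text = []

    def handle_data(self, data):
        if self._in_link:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.features["link_texts"].append("".join(self._link_text).strip())
            self._in_link = False


def extract_features(html: str) -> dict:
    parser = PageFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.features


# Hypothetical page fragment.
sample = ('<head><meta name="keywords" content="bank, login"></head>'
          '<body><img src="logo.png"><a href="/signin"> Log in </a></body>')
print(extract_features(sample))
```

The resulting dictionary would still have to be converted into numeric feature vectors before any classifier could be trained on it.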
Based on the assumption that a phishing webpage aims to mimic the appearance of the targeted page (i.e., the legitimate page), some approaches focus on identifying phishing webpages based on visual similarity. Most of the visual-similarity-based phishing detecting methods divide the webpages into different parts and then compare the visual similarity between each pair of parts in the legitimate and suspected webpages. For example, a visual-similarity-based phishing detecting method can measure the visual similarity between two webpages in three aspects: block-level similarity, layout-level similarity, and overall style similarity. At the block level, it considers the texts and images in the webpage and extracts features for all blocks in the webpage. The weighted average of the visual similarities of all matched block pairs between the two pages is calculated and taken into consideration for similarity identification. The layout similarity measures how far the layouts of the two webpages deviate from each other; it is defined as the ratio of the weighted number of matched blocks to the total number of blocks in the true webpage. The overall style similarity mainly focuses on the visual style of the webpage, including the font family, background colour, text alignment, and line spacing. The overall system uses the same technique of feature extraction and classification to detect phishing webpages. In another visual-similarity-based approach, each text element is quantified by a 6-value tuple (textual content, foreground colour, background colour, font size, name of the corresponding font family, and position in the page), and each image is quantified by a 5-value tuple (source attribute, dimension, colour histogram, 2D Haar wavelet transformation, and position in the page). Then, the visual similarity matching process is executed to identify visual similarity among similar aspects or tuples between the two webpages.
Among other approaches, human interactive proof methods rely on human recognition of images to detect phishing webpages. For all of the above visual-similarity-based approaches, the major challenges have been identifying the aspects and/or components to use for the visual similarity check, and executing the similarity checks for a large number of such aspects and/or components between two webpages in a time-efficient manner.
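The block-level and layout-level measures described above can be written out directly from their definitions. The sketch below assumes that block matching and per-pair similarity scoring have already been done by some other component; the function names and the input shapes are hypothetical.

```python
def block_level_similarity(matched_pairs):
    """Weighted average of the similarities of all matched block pairs.

    matched_pairs -- iterable of (similarity, weight) tuples, one per matched pair
    """
    pairs = list(matched_pairs)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return 0.0  # no matched blocks (or all weights zero)
    return sum(similarity * weight for similarity, weight in pairs) / total_weight


def layout_similarity(matched_block_weights, total_blocks_in_true_page):
    """Weighted number of matched blocks divided by the number of blocks
    in the true (legitimate) webpage."""
    if total_blocks_in_true_page == 0:
        return 0.0
    return sum(matched_block_weights) / total_blocks_in_true_page


# Hypothetical scores for two matched block pairs on a five-block page.
print(block_level_similarity([(1.0, 2.0), (0.5, 2.0)]))
print(layout_similarity([1.0, 1.0, 0.5], 5))
```

Even in this reduced form, the cost grows with the number of blocks that must be matched and scored, which is the time-efficiency challenge noted above.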
There is a need for systems and methods that use the overall structural information, the overall visual information, or both, of webpages to detect phishing. Such systems and methods would overcome the challenges of the existing visual-similarity-based and content-based phishing detecting methods.