With the increasing use of computer-based devices and systems such as desktops, smartphones, tablets, smart televisions, networks, and the Internet for both personal and commercial purposes, and with the continuing growth of the World Wide Web and of IPv6-enabled networks, comes a proliferation of threats that jeopardize the secure use of these devices and systems. For example, users of network-enabled computer-based devices such as desktops, laptops, smartphones, tablets, and smart televisions are exposed to a variety of risks, including financial fraud, loss of privacy, loss of critical information, and other threats generated by malicious software. These threats constantly evolve to avoid detection. As they evolve, threat research monitors and analyzes new software applications and network activities in order to defend against them. One specific type of threat to these systems is phishing: a site that, without permission, purports to act on behalf of a third party with the intention of confusing viewers into performing an action that they would entrust only to a true agent of that third party.
The uptime of phishing sites is relatively short. For example, the median uptime of phishing sites in 2010, as determined by the Anti-Phishing Working Group (APWG), was around 12 hours. Every day, phishing sites are detected and taken offline while new phishing sites are brought online. This ongoing competition between the creators of phishing sites and those who combat phishing results in continuous adaptation in the design and setup of phishing sites. The challenge in the fight against phishing is to keep up with changing phishing strategies while maintaining a high detection rate at a low cost. Both the detection rate and the cost of detection are important factors in keeping the financial incentive for creating phishing sites below some threshold and, thus, in controlling the extent of phishing. To handle ever-evolving phishing technology efficiently and economically, a detection system is needed that can adapt to the changing environment largely automatically and with a short lag time.
The automatic adaptation of known phishing detection systems is limited by the information they utilize and by the preprocessing applied to that information. In particular, current detection systems utilize a predefined, constant subset of the available information, which limits their ability to adapt to changes that are not reflected in that subset. In addition, current systems preprocess the information they utilize, and this preprocessing is based on some understanding or prior knowledge of how phishing sites currently work. For example, the URLs of phishing sites tend to contain more forward slashes than those of non-phishing sites. Current detection systems exploit this prior knowledge by counting the number of forward slashes in the URL and using this count as a feature. Most likely, phishing sites will adapt over time, rendering the number of forward slashes useless as a signal of a phishing site.
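
As an illustration only, the kind of hand-crafted feature described above might be sketched in Python as follows. The function name count_forward_slashes and the example addresses are hypothetical and are not taken from any particular detection system. The sketch assumes that the two slashes of a scheme separator such as "http://" should not be counted.

```python
def count_forward_slashes(address: str) -> int:
    """Hypothetical hand-crafted feature: number of '/' in a site address.

    Assumption: the slashes in a scheme separator ("://") are ignored,
    so only slashes after the host portion begins are counted.
    """
    # Split off the scheme, if one is present.
    _, separator, rest = address.partition("://")
    target = rest if separator else address
    return target.count("/")


# Example addresses are made up for illustration.
print(count_forward_slashes("https://example.com/"))  # 1
print(count_forward_slashes("http://example.org/secure/bank/login/verify/index.html"))  # 5
```

A feature of this kind encodes a fixed assumption about how phishing sites look today. Once phishing sites stop exhibiting the pattern, the count carries no useful signal, and the feature has to be redesigned by hand.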