Spam hosts are web hosts that deliver unsolicited advertisements and/or email messages. These spam hosts exist solely to get advertisements in front of users.
A variety of methods have been proposed to detect spam hosts. Depending on how the spam detection decision rules are constructed, existing detection methods can typically be separated into two categories: expert system based and supervised learning based. In the expert system based solutions, experts encode rules that are then used to decide whether a new host is a spammer or not. Such rules may be derived from payload properties, from traffic properties, from a combination of both, or from properties of the infrastructure hosting spammers, such as botnet memberships. Major drawbacks of systems using this approach are: lack of adaptability to future traffic, since traffic characteristics change over time; lack of portability to new networks, since specific characteristics can be very different across networks; limited coverage of the decision rules, since an expert can provide rules only for the subset of cases he or she understands; and difficulty in modifying the rules, since modifications are done manually.
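As an illustration of the expert system based category only, the following minimal sketch shows hand-written rules over payload, traffic, and infrastructure properties. The feature names in `HostStats`, the rule names, and every threshold are hypothetical and chosen purely for illustration; they are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    """Hypothetical per-host features; field names are illustrative only."""
    messages_per_hour: float   # traffic property
    distinct_recipients: int   # traffic property
    url_fraction: float        # payload property: share of messages with a URL
    in_known_botnet: bool      # infrastructure property


# Hand-written rules with thresholds an expert might pick for one network.
# All thresholds are assumptions made for this sketch.
RULES = [
    ("high volume to many recipients",
     lambda h: h.messages_per_hour > 500 and h.distinct_recipients > 200),
    ("mostly URL-bearing payloads",
     lambda h: h.url_fraction > 0.9 and h.messages_per_hour > 50),
    ("botnet member",
     lambda h: h.in_known_botnet),
]


def matched_rules(host):
    """Return the names of all rules that fire for this host."""
    return [name for name, predicate in RULES if predicate(host)]


def is_spammer(host):
    """A host is flagged when at least one expert rule fires."""
    return bool(matched_rules(host))


# A bulk sender that the rules were written for.
bulk_sender = HostStats(800, 350, 0.2, False)
# A low-rate sender that falls outside every threshold above.
low_rate_sender = HostStats(40, 30, 0.95, False)
```

In this sketch `bulk_sender` is flagged, while `low_rate_sender` matches no rule even though nearly all of its messages carry URLs. Covering it would require someone to edit the thresholds by hand, which is the coverage and maintenance limitation described above.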
The supervised learning based solution is more flexible and overcomes many of the shortcomings of the expert systems approach. There are also approaches that identify spammers based on IP-level clusters, in which hosts are clustered based on their IP addresses alone; these approaches apply only to hosts that share part of their IP address with known spammers.
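The IP-level clustering idea can be sketched as grouping hosts by a shared address prefix. The sketch below assumes a fixed /24 prefix as the cluster key, which is a simplification chosen for illustration; the function names `prefix_of` and `flag_by_prefix` are hypothetical, and the addresses come from the reserved documentation ranges.

```python
import ipaddress


def prefix_of(ip, prefix_len=24):
    """Return the network that contains ip at the given prefix length."""
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def flag_by_prefix(known_spammers, candidates, prefix_len=24):
    """Flag each candidate that shares a prefix with a known spammer."""
    spam_prefixes = {prefix_of(ip, prefix_len) for ip in known_spammers}
    return {ip: prefix_of(ip, prefix_len) in spam_prefixes
            for ip in candidates}


flags = flag_by_prefix(
    known_spammers=["192.0.2.10"],
    candidates=["192.0.2.77", "198.51.100.5"],
)
```

Here only the candidate in the same /24 as the known spammer is flagged. A host at an unrelated address is never flagged, whatever its behavior, which reflects the limited applicability noted above.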
In the supervised learning approach, one begins with a collection of hosts, each labeled as a spammer or a non-spammer, together with the hosts' traffic patterns (the training data), and automatically learns a decision rule to classify new hosts as spammers or not. This works well as long as new hosts have traffic patterns similar to those in the training data, but performance deteriorates when new traffic patterns emerge, whether those patterns correspond to spammers or non-spammers. The deterioration occurs because conceptual categories are not learned; rather, one learns to classify only the categories whose instances are provided in the initial training data. The other drawback is the requirement of a well-curated collection of spammer hosts and their behaviors, which is hard to obtain in practice. The identities of certain spammers may be known while their traffic patterns are not. At the other end of the spectrum, the traffic patterns of many hosts can be observed, but one may not know whether those hosts are spammers.
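The dependence on the training data can be illustrated with a deliberately simple classifier. The sketch below uses a nearest-centroid rule as a stand-in for any supervised method; the two features (messages per hour, distinct recipients) and all sample values are invented for illustration and do not come from real traffic.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n
                 for i in range(len(points[0])))


def train(labeled):
    """Learn one centroid per label from (features, label) pairs."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}


def predict(model, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Illustrative training data: (messages per hour, distinct recipients).
training = [
    ((900, 400), "spammer"),
    ((700, 300), "spammer"),
    ((20, 5), "non-spammer"),
    ((40, 10), "non-spammer"),
]
model = train(training)

familiar = predict(model, (850, 380))   # resembles the training spammers
unfamiliar = predict(model, (60, 55))   # hypothetical low-rate spammer
```

With these values the familiar pattern is labeled a spammer, while the low-rate host is labeled a non-spammer simply because it lies closer to the non-spammer centroid. Nothing in the training data represents its category, so the learned rule cannot recognize it.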
Existing solutions assume either strong a priori knowledge of fixed characteristics of spamming hosts or explicit knowledge of which hosts are spammers. Solutions based on either assumption are not realistic, because spammer characteristics change over time and it may not be known which hosts are truly spammers. Spammer detection therefore remains a challenging and open problem.