The advent of global communications networks such as the Internet has presented commercial opportunities for reaching vast numbers of potential customers. Electronic messaging, and particularly electronic mail (“email”), is becoming increasingly pervasive as a means for disseminating unsolicited, undesired bulk messages to network users (also denoted as “spam”) including advertisements and promotions, for example.
Despite many efforts with respect to reduction and prevention, spam continues to be a major problem. According to industry estimates today, billions of email messages are sent each day and over seventy percent are classified as spam. Individuals and entities (e.g., businesses, government agencies) are becoming increasingly inconvenienced and oftentimes offended by junk messages. Furthermore, spam is forcing businesses to pay enormous amounts (billions of dollars worldwide) for internal messaging infrastructure and support staff. As such, spam is becoming a major threat to trustworthy computing and electronic messaging.
A significant technique utilized to thwart spam is employment of filtering systems/methodologies. One proven filtering technique is based upon a machine learning approach. More particularly, machine learning filters are employed to assign a probability to an incoming message indicative of whether the message is spam or non-spam. Conventionally, pre-classified messages are utilized to train a filter to discriminate probabilistically between message types. For example, a group of users can be polled to facilitate labeling of messages as spam or non-spam. Once trained, the filter or associated learning model can be employed to classify messages.
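By way of illustration only, the following minimal sketch shows how pre-classified messages might be used to train a filter that assigns a spam probability to an incoming message. It assumes a naive Bayes model with add-one smoothing over whitespace-separated words, which is one common choice and not necessarily the learning model contemplated above. The function names and the four training messages are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, whitespace-separated words are the only features.
    return text.lower().split()


def train(labeled_messages):
    """Count words per class from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    message_counts = {True: 0, False: 0}
    for text, is_spam in labeled_messages:
        message_counts[is_spam] += 1
        word_counts[is_spam].update(tokenize(text))
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, message_counts, vocabulary


def spam_probability(model, text):
    """Return the estimated probability that `text` is spam."""
    word_counts, message_counts, vocabulary = model
    total_messages = sum(message_counts.values())
    log_scores = {}
    for label in (True, False):
        # Log prior with add-one smoothing.
        score = math.log((message_counts[label] + 1) / (total_messages + 2))
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denominator)
        log_scores[label] = score
    # Normalize the two log scores into a probability.
    largest = max(log_scores.values())
    spam_weight = math.exp(log_scores[True] - largest)
    ham_weight = math.exp(log_scores[False] - largest)
    return spam_weight / (spam_weight + ham_weight)


# Hypothetical pre-classified training messages (True = spam).
training_data = [
    ("win free money now", True),
    ("free prize claim now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
model = train(training_data)
```

With this toy training set, `spam_probability(model, "free money")` comes out above 0.5 and `spam_probability(model, "monday meeting")` below 0.5. A practical filter would be trained on a far larger labeled corpus.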
There are two main types of filters utilized, namely content-based filters and internet protocol (IP) address-based filters. As the name suggests, content-based filters are trained to analyze message content or text such as words and phrases in the subject and/or body of a message to facilitate identification of spam. IP address-based filters learn about addresses associated with messages with respect to a set of training data. Subsequently during classification, the filter extracts an IP address from a message and infers whether it is spam.
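As a rough sketch of the IP address-based approach, the following example learns per-address spam counts from training data and then, during classification, extracts an address from a message and infers a spam probability. It assumes that the sending address appears in square brackets in the first Received header and that a smoothed spam ratio per address is an adequate model. Both are simplifying assumptions, and all names and addresses shown are hypothetical.

```python
import re
from collections import defaultdict
from email import message_from_string

# Assumption: the relaying host's IPv4 address appears as "[a.b.c.d]".
IP_PATTERN = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def extract_sender_ip(raw_message):
    """Return the first bracketed IPv4 address found in a Received header."""
    message = message_from_string(raw_message)
    for header in message.get_all("Received", []):
        match = IP_PATTERN.search(header)
        if match:
            return match.group(1)
    return None


def train_ip_filter(labeled_messages):
    """Build {ip: [spam_count, total_count]} from (raw_message, is_spam) pairs."""
    counts = defaultdict(lambda: [0, 0])
    for raw_message, is_spam in labeled_messages:
        ip = extract_sender_ip(raw_message)
        if ip is None:
            continue
        counts[ip][1] += 1
        if is_spam:
            counts[ip][0] += 1
    return dict(counts)


def ip_spam_probability(counts, raw_message):
    """Smoothed spam ratio for the message's address; 0.5 if never seen."""
    ip = extract_sender_ip(raw_message)
    spam_count, total_count = counts.get(ip, (0, 0))
    return (spam_count + 1) / (total_count + 2)


def make_message(ip, body):
    # Hypothetical helper that builds a minimal message for the example.
    return (
        f"Received: from mail.example.org ([{ip}]) by mx.example.com\n"
        f"Subject: example\n\n{body}\n"
    )


training_messages = (
    [(make_message("203.0.113.7", "buy now"), True)] * 3
    + [(make_message("198.51.100.4", "status report"), False)] * 2
)
ip_counts = train_ip_filter(training_messages)
```

Because the sketch keys only on the address, a sender that switches to an address absent from the training data receives the neutral score of 0.5.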
Unfortunately, spammers have adapted to the onslaught of spam filtering techniques by finding ways to disguise their identities to avoid and/or bypass spam filters. Thus, conventional content-based and IP address-based filters are becoming ineffective in recognizing and blocking disguised spam messages. Moreover, simply training such spam filters to be more aggressive is not an adequate solution, as this technique results in a higher volume of false positives, where legitimate messages are labeled as spam.
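The trade-off noted above can be illustrated with a small sketch. It assumes that "more aggressive" filtering corresponds to lowering the probability threshold above which a message is labeled as spam, which is only one possible reading. The scores and labels below are invented purely for illustration and are not measured results.

```python
def confusion_at_threshold(scored_messages, threshold):
    """Return (spam_caught, false_positives) for a given spam threshold."""
    spam_caught = sum(
        1 for score, is_spam in scored_messages if is_spam and score >= threshold
    )
    false_positives = sum(
        1 for score, is_spam in scored_messages if not is_spam and score >= threshold
    )
    return spam_caught, false_positives


# Hypothetical (spam probability, true label) pairs; True = spam.
scored_messages = [
    (0.95, True), (0.80, True), (0.55, True), (0.40, True),
    (0.60, False), (0.45, False), (0.20, False), (0.05, False),
]

for threshold in (0.9, 0.5, 0.3):
    caught, false_pos = confusion_at_threshold(scored_messages, threshold)
    print(f"threshold={threshold}: spam caught={caught}, false positives={false_pos}")
```

On this invented data, lowering the threshold from 0.9 to 0.3 raises the number of spam messages caught from 1 to 4, but also raises the number of legitimate messages labeled as spam from 0 to 2.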