Email spam is a growing problem for the Internet community. Spam interferes with valid email, and it burdens both email users and email service providers (ESPs). Not only is it a source of annoyance, it also adversely affects productivity and translates to significant monetary costs for the email industry (e.g., reduced bandwidth, increased storage requirements, and the cost of supporting filtering infrastructures). In addition, for some categories of spam, such as phishing scams, the financial costs for users may be even greater due to fraud and theft.
Generally, spam-filtering techniques can be divided into three broad categories: filtering based on sender reputation, filtering based on email-header analysis, and filtering based on an analysis of message content. In the first category, a sender-based reputation framework, senders are classified as either “spammers” or “good senders,” based on criteria such as the sender's identity, the sender's domain, or the sender's IP address. The second category, email-header spam filtering, is based on detecting forgery in the email header and distinguishing the forgery from malformatting and other legitimate explanations, such as those resulting from forwarding activity.
The third category, analysis of message content, has been of particular interest to the machine learning community. Machine learning systems that apply a classifier to spam detection use both batch-mode and online update models. Training a classifier in batch mode allows the use of a wide range of algorithms and optimization of performance over a large quantity of training data. However, unless the classifier is frequently retrained, the system may quickly fall prey to adversarial attacks. Online learning approaches, on the other hand, allow for immediate incorporation of user feedback into the filtering function, but tend to be more difficult to tune, and the number of efficient algorithms is limited. In either approach, changes to the classification function may require a significant number of new examples, especially if the amount of data used to derive the current models was already very large. The diversity of messages within a spam campaign may be too low to adjust the filtering function effectively and quickly enough. It is therefore convenient to consider augmenting the operation of a conventional spam filter with one that tracks high-volume spam campaigns and attempts to eliminate those mailings only.
Another problem in automating spam classification is the lack of a consensus definition for spam. What some people consider spam may be considered solicited mail by others. Some email service providers allow users to mark emails they consider spam and report them to their ESP. In some cases, users can also report the opposite error, i.e., when legitimate email is mistakenly classified as spam. However, because user reports rely upon personalized definitions of spam, the cost to a large ESP of incorporating each individual's judgments into the filtering system may outweigh the benefits. Nevertheless, spam reports provided by users, as well as other forms of data acquisition, have been used to build and validate spam detection systems.
Of particular interest is the use of such data to track spam campaigns sent in volume over defined periods of time, with a spam campaign assumed to consist of highly similar and often near-duplicate messages. In that context, when many users report nearly identical emails as spam, one can reasonably label a campaign as spam based on the volume of user reports received. A key requirement for the success of such a scheme is the ability to identify emails belonging to the same campaign, despite small or irrelevant differences (some tactically inserted by the spammer to complicate detection). The problem can otherwise be described as near-duplicate message detection, which has received considerable attention in the field of information retrieval, and as near-replica (and sometimes exact-replica) message detection in the email domain.
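By way of illustration only, labeling a campaign from the volume of user reports might be sketched as follows. The class name CampaignReports, the campaign and user identifiers, the counting of distinct reporters, and the threshold of three are all assumptions made for this sketch and are not drawn from any particular system.

```python
from collections import defaultdict


class CampaignReports:
    """Hypothetical tally of spam reports per campaign (illustrative only)."""

    def __init__(self, threshold=3):
        # Assumed policy: a campaign is labeled spam once this many
        # distinct users have reported it.
        self.threshold = threshold
        self._reporters = defaultdict(set)  # campaign id -> set of user ids

    def report(self, campaign_id, user_id):
        """Record one user report and return the campaign's current label."""
        self._reporters[campaign_id].add(user_id)
        return self.is_spam(campaign_id)

    def is_spam(self, campaign_id):
        """True once enough distinct users have reported the campaign."""
        return len(self._reporters.get(campaign_id, ())) >= self.threshold


reports = CampaignReports(threshold=3)
for user in ("alice", "alice", "bob"):
    reports.report("campaign-42", user)   # repeat reports count once
print(reports.is_spam("campaign-42"))     # two distinct reporters so far
print(reports.report("campaign-42", "carol"))
```

Counting distinct reporters rather than raw reports is one possible design choice; it keeps a single user from labeling a campaign alone, but it does nothing to decide which messages belong to the same campaign, which is the near-duplicate detection problem described above.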
In summary, a duplicate-based spam detector decomposes each message into one or more fingerprints or signatures, and uses them for indexing, as well as for computing message similarity. Operationally, a few signature-based hash-table lookups are used to determine whether highly similar messages have been labeled spam and to act on an incoming message accordingly (i.e., signature-based deduplication). Fingerprinting algorithms differ in the attributes they use for signature computation (e.g., direct message content, message blocks, and subsets of text features), and in the number of signatures per message (i.e., the number of different fingerprinting algorithms applied). Given message signatures, clustering techniques can be used to verify cluster membership. That is, once a cluster signature becomes known (e.g., via user reports), it is easy to determine whether an arbitrary message falls into the same cluster. Signature-based deduplication is a form of clustering in which the stream of all incoming emails is clustered to identify high-density spikes in the content distribution, which are likely to correspond to spam campaigns.
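As a minimal sketch of signature-based lookup, the following example derives a few signatures per message and checks them against a hash table of signatures already labeled spam. The particular fingerprint used here (the smallest SHA-1 hashes of three-word shingles), the names signatures and SignatureIndex, and the parameters k, keep, and min_hits are assumptions chosen for illustration; they stand in for whichever fingerprinting algorithm a given detector applies.

```python
import hashlib
import re


def signatures(text, k=3, keep=4):
    """Return up to `keep` signatures: the smallest hashes of word k-shingles."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Overlapping runs of k words; short messages yield a single shingle.
    shingles = {" ".join(words[i:i + k])
                for i in range(max(1, len(words) - k + 1))}
    hashes = sorted(hashlib.sha1(s.encode("utf-8")).hexdigest()[:16]
                    for s in shingles)
    return set(hashes[:keep])


class SignatureIndex:
    """Hypothetical hash table mapping known signatures to a label."""

    def __init__(self):
        self._labels = {}  # signature -> label

    def mark_spam(self, text):
        """Index every signature of a message that was labeled spam."""
        for sig in signatures(text):
            self._labels[sig] = "spam"

    def lookup(self, text, min_hits=2):
        """True if at least `min_hits` signatures match indexed spam."""
        hits = sum(1 for sig in signatures(text)
                   if self._labels.get(sig) == "spam")
        return hits >= min_hits


index = SignatureIndex()
reported = "Win a free cruise today, click the link below to claim your prize now"
index.mark_spam(reported)
print(index.lookup(reported))
print(index.lookup("Minutes from the quarterly budget meeting are attached"))
```

Each lookup costs only a handful of hash-table probes regardless of how many messages have been indexed. Whether a lightly altered copy of a reported message still matches depends on how many of its selected signatures survive the alteration, which is why the choice of attributes and the number of signatures per message matter.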
The prior art methods may not recognize a spam campaign at an early enough stage to adequately reduce the response time of spam filtering systems, and may not adequately incorporate user feedback. Moreover, the prior art methods may not perform automatic maintenance of a reliable user set. Therefore, the prior art systems may not satisfactorily reduce the costs that users and systems incur.
The disclosed embodiments are directed to overcoming one or more of the problems set forth above.