Email spam is a growing problem for the Internet community. Spam interferes with valid email, and it burdens both email users and email service providers (ESPs). Not only is it a source of annoyance, but it also adversely affects productivity and translates to significant monetary costs for the email industry (e.g., reduced bandwidth, increased storage requirements, and the cost of supporting filtering infrastructures). In addition, for some categories of spam, such as phishing scams, the financial costs for users may be even greater due to fraud and theft.
Generally, spam-filtering techniques can be divided into three broad categories: filtering based on sender reputation, filtering based on email-header analysis, and filtering based on analysis of message content. In the first category, a sender-based reputation framework, senders are classified as either “spammers” or “good senders,” based on criteria such as the sender's identity, the sender's domain, or the sender's IP address. The second category, email-header spam filtering, is based on detecting forgery in the email header and distinguishing the forgery from malformatting and other legitimate explanations, such as those resulting from forwarding activity. The third category, analysis of message content, typically relies on a machine learning classifier for spam detection, with both batch-mode and online update models in use.
Content analysis using machine learning classification has several disadvantages, including vulnerability to adversarial attacks and difficulty in tuning and changing the classification functions. The diversity of messages within a spam campaign may be too low to effectively adjust the filtering function quickly enough. Another problem in automating spam classification is the lack of a consensus definition of spam. What some people consider spam may be considered solicited mail by others. Some ESPs allow users to mark emails they consider spam and report them to the ESP in so-called “TIS” (this is spam) reports. In some cases, users can also report the opposite error, i.e., legitimate email mistakenly classified as spam, by submitting so-called “TINS” (this is not spam) reports. However, because user reports rely upon personalized definitions of spam, the value of each individual's judgments may be questionable. Moreover, not all reports are submitted in good faith. For example, many TINS reports may be generated by spammers seeking to legitimize their own spam. Spammers may also submit TIS reports to identify legitimate mail as being spam, in an effort to confuse traditional spam filters. Traditional spam filtering systems and methods therefore may not satisfactorily identify those entities whose spam reports should be trusted. As a result, traditional spam filtering techniques may fail to sufficiently reduce the costs that users and systems incur as a result of spamming.
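By way of illustration only, the vulnerability described above can be sketched with a hypothetical filter that treats every TIS and TINS report as equally trustworthy. The function name `naive_verdict`, the majority-vote rule, and the report counts are all assumptions made for this example; they do not describe any particular ESP's system.

```python
from collections import Counter


def naive_verdict(reports):
    """Classify a message by simple majority vote over user reports.

    `reports` is an iterable of strings, each either "TIS" or "TINS".
    Every report counts equally, regardless of who submitted it
    (an assumption made to illustrate the problem).
    """
    counts = Counter(reports)
    return "spam" if counts["TIS"] > counts["TINS"] else "not spam"


# Hypothetical data: three genuine users flag a campaign message as spam.
genuine_reports = ["TIS", "TIS", "TIS"]

# Hypothetical data: the spammer submits five TINS reports
# from accounts under its own control.
spammer_reports = ["TINS"] * 5

verdict_before = naive_verdict(genuine_reports)
verdict_after = naive_verdict(genuine_reports + spammer_reports)

print(verdict_before)
print(verdict_after)
```

In this sketch, the verdict on the same message changes once the spammer's reports are added, because the tally has no way to tell which reporters deserve trust.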
The embodiments of the present disclosure are directed to overcoming one or more of the problems set forth above by providing systems and methods for creating and updating reputation records and for filtering electronic messages.