In the field of network communications, more particularly Internet network communications, spam messaging in the form of unsolicited e-mail has become increasingly prevalent, targeting both commercial and private consumers. Spamming, generally defined, is the process of sending mass unsolicited messages to network users in the form of e-mail messaging or other text messaging.
There are a variety of known technologies that have rather recently been developed to fight spam messaging, and these are collectively known in the art as spam filtering. Typical prior-art spam filtering techniques rely on the presence of some common and/or unusual traits in spam messages and attempt to classify messages as spam messages according to detection and sometimes analysis of those traits.
Arguably the most prevalent existing spam filtering systems are software applications that detect pre-compiled keywords, or in some cases phrases, that are known to appear in spam messages. These structured text-based filters look for keywords or phrases that appear in e-mail headers, subject lines, and message bodies. There are Bayesian filters, statistical filters, whitelist and blacklist filters, and heuristic filters that perform a number of tests on messages and compare the resulting weighted values against a pre-defined weight threshold. Many of these filters can be trained by fine-tuning. For example, manually selecting a message that has escaped filtering and marking that message as spam can add new parameters to the filter criteria so that similar messages will be detected and identified as spam in the future.
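By way of illustration only, the weighted-threshold approach described above can be sketched as follows. The phrases, weights, threshold value, and function names are hypothetical and are not taken from any particular prior-art filter; the sketch merely shows a score being summed from matched phrases, compared against a pre-defined threshold, and extended by manual training.

```python
# Hypothetical phrase weights and threshold, chosen only for illustration.
RULES = {
    "free money": 2.5,
    "act now": 1.5,
    "winner": 1.0,
    "unsubscribe": 0.5,
}
THRESHOLD = 3.0


def spam_score(subject, body):
    """Sum the weights of every known phrase found in the subject or body."""
    text = f"{subject}\n{body}".lower()
    return sum(weight for phrase, weight in RULES.items() if phrase in text)


def is_spam(subject, body, threshold=THRESHOLD):
    """Classify a message as spam when its score reaches the threshold."""
    return spam_score(subject, body) >= threshold


def train(phrase, weight):
    """Mimic manual training: register a phrase taken from a missed spam message."""
    RULES[phrase.lower()] = weight
```

In this sketch, a message that escapes filtering can be handled by calling `train` with a phrase from that message, after which similar messages score above the threshold.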
Spam filtering is typically performed locally (usually at a user's station) by software installed thereon or, in many cases, as a service at the server side of a user's connection by server-based software, as is the case with most Web-based e-mail servers. Often there are software components at both sides of a communications link. There are also private and public databases (blacklists) containing identification information of known spam senders. Blacklisting occurs when spam is discovered and can involve listing parameters about the spammer such as an IP address, a company name and address, or URLs that are known to be spam related.
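A blacklist comparison of the kind described above might be sketched as follows. The blacklist entries, which use addresses and domain names reserved for documentation, and the function names are assumptions made for illustration and do not reflect the contents or interface of any real blacklist database.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries (documentation-reserved address and domain).
BLACKLISTED_IPS = {"203.0.113.7"}
BLACKLISTED_DOMAINS = {"spam.example"}


def host_of(url):
    """Return the lower-cased host name of a URL, or an empty string if none."""
    return (urlparse(url).hostname or "").lower()


def is_blacklisted(sender_ip, urls):
    """Check the sender address and any URLs in a message against the blacklists."""
    if sender_ip in BLACKLISTED_IPS:
        return True
    for url in urls:
        host = host_of(url)
        # Match the listed domain itself or any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in BLACKLISTED_DOMAINS):
            return True
    return False
```

Such a lookup only recognizes senders and URLs that have already been listed, which is why it depends on spam first being discovered.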
As the process of spamming evolves, spam-sending entities are, at the time of this writing, developing many methods and tools that focus on ways to get around conventional filtering protections. For example, keywords and phrases that might be subject to filtering by text-based parsing and comparison to known words or phrases are masked using hidden characters that are machine readable but do not appear to a human recipient. Keywords are often intentionally misspelled, and phrases are often reworded or rearranged. Spammers also insert characters into message headers, message bodies, or URL strings in an attempt to hide from conventional filtering systems. Filtering for phrases and phrase variations is also time consuming and processing intensive, and therefore not completely practical for most applications.
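The following minimal sketch illustrates how the masking techniques described above defeat a simple keyword comparison. The keyword list and sample messages are hypothetical, and the zero-width space is used here only as one example of a character that is machine readable but not visible to a human recipient.

```python
# A zero-width space: present in the text, but not rendered for a human reader.
ZERO_WIDTH_SPACE = "\u200b"

# Hypothetical keyword list, for illustration only.
KEYWORDS = ("free money",)


def naive_match(text):
    """Return True if any known keyword appears verbatim in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


plain = "Get free money now"
# Same visible text, with hidden characters inserted inside the keyword.
hidden = "Get fr" + ZERO_WIDTH_SPACE + "ee mo" + ZERO_WIDTH_SPACE + "ney now"
# Intentional misspelling that a human still reads as the same phrase.
misspelled = "Get fr3e m0ney now"

results = {
    "plain": naive_match(plain),
    "hidden": naive_match(hidden),
    "misspelled": naive_match(misspelled),
}
```

Only the unaltered message is matched; covering every such variation would require the filter to anticipate each one in advance.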
Spammers also use well-known spoofing methods to hijack trusted machines, uniform resource locators (URLs), or domain names of trusted sources, and sometimes set up fraudulent (counterfeit) Web sites for interaction, the Web sites emulating those sources. Real contact information is often masked to foil automated location attempts, but must be left intact enough to facilitate receipt of user monies, or user participation with respect to the goal of the spammer. Thus, one thing that is common to essentially all spam messages is some parameter that directs a recipient's participation, whether it be a postal address for sending money, a URL for directing recipients to a Web site, a telephone number to call, or some combination of the above.
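To make the notion of a participation-directing parameter concrete, the sketch below pulls URLs and telephone-number-like strings out of a message body. The regular expressions and the function name are simplifying assumptions for illustration; they are not presented as the method of any particular filtering system and would not handle deliberately masked contact information.

```python
import re

# Simplified, hypothetical patterns; real messages vary far more widely.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\-\s().]{7,}\d")


def response_parameters(body):
    """Collect the URLs and phone-like strings a recipient is directed to use."""
    return {
        "urls": URL_RE.findall(body),
        "phones": [match.strip() for match in PHONE_RE.findall(body)],
    }
```

For example, calling `response_parameters` on a body that contains one Web address and one toll-free number returns those two items under the `urls` and `phones` keys respectively.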
Some state-of-the-art spam filters can remove an impressive percentage of spam mail before the mail is deposited into a user inbox, up to 90% or more in some cases. However, spammers, knowing that a good percentage of their messages will be intercepted before reaching a user, typically just increase the number of messages originally sent to ensure that the 5 or 10% that make it through remain an adequate amount for their purposes. In a given spam campaign, the actual messages themselves are often altered slightly from message to message so that there are differences among messages in the same batch. In this way spammers increase the percentage of their messages that ultimately escape the filtering process.
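The arithmetic behind this compensation can be sketched as follows. The delivery target of 1,000 messages used in the note below is a hypothetical figure; the interception rates of 90% and 95% correspond to the 10% and 5% pass-through figures mentioned above.

```python
def messages_to_send(target_delivered, blocked_percent):
    """Return how many messages must be sent so that, after the given
    percentage is intercepted, at least target_delivered get through."""
    passed_percent = 100 - blocked_percent
    if passed_percent <= 0:
        raise ValueError("a filter that blocks everything cannot be outpaced")
    # Integer ceiling division avoids floating-point rounding.
    return (target_delivered * 100 + passed_percent - 1) // passed_percent
```

Under these assumptions, reaching 1,000 inboxes requires sending 10,000 messages when 90% are intercepted and 20,000 when 95% are intercepted, so a higher interception rate alone only raises the volume sent.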
A drawback to commercial spam filtering processes is that often a percentage of non-spam messages are identified as spam. In addition, many commercial and consumer-based solutions must be manually trained and fine-tuned on a continuing basis by a user, or by an administrator in the commercial case, in order to achieve adequate filtering percentages. Moreover, typical text filtering techniques can detect information in a message only if that information is already known to the system, for example, stored in an internal database for comparison.
Conventional filtering techniques do not eliminate the problems described above altogether. As such, in some cases, those unwanted messages that do get through must be manually sorted and deleted from regular message queues. On the other hand, legitimate messages may be mistakenly marked as spam, which may require manually locating and recovering those messages. The time required per user to deal with even a small percentage of spam messages that escape filtering, or of desired messages that are mistakenly filtered, becomes a noticeable financial problem when multiplied across the large number of users working for a large company, for example.
Therefore, what is needed is a system for filtering spam messages that overcomes some of the above-described problems.