Electronic messages have become an indispensable part of modern communication. Electronic messages such as email or instant messages are popular because they are fast, easy, and have essentially no incremental cost. Unfortunately, these advantages of electronic messages are also exploited by marketers who regularly send out unsolicited junk messages. The junk messages are referred to as “spam”, and spam senders are referred to as “spammers”. Spam messages are a nuisance for users. They clog people's inboxes, waste system resources, often promote distasteful subjects, and sometimes sponsor outright scams.
Personalized statistical search is a technique used by some systems for detecting and blocking spam messages. Personalized statistical searchers typically depend on users to sort the messages into categories. For example, the users may put spam messages into a junk folder and keep good messages in the inbox. The spam protection program periodically updates the personalized statistical searcher by processing the categorized messages. When a new message comes in, the updated statistical searcher determines whether the incoming message is spam. The updating of the personalized statistical searcher is typically done by finding the tokens and features in the messages and updating a score or probability associated with each feature or token found in the messages. There are several techniques that are applicable for computing the score or probability. For example, if “cash” occurs in 200 of 1,000 spam messages and in 3 of 500 non-spam messages, the spam probability associated with the word is (200/1000)/(3/500+200/1000)=0.971. A message having a high proportion of tokens or features associated with high spam probability is likely to be a spam message.
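The “cash” calculation above can be reproduced with a short sketch. This is an illustration only, not the method of any particular spam protection program. The function names `token_spam_probability` and `high_probability_share`, the neutral value of 0.5 for a token that has never been seen, and the 0.9 threshold are all assumptions introduced for the example.

```python
def token_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Estimate a token's spam probability from categorized message counts.

    spam_hits / spam_total: spam messages containing the token / all spam messages.
    ham_hits / ham_total: non-spam messages containing the token / all non-spam messages.
    """
    if spam_total <= 0 or ham_total <= 0:
        raise ValueError("both message totals must be positive")
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        # Assumption: a token seen in neither category is treated as neutral.
        return 0.5
    return spam_freq / (ham_freq + spam_freq)


def high_probability_share(tokens, probabilities, threshold=0.9):
    """Return the fraction of a message's tokens whose spam probability
    is at or above the threshold (the threshold value is an assumption)."""
    if not tokens:
        return 0.0
    # Tokens missing from the table get the assumed neutral value 0.5.
    flagged = sum(1 for t in tokens if probabilities.get(t, 0.5) >= threshold)
    return flagged / len(tokens)


# "cash" occurs in 200 of 1,000 spam messages and in 3 of 500 non-spam messages.
p_cash = token_spam_probability(200, 1000, 3, 500)
print(round(p_cash, 3))  # 0.971
```

A message could then be judged by passing its tokens and a table of per-token probabilities to `high_probability_share`. How that share is turned into a final spam decision depends on the technique chosen and is not specified here.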
Personalized statistical searches have been gaining popularity as a spam fighting technique because of several advantages. Once trained, the spam filter can detect a large proportion of spam effectively. Also, the filters adapt to learn the type of words and features used in both spam and non-spam. Because they consider evidence of spam as well as evidence of good email, personalized statistical searches yield few false positives (legitimate non-spam email messages that are mistakenly identified as spam). Additionally, the filters can be personalized so that a classification is tailored for the individual. However, personalized statistical searchers also have several disadvantages. Since their training requires messages that are categorized by the users, they are typically deployed on the client, and are not well suited for server deployment. Also, classifying email messages manually is a labor intensive process, and is therefore not suitable for deployment at the corporate level where large numbers of messages are received. It would be desirable to have statistical searches that do not depend on manual classification by users, and are suitable for server deployment and corporate level deployment.