Separating spam from legitimate email typically involves a statistical analysis of each email message to assess the likelihood that it is spam, based upon features extracted from the message. A spam classification program is trained by using both a spam corpus and a non-spam, or “clean,” corpus to determine feature set probabilities. Pre-determined features are extracted from the email messages in both corpora and used to train the classification engine, which becomes sensitive to relative feature differences between spam and clean training messages. Then, during execution, the spam classifier extracts features from unclassified email messages and computes the relative likelihoods that the extracted features indicate a spam message versus a clean message.
Typical classification techniques produce some form of numerically continuous likelihood ratio, or spam confidence factor, that contrasts the likelihood that the extracted feature vector originated from a spam message with the likelihood that it originated from a clean message. This likelihood ratio is then compared against a decision threshold to produce the final discrete classification of spam or clean. More specifically, in current practice, the email message's numerically continuous spam likelihood ratio, denoted L(msg), is compared to the decision threshold, denoted th, and a decision is made by a simple rule of the form:
if L(msg)>th then “msg is spam” else “msg is clean”
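The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the sample ratio and threshold values are hypothetical and are not taken from any particular spam filter.

```python
def classify(likelihood_ratio: float, th: float) -> str:
    """Apply the static rule: spam if L(msg) > th, otherwise clean."""
    return "spam" if likelihood_ratio > th else "clean"


# Hypothetical likelihood ratios, compared against one fixed threshold.
static_th = 1.0
for ratio in (3.2, 0.4):
    print(f"L(msg) = {ratio}: msg is {classify(ratio, static_th)}")
```

Because the comparison is strict, a message whose ratio exactly equals th is classified as clean, matching the rule as written.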
The decision threshold, th, may have been determined during training, or it may have been set by a user or administrator through a user interface. Either way, in current practice the threshold value is static. This is easily verified for any specific spam filter, since the filter will always assign a given email message the same classification, spam or clean, independent of the relative mix of spam and clean email messages in the email stream being processed. This may seem intuitive: if a human were shown a sample email message and asked whether it was spam or clean, she typically would not ask to first study, say, the thousand email messages that preceded the sample, and then base her decision not only on the features of the sample email message but also on the running statistics of its containing message stream.
However, it can be shown statistically that a static threshold will produce good classification performance only where the relative proportions of spam and clean email messages remain fixed. Given the variety of deployment environments and the variability of email message streams and spamming activity, it is very unlikely that any fixed threshold value will produce optimal or near-optimal classifications. Further, it is unlikely that a non-expert in statistical decision theory could enter an optimal threshold value, or that even an expert in statistical decision theory would have the data needed to set the threshold optimally. It can be further shown statistically that better overall classification decisions will be made if the decision threshold accounts for the statistical properties of the email message stream being filtered.
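One way to see this dependence is the textbook Bayes decision rule, offered here as an illustration and not as a rule stated in the text. It assumes that L(msg) is a true likelihood ratio, p(features | spam) / p(features | clean), and that the proportion of spam in the stream is known. Under those assumptions, the expected misclassification cost is minimized by classifying a message as spam when L(msg) exceeds (P(clean) × cost of a false positive) / (P(spam) × cost of a false negative). The function name `bayes_threshold`, the cost parameters, and the sample proportions below are hypothetical.

```python
def bayes_threshold(spam_fraction: float,
                    cost_false_positive: float = 1.0,
                    cost_false_negative: float = 1.0) -> float:
    """Cost-minimizing threshold on a likelihood ratio (textbook Bayes rule).

    spam_fraction is the assumed proportion of spam in the stream, P(spam).
    A false positive is a clean message classified as spam; a false
    negative is a spam message classified as clean.
    """
    if not 0.0 < spam_fraction < 1.0:
        raise ValueError("spam_fraction must lie strictly between 0 and 1")
    clean_fraction = 1.0 - spam_fraction
    return (clean_fraction * cost_false_positive) / (
        spam_fraction * cost_false_negative)


# The same message, with a hypothetical ratio L(msg) = 2.0, scored against
# streams that contain different proportions of spam.
ratio = 2.0
for spam_fraction in (0.2, 0.5, 0.9):
    th = bayes_threshold(spam_fraction)
    label = "spam" if ratio > th else "clean"
    print(f"spam fraction {spam_fraction:.1f}: th = {th:.3f}, msg is {label}")
```

With equal costs, the threshold is 1 only when spam and clean messages are equally frequent. A threshold fixed at that value is therefore too low for a stream that is mostly clean and too high for a stream that is mostly spam.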
What is needed are methods, systems and computer-readable media for dynamically and automatically adjusting a spam classification decision threshold in response to varying ratios of spam and clean email in a stream. Providing this functionality would improve spam classifier performance, reduce misclassification costs, lower administrative burden, and ensure more consistent user satisfaction across diverse deployment environments with varying traffic mixes.