1. Field of the Invention
The present invention generally relates to identification of content by metadata. The present invention more specifically relates to identification of spam content in electronic messages using metadata.
2. Description of the Related Art
Electronic mail is a commonly used mode of communication. Because electronic mail is relatively easy and inexpensive to use, it has also become a mode of delivering unsolicited commercial messages (i.e., spam). While various anti-spam applications are available to lessen the impact of these unsolicited commercial messages, senders of such messages have a seemingly equal number of means of circumventing such applications.
One approach used to combat spam is to identify spam and quarantine it apart from the legitimate messages that an individual wishes to receive. Some anti-spam applications use “thumbprints” to identify spam; thumbprints are digital signatures used to represent a known spam message. A problem with using thumbprint signatures is that such signatures may not be robust to changes in the spam: even a small alteration to a message may produce a different signature.
For example, a signature may be developed for a particular spam message that includes contact information for an advertised service, such as a phone number or an e-mail address. But if a portion of the spam message is subsequently altered (e.g., the phone number or the e-mail address is changed), the signature may no longer be useful to identify that message as spam. A spammer who wishes to circumvent such a system may simply alter some aspect of the message, thereby making any previous identifications of that message as spam largely, if not wholly, inapplicable. As such, further proactive identification of such a spam message at different locations may be fruitless.
The alteration need not be extensive to avoid identification. Misspelling or omitting certain words, or introducing new information, may not change the overall context or intent of the message, but may be more than enough to result in an altered thumbprint signature, making the message unidentifiable to a particular anti-spam application. As such, spammers often engage in such techniques to circumvent anti-spam applications.
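The sensitivity of thumbprint signatures to minor alterations can be illustrated with a minimal sketch. It assumes, purely for illustration, that a thumbprint is a SHA-256 digest of the lowercased, whitespace-normalized message text; the function names, the sample messages, and the choice of hash are hypothetical and are not taken from any particular anti-spam application.

```python
import hashlib


def thumbprint(message: str) -> str:
    """Hypothetical thumbprint: SHA-256 of lowercased, whitespace-collapsed text."""
    normalized = " ".join(message.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Illustrative messages: the second differs only in the phone number.
known_spam = "Cheap loans today! Call 555-0100 now."
altered_spam = "Cheap loans today! Call 555-0199 now."

# Signatures of messages previously identified as spam.
blocklist = {thumbprint(known_spam)}


def is_known_spam(message: str) -> bool:
    """Return True only if the message's thumbprint matches a stored signature."""
    return thumbprint(message) in blocklist


if __name__ == "__main__":
    print(is_known_spam(known_spam))    # the original message matches
    print(is_known_spam(altered_spam))  # the altered message does not
```

Under these assumptions, changing two digits of the phone number yields an unrelated digest, so the altered message is not matched even though its intent is unchanged.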
Images appearing in spam messages are particularly difficult to identify and are especially prone to changes. Minor changes to an image may alter the thumbprint signature while still clearly conveying the intended message. Changes to an image may include cropping, resizing, color variation, skewing, and adding random noise. Proposals for identifying such images notwithstanding such changes have included extracting robust image features and/or using Fourier and wavelet transformations. The success rate of these alternatives is debatable, especially because implementing them is complicated, time-consuming, and costly. There is a need in the art for identifying content as spam or an otherwise unsolicited and unwanted electronic message.