The number of unsolicited bulk emails (also known as “spam”) transmitted via the Internet has grown consistently over the past decade, with some researchers now estimating that more than 80% of all email is spam. Spam emails annoy consumers, consume precious network bandwidth and resources, and, in some cases, may be used as a vehicle for committing fraud.
Certain types of spam, including but not limited to advance-fee fraud (also known as “419 fraud”), are often typified by poor spelling, punctuation, and/or grammar. For example, an advance-fee fraud message may begin with the sentence “Queen Elizabeth of Englandd , in her generosity have noted the overflowing of her coffers.” In this example, the author of the message incorrectly added an extra space before the comma, used poor grammar (“Queen Elizabeth . . . have”), and misspelled the word England (“Englandd”).
While spam-detection software may detect any one of the above mistakes fairly easily, a number of real-world implementation pitfalls may prevent spam-detection software from accurately detecting such errors without producing false positives. For example, legitimate emails often include misspelled words (e.g., “judgement” [sic] vs. “judgment”). Similarly, the punctuation used in various elements of a legitimate email (e.g., URLs such as “live.in” and numbers such as “4,000”) may, despite being used correctly, appear incorrect within the context of English words and phrases. In addition, conventional spam-detection software may incorrectly classify non-English text (e.g., “hola”) as misspelled, potentially producing false positives. As such, the instant disclosure identifies a need for systems and methods for extracting suitable text signatures from a spam message in order to accurately and reliably identify future instances and/or variations of the spam message.
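By way of illustration only, the punctuation pitfall described above may be sketched in Python as follows. The sketch is not the disclosed method. The spacing rule, the regular expressions, and the function names (naive_punctuation_errors, masked_punctuation_errors) are hypothetical and are assumed here solely to show how a naive check may flag “live.in” and “4,000,” and how masking URL-like and number-like tokens before the check may avoid those false positives. Misspellings and non-English text are not addressed.

```python
import re

# Hypothetical spacing rule (an assumption for illustration): punctuation
# should not be preceded by whitespace and should not be followed
# directly by a non-space character.
SPACE_BEFORE = re.compile(r"\s+[,.;:!?]")
NO_SPACE_AFTER = re.compile(r"[,.;:!?](?=\S)")

# Hypothetical patterns for tokens whose internal punctuation is legitimate.
URL_LIKE = re.compile(
    r"\b(?:https?://)?[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b(?:/\S*)?"
)
NUMBER_LIKE = re.compile(r"\b\d{1,3}(?:,\d{3})+(?:\.\d+)?\b|\b\d+\.\d+\b")


def naive_punctuation_errors(text):
    """Count spacing violations around punctuation, with no context."""
    return len(SPACE_BEFORE.findall(text)) + len(NO_SPACE_AFTER.findall(text))


def masked_punctuation_errors(text):
    """Replace URL-like and number-like tokens, then apply the same rule."""
    masked = URL_LIKE.sub("URL", text)
    masked = NUMBER_LIKE.sub("NUM", masked)
    return naive_punctuation_errors(masked)


if __name__ == "__main__":
    # The example sentence, written with the extra space before the comma.
    spam = ("Queen Elizabeth of Englandd , in her generosity have noted "
            "the overflowing of her coffers.")
    legit = "Visit live.in to see over 4,000 listings."
    for label, text in (("spam", spam), ("legitimate", legit)):
        print(label, naive_punctuation_errors(text),
              masked_punctuation_errors(text))
```

In this sketch the naive check reports errors for the legitimate sentence, while the masked check reports none and still counts the stray space in the spam sentence. The masking is deliberately crude: a genuine error such as a missing space after a period (e.g., “end.Next”) would be treated as URL-like and go uncounted.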