This invention is related in general to processing of digital information and more specifically to the sending, delivery, analysis and other processing of electronic mail (email) messages.
Although email has become immensely popular and is a huge benefit for many users, today's email systems are also plagued by an increasing volume of unwanted mail, referred to as “spam.” Spam has grown so large in proportion to desired email that systems are now sought to defeat its sending and delivery. Typically, email is transferred over networks such as home or small-area networks, local-area networks (LANs), wide-area networks (WANs) and, ultimately, global networks such as the Internet. Although email represents the most popular general information exchange mechanism, the problem of unwanted information can arise in any type of information transfer over a digital network, such as instant messaging, chat, newsgroups, file transfers, etc.
Spam is often difficult to detect because, in a broad sense, it is merely information that a recipient does not want. The analysis of an email message can attempt to determine the contents and meaning of the message, the quantity in which it was sent (i.e., whether it is a “bulk” message), its sender, recipient, delivery path, and other characteristics in order to classify the message as spam. However, spam senders, or “spammers,” are aware of such analysis techniques and use different tactics to make messages difficult to analyze automatically. Such “obfuscation” is designed to convey one message to a human reader but provide a different representation to a process executing on a machine. For example, to prevent certain words from being recognized by a process while remaining recognizable to a human, one tactic is to use slightly different spellings of a word, such as “viagaraaa” instead of “viagra”. Another tactic is to include invisible character codes in a message: the codes have no visible effect on the displayed message, yet they appear as characters that are taken into consideration by an analysis process.
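The two tactics described above can be illustrated with a minimal sketch. It is offered only as an illustration of the problem and is not the analysis method of any particular system. The function names, the choice of Unicode format characters as the "invisible" codes, and the 0.85 similarity threshold are all assumptions introduced for this example.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def strip_invisible(text: str) -> str:
    """Drop Unicode "format" (Cf) code points, e.g. zero-width space U+200B.

    Assumption: the invisible codes of interest fall in category Cf.
    Real messages may hide text in other ways (markup, styling, etc.).
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def collapse_repeats(word: str) -> str:
    """Reduce any run of one repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)


def looks_like(token: str, keyword: str, threshold: float = 0.85) -> bool:
    """Hypothetical fuzzy test: is `token` a likely respelling of `keyword`?

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    a = collapse_repeats(strip_invisible(token).lower())
    b = collapse_repeats(keyword.lower())
    return SequenceMatcher(None, a, b).ratio() >= threshold


# A zero-width space splits the word for a naive substring test,
# although a human reader sees the word unchanged.
hidden = "vi\u200bagra"
naive_hit = "viagra" in hidden
cleaned_hit = "viagra" in strip_invisible(hidden)

# A deliberate misspelling defeats exact matching but not the fuzzy test.
respelled_hit = looks_like("viagaraaa", "viagra")
unrelated_hit = looks_like("garden", "viagra")
```

In this sketch the naive substring test misses the word containing a zero-width space, while the same test succeeds once format characters are removed. The misspelled token is caught only by the fuzzy comparison, which also shows the trade-off: a looser threshold catches more respellings but risks flagging unrelated words.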
Thus, it is desirable to provide features for text and message analysis that work effectively even on obfuscated text and messages.