1. Field of the Invention
The present invention relates to identification of spam in the text of email messages and, more particularly, to identification of spam in emails using a one-pass algorithm based on histograms and lexical vectors.
2. Description of the Related Art
Spam emails have become a veritable scourge of modern email systems. It has been estimated that as much as 80-90% of Internet email traffic today is spam. Spam, in addition to being annoying and wasteful of the recipient's time, places a considerable burden on large email service providers and on corporate networks. For a ‘regular user,’ the ‘cost’ of spam that gets through is the few clicks that it takes to delete the offending message. For large-scale email providers, such as Google, Yahoo, and Microsoft, as well as for large corporations that have their own server-based solutions for spam filtering, handling spam is a problem that needs to be solved on an industrial scale. For example, such large mail service providers need to filter millions of spam messages every hour.
One phenomenon observed recently is the increasing professionalism of spam generators. Many of the techniques used by spam generators closely mirror, and borrow from, techniques used by professional virus writers. It has been estimated that at any given moment, millions of computers connected to the Internet are ‘zombified’. In other words, these computers spew out vast numbers of spam emails without the knowledge of their owners.
Although, in the early days of the spam ‘epidemic,’ it was possible to filter spam by looking for certain keywords, such as ‘Viagra,’ ‘Hoodia,’ ‘free offer’ and so on, modern spam has evolved far beyond such simple and easily filterable examples. Also, particularly for large email service providers and corporate email servers, such spam filtering needs to be done more or less on-the-fly, or within at most a few seconds; it would be unacceptable if the spam filters delayed receipt of the email by any significant amount of time.
Generally, in the industry, a false positive is regarded as a much greater evil than letting through some number of spam messages, since very often, an email that was falsely identified as spam by the spam filter will never be seen by its intended recipient, or, at best, would be seen much later.
In general, many present methods for spam identification have not been fully successful. For example, attempts to filter by the source addresses of bulk email distributors have not proven successful. Such filters also impose a heavy monitoring burden, since the listings of bulk mailers must be kept up to date. Similarly, sorting emails by keywords can only be partially successful, as new mailers and new messages can avoid or obfuscate the keywords.
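The weakness of keyword-based sorting can be illustrated with a minimal Python sketch. The function name `naive_keyword_filter` and the two sample messages are hypothetical and chosen for illustration only; the keywords are the examples named above. This is not the method of the present invention, only a sketch of the conventional approach it is contrasted with.

```python
# Hypothetical keyword list, using the example keywords named in the text.
SPAM_KEYWORDS = ("viagra", "hoodia", "free offer")


def naive_keyword_filter(text: str) -> bool:
    """Return True if any keyword appears verbatim (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SPAM_KEYWORDS)


# Illustrative messages (assumptions, not taken from real traffic).
plain = "Buy Viagra today, free offer inside"
obfuscated = "Buy V1agra today, fr.ee 0ffer inside"

print(naive_keyword_filter(plain))       # True: keywords match verbatim
print(naive_keyword_filter(obfuscated))  # False: trivial obfuscation evades the list
```

A single character substitution or inserted punctuation mark is enough to defeat a verbatim match, which is why a keyword list alone can be only partially successful.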
As mentioned above, a spam cure can be worse than the disease when an intended recipient does not get an important email because it is incorrectly identified as spam. Accordingly, there is a need in the art for an effective and precise method of identifying spam text in emails using a fast and efficient one-pass algorithm.