1. Field of the Invention
The present invention relates generally to the classification of text on the Internet and, more specifically, to the detection of spam.
2. Description of the Related Art
Content available on the Internet has varying degrees of reliability and relevance to users. Organizing this information is relatively challenging, as an enormous amount of content is published on websites, with new content being published at a very high rate. Faced with this volume of information, users have increasingly come to use services, such as search engines and content aggregators, that classify content. Such classification includes detecting and filtering spam appearing in websites, user comments, and business listings. To detect spam, these services often examine content for certain strings of characters known to correlate with spam, designating, for instance, a website with the text “cheap prescription drugs” as likely spam designed to fill search engine results with content of low relevance to users.
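By way of illustration only, the keyword-detection technique described above might be sketched as follows. The pattern list and the function name looks_like_spam are hypothetical and are not drawn from any particular service; an actual service would be expected to maintain a far larger and more nuanced set of patterns.

```python
import re

# Hypothetical list of phrases known to correlate with spam (assumption:
# a real service would maintain its own, much larger, set of patterns).
SPAM_PATTERNS = [
    re.compile(r"cheap prescription drugs", re.IGNORECASE),
]


def looks_like_spam(text: str) -> bool:
    """Return True if any known spam pattern occurs anywhere in the text."""
    return any(pattern.search(text) for pattern in SPAM_PATTERNS)
```

Under these assumptions, a call such as looks_like_spam("Buy cheap prescription drugs here") designates the text as likely spam, while text containing none of the listed phrases is not so designated.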
In response to these keyword-detection techniques, spammers have modified their approach. Some spammers publish text designed to convey different messages to a human reader and to a computer by exploiting similar-appearing characters in different character sets, for example, by relying on letters in the Greek or Cyrillic alphabets that appear similar to those used in the Latin alphabet. By replacing Latin characters with similar-appearing Greek or Cyrillic characters, the spammer creates text that conveys the intended message to a human being, while failing to match the text patterns by which spam is detected. Thus, a spammer might convey the message that their website has cheap prescription drugs by replacing the Latin “a” in the word “cheap” with the Cyrillic letter “а” (U+0430), which appears nearly identical to a human reader but is a different character to a computer, so that the text no longer matches a regular expression designed to match the word “cheap.”
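The evasion described above may be reproduced, purely as an illustrative sketch, with a short example. The names CHEAP, matches_cheap, and char_names are hypothetical; the only assumption about the spammer's text is that the Latin “a” has been replaced by the Cyrillic small letter a (U+0430).

```python
import re
import unicodedata

# Regular expression designed to match the word "cheap" in Latin letters.
CHEAP = re.compile(r"cheap")

latin = "cheap"
# Same visible word, but with U+0430 (Cyrillic small letter a) in place
# of the Latin "a". Written with an escape so the difference is explicit.
spoofed = "che\u0430p"


def matches_cheap(text: str) -> bool:
    """Return True if the Latin-letter pattern occurs in the text."""
    return CHEAP.search(text) is not None


def char_names(text: str) -> list[str]:
    """Return the Unicode name of each character, exposing mixed scripts."""
    return [unicodedata.name(ch) for ch in text]
```

In this sketch, matches_cheap(latin) succeeds while matches_cheap(spoofed) does not, even though the two strings typically render almost identically. Inspecting char_names(spoofed) shows that the fourth character is named CYRILLIC SMALL LETTER A rather than LATIN SMALL LETTER A.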