1. Field of the Invention
The present invention relates generally to data processing, and more particularly but not exclusively to anti-spam and web page identification.
2. Description of the Background Art
E-mail provides a convenient, fast, and relatively cost-effective way of sending messages to a large number of recipients. It is thus no wonder that solicitors, such as some advertisers, use e-mail to indiscriminately send messages to e-mail accounts accessible over the Internet. These unsolicited e-mails, also referred to as “junk mail” or “spam”, are not only a nuisance but also translate to lost time and money, as employees or home users are forced to segregate them from legitimate e-mails. Anti-spam programs and services are commercially available to help users identify and remove spam. Some anti-spam programs use heuristic rules to identify spam. The rule creator examines multitudes of spam, finds patterns among them, and creates rules that look for those patterns in received e-mails. Finding patterns requires grouping the same or similar spam, which is a time-consuming process that often requires manual intervention. Identification of web pages involves tasks similar to those of spam detection, and thus faces similar problems. Therefore, techniques for facilitating grouping and identification of documents, such as e-mails and web pages, are generally desirable.