An increasing proportion of communications that have traditionally been carried out by means of paper documents are now carried out by means of electronic documents. In many cases it is desirable to sort, classify, or group such documents. One example of such a category of electronic documents is unsolicited or "junk" mail, an increasingly annoying problem that may consume a considerable amount of an e-mail recipient's time to process. Delivering it also consumes networking bandwidth, server storage, and processing power.
There are a number of partial prior art solutions to this problem, in particular in the context of unsolicited messages. All of these solutions are based on some form of logic that correlates messages by the values or the semantics of some of their fields. The following list includes a set of such solutions:
US 2002/0116641: Method and apparatus for providing automatic e-mail filtering based on message semantics, sender's e-mail ID and user's identity;
U.S. Pat. No. 7,089,241: Classifier tuning based on data similarities;
U.S. Pat. No. 6,996,606: Junk mail rejection system;
U.S. Pat. No. 6,868,436: Method and system for filtering unauthorized electronic mail messages;
U.S. Pat. No. 7,016,939: Intelligent spam detection system using statistical analysis;
U.S. Pat. No. 6,769,016: Intelligent spam detection system using an updatable neural analysis engine;
U.S. Pat. No. 6,732,157: Comprehensive anti-spam system, method and computer program product for filtering unwanted e-mail messages;
U.S. Pat. No. 6,507,866: e-mail usage pattern detection;
U.S. Pat. No. 6,484,197: Filtering incoming e-mail;
U.S. Pat. No. 6,453,327: Method and apparatus for identifying and discarding junk electronic mail;
U.S. Pat. No. 6,421,709: e-mail filter and method thereof;
U.S. Pat. No. 6,393,465: Junk electronic mail detector and eliminator;
U.S. Pat. No. 6,249,805: Method and system for filtering unauthorized electronic mail messages;
U.S. Pat. No. 6,199,103: Electronic mail determination method and system and storage medium;
U.S. Pat. No. 6,161,130: Technique which utilizes a probabilistic classifier to detect junk e-mail by automatically updating a training and retraining the classifier based on the updated training set;
U.S. Pat. No. 6,112,227: Filter-in method for reducing junk mail;
U.S. Pat. No. 6,023,723: Method and system for filtering unwanted junk e-mail utilizing a plurality of filtering mechanisms;
U.S. Pat. No. 5,999,932: System and method for filtering unsolicited electronic mail messages using data matching and heuristic processing;
U.S. Pat. No. 5,619,648: Message filtering techniques;
GB 02347053A: Proxy server filters unwanted emails;
EP 00813162A2: Method and apparatus for identifying and discarding junk electronic mail;
EP 00720333A2: Message filtering techniques;
IPCOM000016360D: Methodology for Automatic Mail processing;
IPCOM000020428D: Spam Bot Email Evader (SPEE);
IPCOM000137923D: The method for avoiding the needless mail; and
The Tumbleweed MailGate Product Suite: processing of the image content of a message, in which the message is determined to be unsolicited if it contains an image, or is sent as an image, that is similar to a previously identified image in a junk mail message.
More recent spamming techniques, which are not satisfactorily handled by the prior art, exhibit the following characteristics:
1. A massive number of email addresses used for sending spam mails;
2. Different domains used for sending spam mails;
3. Different sending server machines;
4. Different subject lines; and
5. Different textual content.
None of the above solutions is able to handle this style of spamming. Further, junk mail attacks are becoming fiercer with the introduction of specialized service providers that run different campaigns at the same time for different advertising clients, so that the textual content changes constantly. Hence, there is a need for a complementary method that is textual content-independent, semantics-independent, and field-value-independent.