Document clustering and classification techniques can provide an overview of a document collection or identify a set of documents that meet certain criteria, amplifying or detecting certain patterns within their content. In some applications these techniques are used to filter unwanted email, and in other applications they support effective search and storage strategies. An identification strategy may, for example, divide documents into clusters so that the documents in a cluster are similar to one another and less similar to documents in other clusters, based on a similarity measurement. The process of clustering and classification is referred to as labeling. In demanding applications labeling can greatly improve the efficiency of an enterprise, especially for storage and retrieval applications, provided that it is stable, fast, efficient, and accurate.
Users of information technology must deal every day with countless unwanted emails, unwanted text messages, and crippling new viruses and worms. This largely unnecessary volume of network traffic decreases worker productivity and slows down important network applications. One of the most serious problems in today's digital economy is the increasing volume of spam. Recipients of email, as well as service providers, therefore need effective solutions to reduce its proliferation on the World Wide Web. However, as spam detection becomes more sophisticated, spammers invent new methods to circumvent detection. For example, one prior art methodology provides a centralized database that maintains signatures of documents having identified attributes, against which emails are compared. Spammers, however, now modify the content of their email either slightly or randomly, so that the message itself remains intelligible but evades detection under the various anti-spam filtering techniques currently employed.
Currently, at least 30 open relays dominate the world, bursting messages at different rates and different levels of structural variation. Because certain types of email mutate or evolve, as exemplified by spam, spam-filtering detection algorithms must constantly adjust to be effective. In the case of spam email, for example, the very nature of the spam corpus undergoes regime changes. Therefore, clustering optimality depends heavily on the nature of the data corpus and the changes it undergoes.
The objective of an effective scheme for detecting documents having identified attributes, or for classifying email, is to find similar messages. Many kinds of email form clusters: spam, e-vites, mailing lists, emails that are forwarded many times within the enterprise, and targeted mailings from websites. What the enterprise then does with these clusters is left to the discretion of the enterprise. As such, it is essential to define a clear notion of a metric on the space of the clusters. In plain words, given two electronic data streams, each associated with a document, a system must be able to produce a number, referred to as “the distance”, that describes in some meaningful way how similar or close the two messages are.
To work effectively as a detector for different categories of email, a classifier must establish the parameters of the distance function and the threshold carefully. For instance, if the threshold is too high, the classifier will treat dissimilar messages as similar and produce too many false positives. If the threshold is chosen unreasonably low, the classifier will instead fail to group messages that are in fact similar.
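The effect of the threshold can be illustrated with a minimal sketch. It assumes the convention that two messages are treated as similar when their distance is below the threshold; the distances and labels are invented for illustration, and `count_errors` is a hypothetical helper, not part of any described system.

```python
# Hypothetical (distance, truly_similar) pairs; the values are made up.
pairs = [
    (0.05, True), (0.10, True), (0.20, True),
    (0.35, False), (0.50, False), (0.80, False),
]

def count_errors(pairs, threshold):
    """Return (false_positives, false_negatives).

    A pair is predicted 'similar' when its distance is below the threshold.
    """
    false_pos = sum(1 for d, similar in pairs if d < threshold and not similar)
    false_neg = sum(1 for d, similar in pairs if d >= threshold and similar)
    return false_pos, false_neg

# Sweep a low, a moderate, and a high threshold over the same data.
for t in (0.02, 0.30, 0.90):
    fp, fn = count_errors(pairs, t)
    print(f"threshold={t:.2f}  false positives={fp}  false negatives={fn}")
```

On this toy data the lowest threshold misses every similar pair, the highest accepts every dissimilar pair, and only the intermediate value separates the two groups.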
Generally the choice of a metric is dictated by the choice of classifier or filter. With regard to filters, given two messages, X and X′, a system can generate electronic data streams or signature arrays, (f1, f2, . . . , fn) and (f′1, f′2, . . . , f′n), where n is the number of filters.
Comparing these two arrays requires specification of a metric and a threshold for each filter in the set. Thus, for example, the two messages belong to the same spam class if and only if all of the following conditions hold simultaneously:

d(f1, f′1) < τ1
d(f2, f′2) < τ2
. . .
d(fn, f′n) < τn

where τ1, . . . , τn are the numerical threshold values suitably chosen for each of the filters and d represents a function of the distance between two filter values.
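The conjunction of per-filter conditions can be sketched as follows. This is an illustrative sketch only: `same_class` and `abs_diff` are hypothetical names, and the absolute difference stands in for whatever distance function d a real filter would use.

```python
from typing import Callable, Sequence

def abs_diff(a: float, b: float) -> float:
    """Assumed stand-in for the distance function d between two filter values."""
    return abs(a - b)

def same_class(
    sig_x: Sequence[float],
    sig_y: Sequence[float],
    thresholds: Sequence[float],
    distance: Callable[[float, float], float] = abs_diff,
) -> bool:
    """True only if d(f_i, f'_i) < tau_i holds for every filter i."""
    if not (len(sig_x) == len(sig_y) == len(thresholds)):
        raise ValueError("signatures and thresholds must have the same length")
    return all(
        distance(f, f_prime) < tau
        for f, f_prime, tau in zip(sig_x, sig_y, thresholds)
    )

# Three filters: every condition must hold at once.
print(same_class([1, 5, 9], [2, 5, 8], thresholds=[2, 1, 2]))
# Tightening one threshold breaks the conjunction.
print(same_class([1, 5, 9], [2, 5, 8], thresholds=[1, 1, 2]))
```

Because the conditions are combined with a logical AND, a single filter whose distance reaches its threshold is enough to place the two messages in different classes.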
The Nilsimsa Signature (“NS”) method counts occurrences of 3-grams with the same hash values in the message body of an email and serves as an example of one spam detection scheme [see http://ixazon.dvnip.com/~cmeclax/nilsimsa.html]. NS ignores text mutations because they are deemed statistically irrelevant. Given a message, NS produces a sequence of 64 hexadecimal numbers. Two signatures are compared bit by bit. For instance, if NS consisted of only four bits, the distance between the two codes 1001 and 0001 would be 1, because only the first bits of the two signatures differ. An important aspect of NS is that changing the input text does not change all of the entries in the code. In the method of the present invention, small changes to the input text cause all of the entries in the code to change.
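The four-bit example corresponds to a Hamming distance, which can be sketched as below. The function name is hypothetical, and this shows only the bitwise comparison of two codes, not how NS computes its signature.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))

# The two four-bit codes from the text differ only in their first bit.
print(hamming_distance("1001", "0001"))  # 1
```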
One method to detect spam computes an MD5 hash value of the message. This is found to be relatively ineffective as a means for identifying spam because once a message is changed slightly, the hash value changes significantly. Although this method works effectively for identifying absolutely identical messages, the nature of spam detection evasion means that the senders of spam will continue to incorporate differences that produce significantly different MD5 outcomes (a simple permutation of two letters in a “pure” message will render an MD5 check completely useless for purposes of spam detection).
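The sensitivity of MD5 to a two-letter permutation can be observed with Python's standard `hashlib` module. The message text below is invented for illustration.

```python
import hashlib

# Hypothetical message and a copy with two adjacent letters swapped
# ("offer" -> "ofefr"); the text is still readable to a human.
original = "Limited time offer: buy now and save"
mutated = "Limited time ofefr: buy now and save"

digest_original = hashlib.md5(original.encode("utf-8")).hexdigest()
digest_mutated = hashlib.md5(mutated.encode("utf-8")).hexdigest()

# Count how many of the 32 hexadecimal digits differ between the digests.
differing = sum(a != b for a, b in zip(digest_original, digest_mutated))

print(digest_original)
print(digest_mutated)
print(f"{differing} of {len(digest_original)} hex digits differ")
```

An exact-match lookup against a database of known digests would therefore treat the mutated message as entirely new.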
Using several filters reduces false positives by subjecting email messages to more scrutiny. If each of its 64 values were considered a single filter, NS could be viewed as a set of filters. A metric might then be defined as the number of differing bits for each separate number. The drawback of such a method is that the distance thus defined cannot exceed four, because the numbers are hexadecimal and each hexadecimal digit encodes only four bits. Each filter by itself is quite primitive. The method might be made more robust by increasing the radix, either by changing the algorithm or by joining adjacent filters into a larger filter and thus decreasing the number of filters.
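A minimal sketch of this per-filter metric follows, using short made-up signatures in place of a real 64-value NS code. The function name and its `width` parameter are assumptions introduced for illustration; `width=2` models joining adjacent filters into a larger one.

```python
from typing import List

def filter_distances(sig_a: str, sig_b: str, width: int = 1) -> List[int]:
    """Split two hex strings into filters of `width` hex digits each and
    return the number of differing bits per filter."""
    if len(sig_a) != len(sig_b) or len(sig_a) % width != 0:
        raise ValueError("signatures must have equal length divisible by width")
    distances = []
    for i in range(0, len(sig_a), width):
        value_a = int(sig_a[i:i + width], 16)
        value_b = int(sig_b[i:i + width], 16)
        # XOR leaves a 1 bit wherever the two values differ.
        distances.append(bin(value_a ^ value_b).count("1"))
    return distances

sig_a, sig_b = "0f3a", "f03b"  # toy signatures, not real NS output

# One hex digit per filter: no distance can exceed 4.
print(filter_distances(sig_a, sig_b, width=1))  # [4, 4, 0, 1]
# Adjacent digits joined into bytes: fewer filters, range up to 8.
print(filter_distances(sig_a, sig_b, width=2))  # [8, 1]
```

Joining filters widens the range of each distance at the cost of having fewer independent conditions to test.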
Even when several filters are deployed, the problem remains that thresholds must be chosen. Unfortunately, no systematic approach exists to choose the threshold even for one filter, other than a heuristic process of visually inspecting how emails are tagged as similar to one another and determining by trial and error what produces an acceptable detection of documents having identified attributes. When several filter values must be taken into account, filtering depends on all the thresholds and, therefore, finding the optimal thresholds through such a trial and error process presents a formidable undertaking. Three solutions are possible: (1) hard-code the thresholds into the software; (2) perform optimization checks manually and regularly; and/or (3) perform optimization automatically.
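To make the difficulty concrete, the third option can be sketched as a brute-force grid search over candidate thresholds. This is a generic illustration, not the method of the invention; the labelled distances, the candidate range, and the function names are all assumptions.

```python
from itertools import product

# Hypothetical training data: per-filter distances for a pair of messages,
# and whether the two messages are truly similar.
labelled = [
    ((1, 0), True),
    ((2, 1), True),
    ((1, 3), False),
    ((4, 1), False),
    ((5, 5), False),
]

def errors(thresholds, labelled):
    """Count pairs misclassified when 'similar' means every distance < its threshold."""
    wrong = 0
    for distances, similar in labelled:
        predicted = all(d < t for d, t in zip(distances, thresholds))
        wrong += predicted != similar
    return wrong

def grid_search(labelled, candidates):
    """Try every combination of candidate thresholds and keep the best one."""
    n_filters = len(labelled[0][0])
    return min(
        product(candidates, repeat=n_filters),
        key=lambda thresholds: errors(thresholds, labelled),
    )

best = grid_search(labelled, candidates=range(1, 7))
print(best, errors(best, labelled))
```

With c candidate values and n filters the search evaluates c^n combinations, which illustrates why tuning many thresholds by trial and error quickly becomes impractical.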
To succeed with the first solution, one has to solve the optimization problem based on the current spam trends. However, even if it were to succeed in filtering a certain type of email today, no guarantee exists that it will be adequate tomorrow. The second solution may prove laborious. The inventors, however, have determined a solution to the optimization problem for certain types of email recognition, which alleviates the shortcomings of the prior art by utilizing a combination of learning machines and special classes of uniform filters and stackable hash that allow a systematic investigation, determination, and optimization of thresholds to compare the similarity or identity of electronic data streams and place them into clusters or classifications. As pertains to the classification and clustering of electronic documents, the invention further strives to minimize preprocessing time and execution time of the computer processes while maximizing the stability and accuracy of results.