There has been a great deal of work on automatic content-based document search techniques and document classifiers for various applications. Some examples of prior art document classification mechanisms are listed below.
In U.S. Pat. No. 5276741, entitled "Fuzzy string matcher," an algorithm compares strings into which error has been introduced, using a measure of approximate similarity. However, the types of errors introduced do not include different orderings of the original message.
In U.S. Pat. No. 5375235, entitled "Method of indexing keywords for searching in a database recorded on an information recording medium," the matching technique employs a similarity measure based on keyword frequency. Senders who know which keywords are likely to be matched can simply avoid those word choices when they wish to present information against the wishes of the receiver. Therefore, this method cannot be used against knowing senders who want to avoid matching.
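A minimal sketch of the general idea of keyword-frequency similarity (cosine similarity over term counts) illustrates the weakness described above. This is an assumption-laden illustration, not the patented indexing method; the function name and texts are hypothetical.

```python
# Hypothetical sketch: cosine similarity over raw word counts.
# NOT the patented method -- only an illustration of keyword-frequency
# matching and how paraphrase drives the score to zero.
from collections import Counter
import math

def cosine(a, b):
    # Count words in each text, then compute the cosine of the
    # angle between the two count vectors.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    common = set(ca) & set(cb)
    dot = sum(ca[w] * cb[w] for w in common)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# A sender who avoids the expected keywords ("free offer") in favor of
# synonyms shares no terms with the query, so the score is zero.
```

A knowing sender thus needs no more than a thesaurus to evade any filter of this kind.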
In U.S. Pat. No. 5418951, entitled "Method of retrieving documents that concern the same topic," the document characterization algorithm uses a word n-gram weighting method. The method suffers from the same problem as the '235 patent: if the second party rearranges the message, the message characterization mechanism fails.
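The fragility of word n-gram characterization under reordering can be sketched as follows. This is an illustrative simplification under assumed parameters (unweighted bigrams, Jaccard overlap), not the patented weighting method.

```python
# Illustrative sketch (not the patented algorithm): characterize a text by
# its set of word bigrams, then compare two texts by set overlap (Jaccard).

def word_ngrams(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=2):
    ga, gb = word_ngrams(a, n), word_ngrams(b, n)
    return len(ga & gb) / max(len(ga | gb), 1)

original = "buy our product now. it is the best product available."
reordered = "it is the best product available. buy our product now."

# Swapping the two sentences destroys the bigrams that span the sentence
# boundary, so the overlap score drops below 1.0 even though the meaning
# is unchanged.
```

Larger rearrangements, or rearrangement at a finer granularity than the sentence, lower the score further at negligible cost to the sender.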
In U.S. Pat. No. 5276869, entitled "System for selecting document recipients as determined by technical content of document and for electronically corroborating receipt of document," a system creates profiles of documents for matching against profiles of documents of interest to a potential receiver. However, the method assumes a limited range of document types, specifically disclosures of inventions.
In U.S. Pat. No. 5701459, entitled "Method and apparatus for rapid full text index creation," a full text index creation algorithm is used. The method assumes no capricious or evasive reordering or rewording of text to evade searches.
In U.S. Pat. No. 5469354, entitled "Document data processing method and apparatus for document retrieval," a search method breaks the document into shorter character strings, which are used to build an index. However, the method is not sensitive to common phrases, and is an optimization technique for phrase-level searching rather than a searching technique in itself.
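The general character-string indexing approach can be sketched as an inverted index over fixed-length character substrings. The names, substring length, and sample documents below are assumptions for illustration only, not the patented apparatus.

```python
# Illustrative sketch (assumed parameters, not the patented method):
# split each document into fixed-length character strings and record,
# for each string, which documents contain it.
from collections import defaultdict

def char_grams(text, n=3):
    # All length-n character substrings of the text.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(docs, n=3):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_grams(text, n):
            index[gram].add(doc_id)
    return index

docs = {1: "document retrieval", 2: "data processing"}
index = build_index(docs)
# A phrase query is answered by intersecting the posting sets of its
# character strings -- fast, but blind to the phrases themselves.
```

Such an index accelerates substring lookup, but it assigns no significance to common phrases and so cannot by itself judge whether two documents convey the same content.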
In U.S. Pat. No. 5107419, entitled "Method of assigning retention and deletion criteria to electronic documents stored in an interactive information handling system," varying criteria for deletion are suggested. However, the method requires user response and input, and is not automatic.
In U.S. Pat. No. 5613108, entitled "Electronic mail processing system and electronic mail processing method," documents are classified and located within a file system by attempting to automatically determine the type. However, the method assumes that the senders are not trying to subvert the classification system.
Generally, prior art document classification approaches start from common assumptions about the motivations for content-based document search, indexing, and retrieval. These motivations, found on both the publishing side and the consumption side, are listed below.
The indexing and search techniques reflect, for the most part, searcher subject-matter interests, and a desire to find a uniquely fitting subset of documents. Most prior art document classification systems are not designed to avoid unsolicited documents, or to determine whether a given document is truly unique.
Generally, the information provider, out of concern for managing costs, maintaining profitability, and/or maintaining a reputation for courtesy, strongly desires that the document reach only interested audiences. The information provider therefore uses the automatic indexing service to improve the chances of the document being automatically identified by such audiences. Generally, the information provider is not interested in providing many copies of the document with insignificant variations, automatically or otherwise. Such copies could be taken by searchers as frivolous reproduction of essentially the same information, a cost in consumer time, and would require resources, such as disk space, that the publisher has to pay for.
Generally, prior art document classification systems assume that the documents have relatively little time-value, in the sense that they are expected to be stored for purposes of retrieval for periods of years, and usually need not be indexed, promoted, and propagated immediately. While a timely response from the search and retrieval system is important for attracting and retaining users, there is no real-time response requirement, especially for generating document indexes. Most such systems need not index documents before some real-world event in order to be of real value to information providers and their client searchers.
Document source text with original information is assumed to be produced at human input rates. Usually prior art document search systems assume that there is a desire to make the documents available on networks and in computers using only the amount of redundancy needed for information integrity and user convenience.
There is an advantage to both publishers and searchers in using indexing schemes that are standard, consistent, and independent of time of search and particular physical repository. Indexing techniques that are opaque and variable according to time and place would defeat the purposes of interested parties. Indexing systems for document retrieval systems must be highly reliable, as a basic measure of their quality-of-service. In general, people would distrust a system that provided them with different, and incorrect, results at different times or from different sites, even occasionally.
Prior art on-line document retrieval systems still largely assume limits in both computer power and network bandwidth. The algorithms and technology still reflect these prevailing assumptions. In particular, since power and bandwidth resources were scarce, closed, and closely held, there was a low tolerance for conspicuously frivolous uses of them.
Since that time, however, dramatic improvements in computer power and network bandwidth have weakened a number of the above assumptions. The digital "information explosion" was made possible by the rapid growth of secondary storage, processing power, and public networking. But it has been followed by several kinds of "information pollution." Computers can duplicate and propagate information much more cheaply and quickly than human beings. This has always been true, of course, but only recently have these capabilities become inexpensive enough to also provide opportunities for people to inconvenience others.
In particular, there is now information in, or appearing via, computers that concerned parties can not easily avoid, however much they might wish to. Such information is unlikely to be efficiently indexed in any database. Some promoters wishing to reach interested audiences have found ways to actively present information to many people. They show little concern for the large numbers of uninterested parties they also reach in the process. An example of this is the embedding of popular-but-irrelevant keywords in invisible text on web pages, to increase hit ratios.
One prior art method of filtering "information pollution" is by comparing every suspect message word for word against a list of messages thought to be undesirable. Such an approach can, however, be easily frustrated by automating the production of minor changes to the text. Such changes might include changing the order of phrases, sentences and paragraphs, without changing their meaning. Such permutations can be made at a cost not significantly greater than that required for simply copying the text. Only a small amount of extra text preparation effort and simple software tools are required. Given motives to do so, text permutation is an obvious and easy step for those determined to reach audiences that are employing naive content-matching filters to reduce such material.
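The ease of evading such word-for-word comparison can be shown concretely. The blocklist entry and message below are hypothetical, and the filter is a deliberately naive illustration of the prior art approach described above, not any particular product.

```python
# Hypothetical illustration: a naive filter that blocks a message only if
# it matches a known undesirable message verbatim, and a trivial
# permutation that evades it at essentially the cost of copying.
BLOCKLIST = {"buy cheap pills now. limited offer."}  # assumed entry

def naive_filter(message):
    # True if the message is an exact copy of a known undesirable message.
    return message in BLOCKLIST

sentences = ["buy cheap pills now.", "limited offer."]
original = " ".join(sentences)            # caught: verbatim copy
permuted = " ".join(reversed(sentences))  # same meaning, reordered: passes
```

Because the permutation preserves meaning while defeating exact comparison, the filter's blocklist would have to enumerate every ordering of every message, which grows factorially with the number of movable units.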
Since much of the information that constitutes "information pollution" is of little value compared to the time and resources consumed in propagating it, and since these costs are not, for the most part, borne by the information providers, the information typically has only time value, if any. Long term storage in databases is not a goal for its propagators.
Another prior art system indexes documents for active searches by interested users. However, such indexing can delay recently-received material. With "information pollution," the information source is typically intrusive or obstructive, since undesired information is often not immediately distinguishable from desired information. Therefore, indexing is not viable. Filtering of such information requires "real time" response--response within a short amount of time.
Existing document indexing, storage, and retrieval systems are designed under assumptions that are almost directly the opposite of any system that might be used to ameliorate "information pollution" problems.