To quickly search the vast amount of stored information available from computers, researchers have attempted to summarize that information using methods which automatically categorize information. Such summaries are automatically created by parsing the stored information itself, grouping the data therein into categories, and labeling these categories with terms to be understood by people wishing to know what is in the data.
One advantage of these automatic methods is that new categories can be created with small pre-existing taxonomies that only partially cover the stored information, so that the new categories cover the part of the stored information that the pre-existing taxonomies do not cover. Another advantage is that the category labels themselves serve either to remind people of useful search terms or to suggest ways in which the data itself can be succinctly described.
Both advantages are useful in search engine web portals and other similar user interfaces for people searching for information amid large data stores, such as the world wide web or other large data repositories. However, these automatic methods have frequently occurring flaws. One flaw is that the automatically generated categories may not correspond to the major taxonomic branches that people have in mind. For instance, an automatic categorization of “cars” might result in categories of “rental cars,” “new cars” and “used cars” whereas a person might have in mind a taxonomy of “sport-utility vehicle,” “sports cars,” “station wagons” and “sedans.”
Another flaw is that the automatically generated categories may be too closely related to one another, when compared to the major taxonomic branches that people have in mind. For instance, an automatic categorization of “hybrid cars” might result in categories of “gas-electric,” “electric-hybrid” and “hybrid technology,” all referring to aspects of the same mental concept.
Another flaw is that automatically generated categories may be too obscure, when compared to major taxonomic branches that people have in mind. For instance, an automatic categorization of “concept cars” might result in categories of “coachbuilder,” “entertainment value” and “performance limits.”
Yet another flaw is that the categories might match what people have in mind, but the automatically generated labels for those categories differ from what people have in mind. For instance, a person might have in mind a “cars” taxonomy of “sport-utility vehicle,” “sports cars,” “station wagons” and “sedans.” The automatically generated labels for these categories might be “dual-use,” “performance,” “tailgate” and “saloon.”
All of these flaws have severely limited the use of automatic categorizers, and most of these flaws are inherent in the use of statistical methods. It is well known that semantically significant phrases such as “constitutional rights” and “religious beliefs” are conveyed by their own statistically insignificant terms and by related statistically insignificant terms, and also that statistically significant terms such as “the” and “a” generally carry semantically trivial meanings. The latter are often added to stopword lists, which exclude them from statistical sample sets. However, stopword sets appropriate for categorizing one set of documents may fail utterly for another set. For example, a stopword such as “judge” used for courtroom transcripts may be semantically important for medical research documents. To gather and detect important semantic meanings conveyed by statistically insignificant terms, researchers have attempted to use pre-defined taxonomies to associate documents with pre-defined important meanings. However, the quickly evolving nature of language has rendered such pre-defined taxonomies inappropriate for most uses. For instance, the mental cogitations of a person searching for slightly unfamiliar information generally involve terms unrelated to any pre-defined taxonomy. Only by creating a taxonomy on-the-fly from a person's seemingly unrelated search terms can the implicit goal the person has in mind be detected by an automatic categorizer.
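The stopword limitation described above can be illustrated with a minimal sketch. The two toy corpora, the function name `document_frequency_stopwords` and the 0.8 document-frequency threshold are assumptions made purely for illustration; they are not taken from any cited reference. The sketch derives a stopword list statistically from courtroom-style text and then applies that list to medical-style text.

```python
from collections import Counter


def document_frequency_stopwords(docs, threshold=0.8):
    """Return terms that occur in at least `threshold` of the documents.

    Hypothetical helper: a purely statistical stopword selector.
    """
    doc_count = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        doc_count.update(set(doc.lower().split()))
    return {term for term, n in doc_count.items() if n / len(docs) >= threshold}


# Invented toy corpora, for illustration only.
courtroom = [
    "the judge overruled the objection",
    "the judge instructed the jury",
    "the judge sustained the motion",
]
medical = [
    "clinicians judge the tumor margin by imaging",
    "the trial measured tumor response",
]

# "judge" appears in every courtroom document, so it is treated as a stopword.
stopwords = document_frequency_stopwords(courtroom)

# Reusing that list on the medical corpus silently drops a meaningful term.
filtered = [[w for w in doc.split() if w not in stopwords] for doc in medical]
print(sorted(stopwords))
print(filtered[0])
```

In this sketch the term “judge” is removed from the first medical document even though it carries meaning there, which is the kind of failure that may arise when a stopword set tuned to one set of documents is applied to another.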
Some of the attempts to build on-the-fly automatic categorizers have suggested use of a statistically defined feature-list and an anti-feature-list so that statistically prevalent labels are created which are statistically unlikely to be shared between categories. Such attempts include U.S. Pat. No. 6,938,025 issued on Aug. 30, 2005, and U.S. Pat. No. 6,826,576 issued on Nov. 30, 2004, both to Lulich et al., each of which is incorporated by reference herein. Although suppression of terms via an anti-feature-list can be occasionally useful, the overly broad statistical methods of feature extraction disclosed by Lulich inevitably lead to problems with terms improperly added to the anti-feature-list, or terms improperly added to the feature-list. It will be obvious to one of ordinary skill in the art that the mental path a person traverses to categorize does not involve reading documents from front to back while aggregating a massive hierarchy of feature-lists and anti-feature-lists. A person would instead read documents while accumulating semantically significant covering terms, which cover significantly different semantic concepts. This emphasis on semantics means that rather than blindly compiling a single feature-list and anti-feature-list for a given node in a pre-existing subject hierarchy, a person would create a more fluid and intelligent optimization around what defines a category in the feature-list and what cannot define a category in the anti-feature-list. This intelligence arises from making use of semantic connections which shift while the text is read and a person learns new semantics from the text. As seen in FIG. 3 of Lulich's patent, formation of a feature-list and anti-feature-list happens on a node-by-node basis for nodes within a subject hierarchy. This formation is created relative to feature-lists connected to sibling nodes of the subject hierarchy.
However, the content of feature-lists connected to other sibling nodes frequently is arbitrary with respect to any node of the subject hierarchy. As seen in FIG. 10 of Lulich's '576 patent, CHAOS and LOGIC are sibling nodes under MATH. If a person intelligently categorizes LOGIC, GEOMETRY and CALCULUS to be sibling nodes under MATH, the feature-list of CHAOS as taught by Lulich will improperly limit and thus mischaracterize the meaning of LOGIC. In general, no pre-existing subject hierarchy can properly drive the entire categorization process. Instead, pre-existing hierarchies can at best affect the weight given to specific terms relative to other terms. The formation of a feature-list and anti-feature-list has to be initiated without the prejudices incurred by blindly following any pre-existing subject hierarchy. Thus, Lulich's '576 disclosure at best would result in awkward category formations when applied to general purpose text categorization.
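The dependence on sibling nodes discussed above can be sketched with a deliberately simplified toy. This is not the procedure disclosed in the Lulich patents; the function names `term_counts` and `feature_lists`, the `min_count` cutoff and the sample texts are all assumptions introduced only to show how a feature-list formed relative to siblings may change when the sibling set changes.

```python
from collections import Counter


def term_counts(docs):
    """Count raw term occurrences across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def feature_lists(node_docs, sibling_docs, min_count=2):
    """Toy split (an assumption, not the patented method).

    Frequent node terms that are not frequent among siblings become
    features; frequent sibling terms become anti-features.
    """
    node = {t for t, n in term_counts(node_docs).items() if n >= min_count}
    sibs = {t for t, n in term_counts(sibling_docs).items() if n >= min_count}
    return node - sibs, sibs


# Invented sample texts for three hypothetical nodes under MATH.
logic = ["proof systems and proof rules", "systems of inference rules"]
chaos = ["dynamical systems and attractors", "nonlinear systems attractors"]
geometry = ["angles and proof of congruence", "proof by similar angles"]

# Same LOGIC documents, two different choices of sibling node.
features_vs_chaos, anti_vs_chaos = feature_lists(logic, chaos)
features_vs_geometry, anti_vs_geometry = feature_lists(logic, geometry)

print(sorted(features_vs_chaos), sorted(anti_vs_chaos))
print(sorted(features_vs_geometry), sorted(anti_vs_geometry))
```

With CHAOS as the sibling, the term “systems” is suppressed from the LOGIC features; with GEOMETRY as the sibling, “proof” is suppressed instead. The LOGIC documents are identical in both cases, so in this toy the difference comes entirely from the choice of sibling nodes in the hierarchy.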
A similar awkwardness would result from applying Lulich's earlier U.S. Pat. No. 6,938,025 to general purpose text categorization. By applying document-centered statistical methods to create a content-group and an anti-content group, statistically insignificant but semantically significant terms would fail to join either group, causing the categorization to form around semantically trivial terms. The only statistical remedy to flaws in Lulich's '025 disclosure is application of standard statistical stopword techniques, which as previously discussed cannot work for all sets of documents. In addition, the emphasis on document groups causes problems when a single document, such as a literature survey document, contains content from multiple categories. To properly categorize cross-over documents, automatic categorization has to coalesce around individual phrases, not individual documents.
In general, previous attempts to automatically categorize data have failed because of over-reliance upon: 1) statistical methods and associated methods of stopword lists and statistical distributions; 2) static taxonomic methods and associated methods of traversing out-of-date taxonomies; or 3) methods centered around documents instead of semantics. In order to succeed with general purpose automatic categorization of data, non-statistical, dynamic taxonomic and semantic methods must be employed.