In the past, many different systems of organization have been developed for categorizing different types of items. The items have ranged from material items, such as different types of screws to be placed into different bins, to knowledge items, such as books to be placed in order in the Dewey Decimal System. For ease of understanding, the categorization of items will be described as the categorization of documents, although it will be understood that all types of material and knowledge items are included in the term “documents”.
The earliest systems used manual assignment of documents to categories, for example, by human experts. This is currently the dominant method, which is used in libraries, as well as by popular Internet search engine companies. Some companies have warehouses of people who do the categorization.
The primary disadvantage of manual assignment is the fact that it is labor-intensive; it requires human resources approximately proportional to the number of documents that need to be categorized. In addition, it is somewhat error-prone and may lead to inconsistencies if people are assigning documents to categories based on different criteria, different interpretations of criteria, or different levels of expertise.
To be less subjective, rule-based assignment of documents to categories, including rules based on keywords, has been developed for use with computer systems. This approach uses rules such as “if the document contains the words ‘football’, ‘goal’, and ‘umpire’ and not the word ‘national’, then assign it to the category ‘local football’”.
Mostly, human domain experts author these rules, possibly with the aid of keyword identification tools (such as word counters), to encode their knowledge. These rules usually consist of Boolean combinations of keyword occurrences, possibly modified by counts (such as “if the term ‘national’ is used at least 5 times, then assign to ‘national baseball’”). Because these rules can be executed automatically, this solution can be used to automatically assign documents to categories.
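The kind of Boolean keyword rule described above can be sketched as follows. This is a minimal illustration only: the rule table layout, the function names `tokenize` and `categorize`, and the simple word-splitting are assumptions made for the example, not a description of any particular rule-based system.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical rule table. Each rule is:
# (category, words that must appear, words that must not appear, minimum counts)
RULES = [
    ("local football", {"football", "goal", "umpire"}, {"national"}, {}),
    ("national baseball", set(), set(), {"national": 5}),
]

def categorize(text, rules=RULES):
    """Return every category whose Boolean keyword rule matches the text."""
    counts = Counter(tokenize(text))
    matched = []
    for category, required, forbidden, min_counts in rules:
        has_required = all(counts[w] > 0 for w in required)
        has_forbidden = any(counts[w] > 0 for w in forbidden)
        meets_counts = all(counts[w] >= n for w, n in min_counts.items())
        if has_required and not has_forbidden and meets_counts:
            matched.append(category)
    return matched
```

In this sketch a document may match several rules or none at all, so the result is a list of categories rather than a single category.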
One disadvantage of rule-based assignment is that the accuracy of these rules is often very poor; depending on the authoring of the rules, the same document is assigned either to too many categories, including many wrong ones, or to too few, in which case documents do not appear in the categories they should. Another disadvantage is that the rules are difficult to author and maintain, and the interaction of the rules (so-called “chaining”) is difficult to understand (and debug), so that unexpected assignments of documents to categories may occur. Also, this solution cannot systematically take advantage of explicit statements about the cost of mis-categorization. This method also has no way to give incrementally better categorizations.
Various straight multi-category categorization methods that ignore the topic tree or topic hierarchy have been developed. These methods take all the topics and sub-topics, and treat them as completely independent categories. A “flat” multi-category categorization algorithm, for example, Naïve Bayes or C4.5, is then applied to the flat multi-category problem.
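The flat approach described above can be sketched as follows: the topic tree is first flattened into independent labels, and a flat categorizer is then trained on those labels. The tree contents, the `flatten` helper, and the small `FlatNaiveBayes` class (a multinomial Naïve Bayes with add-one smoothing) are illustrative assumptions, not the implementation of any particular system.

```python
import math
from collections import Counter, defaultdict

# Hypothetical topic tree: each topic maps to its sub-topics.
TOPIC_TREE = {
    "sports": {
        "football": {"local football": {}, "national football": {}},
        "soccer": {},
    },
}

def flatten(tree, prefix=()):
    """List every topic and sub-topic as an independent category label."""
    labels = []
    for name, children in tree.items():
        path = prefix + (name,)
        labels.append("/".join(path))
        labels.extend(flatten(children, path))
    return labels

class FlatNaiveBayes:
    """Multinomial Naive Bayes over flat labels, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                if w in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[label][w]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Note that the categorizer sees only the flat label strings; nothing in `predict` uses the fact that two labels share a parent, which is the loss of hierarchy information discussed next.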
The disadvantages of this solution are that it creates thousands of categories and it does not take advantage of the knowledge about dependencies among categories that is embedded in the topic hierarchy. Thus, it cannot take advantage of similarity in features among “close” categories while zooming in on the features needed to set those categories apart. Another disadvantage of this solution is that there is no version of this method that takes the structure of the hierarchy into account in weighing the cost of mis-categorization.
Another disadvantage of this solution is that it requires large amounts of training data (typically an amount directly proportional to the number of categories). Also, this solution does not compute incrementally more refined answers (allowing graceful degradation of performance if computation is interrupted, or allowing partial results to be shown along the way).
Another method is level-by-level hill-climbing categorization (the so-called “Pachinko machine” method, after the Japanese pinball machine). This method considers the topic hierarchy one level at a time. At each level, there is a categorizer that picks the category into which the document has the highest probability (or some alternative measure of goodness) of fitting, given the features of the document. Once the document is assigned to that category, there is a sub-categorizer that tries to assign the document to a sub-category of the category to which it has been assigned.
A disadvantage of this method is that it is a “hill-climbing” method: it makes a commitment to go down a branch in the topic hierarchy based on local optimality, and can easily get stuck in locally optimal solutions that are not globally optimal. For example, it can go from football, to local football, and to local league when it should be in soccer.
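The level-by-level descent and its local-optimum failure can be sketched as follows. The hierarchy and all of its scores are made-up numbers for a single hypothetical document, and the exhaustive `best_leaf` search is included only as a point of comparison, not as a method proposed in the text.

```python
# Hypothetical hierarchy with made-up scores for one document.
# Each node maps a child name to (probability of child given parent, subtree).
HIERARCHY = {
    "football": (0.55, {
        "local football": (0.6, {
            "local league": (0.7, {}),
            "local cup": (0.3, {}),
        }),
        "national football": (0.4, {}),
    }),
    "soccer": (0.45, {}),
}

def greedy_descent(tree):
    """Follow the highest-scoring child at every level (hill climbing)."""
    path, joint = [], 1.0
    while tree:
        name, (prob, subtree) = max(tree.items(), key=lambda kv: kv[1][0])
        path.append(name)
        joint *= prob
        tree = subtree
    return path, joint

def best_leaf(tree, prefix=(), joint=1.0):
    """Exhaustively score every leaf by the product of scores along its path."""
    best = (None, 0.0)
    for name, (prob, subtree) in tree.items():
        path, score = prefix + (name,), joint * prob
        candidate = best_leaf(subtree, path, score) if subtree else (list(path), score)
        if candidate[1] > best[1]:
            best = candidate
    return best
```

With these numbers the greedy descent commits to football at the top level because 0.55 exceeds 0.45, and ends at local league with a path score of about 0.231, while the soccer leaf alone scores 0.45.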
A further disadvantage is that the level-by-level categorization does not address problems in which documents may legally be assigned to multiple categories or to categories internal to the hierarchy; nor does it take the structure of the hierarchy into account explicitly in weighing the cost of mis-categorization.
Another method is the level-by-level probabilistic categorization. It has been noted that a Naïve Bayes categorizer that works level by level and weights the categorization proposals of each subtree by its Naïve Bayesian probability estimate at the upper level is exactly the same mathematically as doing flat Naïve Bayes (the multi-category categorization) in situations where the exact same feature set is used at each categorization location.
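One way to see why weighting by the upper-level estimate can coincide with a flat decision is the chain rule of probability. The identity below is a sketch under the assumption that both factors are exact posterior probabilities; the symbols d (a document), s (an upper-level category), and c (a sub-category within s) are notation introduced here for illustration and do not appear in the original discussion.

```latex
\[
P(c \mid d) \;=\; P(s \mid d)\, P(c \mid s, d), \qquad c \in s,
\]
\[
\hat{c} \;=\; \arg\max_{c} P(c \mid d)
        \;=\; \arg\max_{s,\; c \in s} P(s \mid d)\, P(c \mid s, d).
\]
```

Under that assumption, the weighted level-by-level choice and the flat choice select the same category.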
Manual assignment and rule-based assignment cannot be fully automated: the former requires human categorizers, and the latter requires human authoring of rules.
Manual assignment and rule-based assignment cannot “learn”; that is, they cannot self-teach when provided with training cases (documents for which known, correct topics are available), nor does their accuracy improve with experience.
None of these methods can easily take into account the cost of mis-categorization, either measured or manually assigned, and none of these methods can take advantage of the hierarchical dependencies among such costs of mis-categorization.
Only manual and rule-based assignment can be used both in topic hierarchies where documents may be assigned only to leaves and in ones where they may also be assigned to interior categories, but these methods have other insurmountable limitations.
Only manual and rule-based assignment can be used both in categorization problems where each document must be assigned to a single category and in ones where any document may be assigned to one or more categories, but these methods have other insurmountable limitations.
None of these methods can incrementally target more promising solutions first (thus potentially eliminating unnecessary computation effort).
None of these methods allows a divide-and-conquer approach to categorization in a topic hierarchy; that is, the global categorization problem cannot be split into multiple categorization sub-problems, with the advantage that each sub-problem can be solved in a more focused, specific manner and with more focused selection and use of document features, potentially leading to more accurate categorization.
These limitations appeared as an insurmountable categorization problem in the real world in a customer support operation that had many different documents describing solutions to specific problems, and for which an intuitive hierarchy of the categories these problems should be in was known. There were printer problems and computer problems, hardware problems and software problems. Under hardware, a problem could be about huge mainframe computers or about small personal computers. With printers, it could be about the laser jets, the small ink jet printers, the all-in-one FAX copiers, and so on. There were about five million support documents and, most significantly, insufficient staff to categorize a full training set as required by previous methods.
Thus, a solution to this problem has been urgently required, has been long sought, and has equally long eluded those skilled in the art.