Both enterprises and individuals face the problem of categorizing and storing the documents they own. Especially for enterprises that own a large number of documents, and for individuals who need to process various documents, storing these documents in an orderly manner according to their categories would clearly improve working efficiency. Many statistical categorization methods have been successfully applied to real-world document categorization, such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree and Naive Bayesian methods. With these statistical methods, the precision and recall of document categorization can exceed 85%.
With traditional document categorization technologies, before documents are categorized, a category tree is defined by a domain expert, and each category node in the category tree is associated with a training set of manually labeled documents. A corresponding categorizer is then constructed from the set of training documents. Finally, the documents to be categorized are automatically categorized with the categorizer. However, the precision of the traditional categorization method depends on the number and quality of training samples available in the training set.
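By way of illustration only, the traditional workflow described above (manually labeled training documents per category, construction of a categorizer, then automatic categorization) may be sketched as follows. The sketch assumes a simple multinomial Naive Bayesian categorizer with Laplace smoothing; the function names, the category names and the toy training documents are hypothetical and are not taken from any particular system.

```python
import math
from collections import Counter

def train_naive_bayes(training_set):
    """Build a model from {category name: [manually labeled documents]}."""
    total_docs = sum(len(docs) for docs in training_set.values())
    priors, word_counts, vocabulary = {}, {}, set()
    for category, docs in training_set.items():
        # Count every word occurring in the training documents of this category.
        counts = Counter(word for doc in docs for word in doc.lower().split())
        word_counts[category] = counts
        vocabulary.update(counts)
        priors[category] = len(docs) / total_docs
    return priors, word_counts, vocabulary

def categorize(document, model):
    """Return the category with the highest log-probability for the document."""
    priors, word_counts, vocabulary = model
    best_category, best_score = None, -math.inf
    for category, prior in priors.items():
        counts = word_counts[category]
        total_words = sum(counts.values())
        score = math.log(prior)
        for word in document.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical training set: two category nodes, two labeled documents each.
training_set = {
    "PC": ["desktop pc with fast processor", "laptop pc keyboard and screen"],
    "Server": ["rack server for data center", "server cluster with storage"],
}
model = train_naive_bayes(training_set)
print(categorize("new laptop screen", model))
print(categorize("storage server", model))
```

With only two training documents per category, the categorizer can recognize only words it has already seen, which illustrates why the result depends on the number and quality of the training samples.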
In the article “A re-examination of text categorization methods” by Yiming Yang and Xin Liu, Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999, five statistical categorization methods were tested: SVM (Support Vector Machine), KNN (K-Nearest Neighbor), LLSF (Linear Least-Squares Fit), NN (Neural Network) and NB (Naive Bayesian). As recorded in the article, the tests with Reuters-21578 showed that for categories containing many training samples (more than 300), the precision and recall of the above methods are good, while for categories containing few training samples (fewer than 10), the precision and recall of the above methods are quite poor.
In practice, the distribution of training samples among the categories of a category tree is often uneven, with some category nodes having only a few training samples. According to the statistics in the article, in the ApteMod version of Reuters-21578, the most common category is “earn”, with 2,877 training documents, but 82% of the categories have fewer than 100 instances, and 33% of the categories have fewer than 10 instances. As recorded in the article, the test results showed that the performance of the above methods is a function of the training-set category frequency. For categories with fewer than 10 training documents, the macro-averaged F measure is less than 0.2, while for categories with a training-set frequency of more than 2,000, the macro-averaged F measure can reach 0.9 or more. It can thus be seen that, with a small training set, statistical methods do not work well.
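For reference, the macro-averaged F measure mentioned above gives each category equal weight, so poorly performing rare categories lower it sharply, whereas a micro-averaged measure pools the decisions of all categories and is dominated by the frequent ones. The following minimal sketch illustrates the difference; the per-category counts are hypothetical and are not figures from the cited article.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """F measure (harmonic mean of precision and recall) for one category."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_averaged_f(per_category_counts):
    """Average the per-category F measures; every category weighs the same."""
    scores = [f_measure(*counts) for counts in per_category_counts.values()]
    return sum(scores) / len(scores)

def micro_averaged_f(per_category_counts):
    """Pool the counts of all categories first, then compute one F measure."""
    tp = sum(c[0] for c in per_category_counts.values())
    fp = sum(c[1] for c in per_category_counts.values())
    fn = sum(c[2] for c in per_category_counts.values())
    return f_measure(tp, fp, fn)

# Hypothetical counts as (true positives, false positives, false negatives):
# one category with many test documents and one with very few.
counts = {
    "frequent": (90, 10, 10),
    "rare": (1, 4, 9),
}
macro = macro_averaged_f(counts)
micro = micro_averaged_f(counts)
print(f"macro-averaged F: {macro:.3f}")
print(f"micro-averaged F: {micro:.3f}")
```

In this hypothetical case the rare category pulls the macro-averaged value far below the micro-averaged value, which is why a macro-averaged figure exposes weak performance on categories with few training samples.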
Furthermore, all the above algorithms are based on a pre-defined and well-structured category tree, each category of which has been manually configured with tens or hundreds of training samples. However, regardless of how sophisticated the pre-defined category tree is, it is highly unlikely that any particular category tree defined by an expert can fully satisfy the degree of detail required by a user. In most cases, an ordinary user would treat a category tree like the file folder hierarchy on his hard disk, and would hope to manage the category tree in the same customized and personalized manner as file folders. Therefore, a general application system should allow a user to define his personalized category tree arbitrarily, and should also allow the user to introduce semantic inconsistency into such a category tree.
For example, the user may at first define a sub-tree containing the categories “PC” and “Server”, and want to put documents related to IBM products into this sub-tree, i.e. put documents related to IBM PC into the category “PC” and documents related to IBM Server into the category “Server”. With the passage of time, however, the user may want to collect some documents about DELL PC in the category “PC” as well. This operation introduces semantic inconsistency into the personalized category tree. Traditional categorization methods cannot place the semantically inconsistent documents about DELL PC into the category “PC”, and thus cannot realize such a personalized category tree.
Therefore, a user may desire to create arbitrarily a personalized category tree that is similar to his file folder hierarchy, and to map freely onto this personalized category tree a semantic structure that meets his needs, without being limited by traditional semantic consistency. At the same time, the user may desire to avoid the lengthy, time-consuming and laborious work of manually specifying a large number of training samples. Personalized document categorization that satisfies personal needs could thereby be realized.