Automatic text categorization is the task of building text classifiers automatically using machine learning techniques. Such classifiers are systems capable of assigning a text document to one or more thematic categories (or labels) from a predefined set.
Text classification or categorization is a problem in information science in which an electronic document or some quantity of text is assigned to one or more categories, based on its contents. In supervised document classification, some external mechanism (such as human feedback) provides information on the correct classification for documents.
Text classification has become one of the primary methods of organizing online information. Another notable use of text classification techniques is spam filtering, which tries to distinguish spam email messages from legitimate ones. Other applications are also possible, some of which will be described below.
A variety of supervised learning techniques have demonstrated reasonable performance for text classification, including naïve Bayes, k-nearest neighbor, support vector machines, boosting, rule learning algorithms, and maximum entropy.
Maximum entropy is a general technique for estimating probability distributions from data. The principle in maximum entropy is that when nothing is known, the distribution should be as uniform as possible, that is, have maximal entropy. Labeled training data is used to derive a set of constraints for the model that characterizes the class-specific expectations for the distribution. Constraints are represented as expected values of “features,” any real-valued function of an example. A document is represented by a set of word count features. The labeled training data is used to estimate the expected value of these word counts on a class-by-class basis. Improved iterative scaling finds a text classifier of an exponential form that is consistent with the constraints from the labeled data. Entropy is described, for example, in Schneier, B: Applied Cryptography, Second edition, page 234, John Wiley and Sons.
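The relationship between word count features, the exponential form of the classifier, and the uniform (maximal entropy) starting point can be illustrated with a minimal sketch. It rests on stated assumptions: the function names (entropy, predict_proba, train_maxent) and the learning-rate and epoch parameters are hypothetical, and the weights are fitted with plain gradient ascent on the conditional log-likelihood rather than with improved iterative scaling, which targets the same expectation constraints.

```python
import math
from collections import Counter

def entropy(p):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def predict_proba(weights, feats):
    """Exponential-form classifier: P(c | d) is proportional to exp(sum_w weight[c][w] * count(w, d))."""
    scores = {c: sum(wc[w] * n for w, n in feats.items()) for c, wc in weights.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_maxent(docs, labels, epochs=200, lr=0.5):
    """Fit per-class word weights by gradient ascent (an assumption; not iterative scaling).

    The gradient for each (class, word) pair is the word count expected under the
    labeled data minus the word count expected under the current model, so the
    update stops changing when the model matches the class-by-class constraints.
    """
    classes = sorted(set(labels))
    weights = {c: Counter() for c in classes}
    counts = [Counter(d.split()) for d in docs]
    for _ in range(epochs):
        grad = {c: Counter() for c in classes}
        for feats, y in zip(counts, labels):
            probs = predict_proba(weights, feats)
            for c in classes:
                coeff = (1.0 if c == y else 0.0) - probs[c]
                for w, n in feats.items():
                    grad[c][w] += coeff * n
        for c in classes:
            for w, g in grad[c].items():
                weights[c][w] += lr * g / len(docs)
    return weights
```

With all weights at zero, nothing is known and predict_proba returns a uniform distribution over the classes; training moves the model away from uniform only as far as the labeled word counts require.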
Prior work has been performed on building classifiers for call classification based on transcriptions of the complete calls. (See Tang, M., Pellom, B., Hacioglu, K.: Calltype Classification and Unsupervised Training for the Call Center Domain, Proceedings of the Automatic Speech Recognition and Understanding Workshop, St. Thomas, US Virgin Islands, Nov. 30-Dec. 4 (2003), pp. 204-208, incorporated herein by reference). In call classification, the whole call (document) is collected before a decision is made.
Prior work has also been performed on routing customer calls based on the customer response to an open-ended system prompt such as “Welcome to xxx, How may I help you?” (See Kuo, H.-K. J., Lee, C.-H.: Discriminative Training of Natural Language Call Routers, IEEE Trans. on Speech and Audio Processing 11 (1) (2003), pp. 24-35, incorporated herein by reference). In call routing, the whole customer utterance (sentence) is collected before the classifier makes a decision. Manually classified past utterances are used to train the classifier, and new calls are classified and routed based on this classifier.
In cases where the class distribution does not remain stationary, it has been proposed to use incremental learning to learn the non-stationarity of online data streams. (See Katakis, I., Tsoumakas, G., Vlahavas, I.: Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams, ECML/PKDD-2006, International Workshop on Knowledge Discovery from Data Streams, Berlin, Germany, (2006), pp. 107-116, incorporated herein by reference). The approach usually taken to capture non-stationarity is to use a fixed-size or adaptive-size time window, or to weight the data according to age or relevance. The window size is chosen to balance the adaptivity and generalization of the classifier.
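The fixed-size time window can be sketched as follows. This is an illustrative assumption, not the method of any cited work: the class name SlidingWindowNB and its window_size parameter are hypothetical, and a naïve Bayes model with add-one smoothing stands in for whatever classifier is actually trained on the window.

```python
import math
from collections import Counter, deque

class SlidingWindowNB:
    """Naive Bayes text classifier that only remembers the most recent examples."""

    def __init__(self, window_size=100):
        # A deque with maxlen silently drops the oldest example when full.
        self.window = deque(maxlen=window_size)

    def learn(self, text, label):
        self.window.append((text.split(), label))

    def predict(self, text):
        # Rebuild the counts from the current window contents only.
        class_docs = Counter()
        word_counts = {}
        vocab = set()
        for words, label in self.window:
            class_docs[label] += 1
            word_counts.setdefault(label, Counter()).update(words)
            vocab.update(words)
        if not class_docs:
            return None  # nothing learned yet
        total_docs = sum(class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in class_docs.items():
            # Add-one smoothing, with one extra slot for words never seen in the window.
            denom = sum(word_counts[label].values()) + len(vocab) + 1
            score = math.log(n_docs / total_docs)
            for w in text.split():
                score += math.log((word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A small window adapts quickly after the distribution changes but generalizes from few examples; a large window generalizes better but keeps outdated examples longer.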
In pure incremental learning, a dynamically adjustable window is maintained during the learning process. Whenever there is a concept drift, old instances are forgotten by adjusting the window size, and the window size is fixed again once the concept appears to be stable. (See Widmer, G., Kubat, M.: Learning in the Presence of Concept Drift and Hidden Contexts, Machine Learning 23(1) (1996), pp. 69-101, incorporated herein by reference).
The concept-adapting very fast decision tree learner (CVFDT) applies a very fast decision tree learner (VFDT) to build the model incrementally using a sliding window of fixed size. (See Hulten, G., Spencer, L., Domingos, P.: Mining Time-Changing Data Streams, Proceedings of International Conference on Knowledge Discovery and Data Mining (2001), pp. 97-106, incorporated herein by reference; and Domingos, P., Hulten, G.: Mining High-Speed Data Streams, Proceedings of International Conference on Knowledge Discovery and Data Mining, (2000), pp. 71-80, incorporated herein by reference).
For an evolving data stream with event bursts, techniques have been proposed to dynamically decide the window horizon so as to incorporate the long-term or short-term relevance of the data stream. (See Aggarwal, C., Han, J., Wang, J., Yu, P. S.: On Demand Classification of Data Streams, Proceedings of the International Conference on Knowledge Discovery and Data Mining, Seattle, USA, August (2004), pp. 503-508, incorporated herein by reference).
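The idea of forgetting old instances on drift can be sketched with a simple heuristic. This is not the algorithm of Widmer and Kubat; it is a simplified stand-in that treats a high recent error rate as a sign of drift. The class name AdaptiveWindow and the parameters max_size, min_size, check, and drift_error are hypothetical.

```python
from collections import deque

class AdaptiveWindow:
    """Training window that shrinks when recent predictions are mostly wrong."""

    def __init__(self, max_size=200, min_size=20, check=10, drift_error=0.5):
        self.items = deque()
        self.recent_errors = deque(maxlen=check)  # 1 = misclassified, 0 = correct
        self.max_size = max_size
        self.min_size = min_size
        self.drift_error = drift_error

    def add(self, instance, was_misclassified):
        self.items.append(instance)
        self.recent_errors.append(1 if was_misclassified else 0)
        history_full = len(self.recent_errors) == self.recent_errors.maxlen
        if history_full:
            error_rate = sum(self.recent_errors) / len(self.recent_errors)
            if error_rate >= self.drift_error:
                # Suspected concept drift: forget all but the newest instances.
                while len(self.items) > self.min_size:
                    self.items.popleft()
                self.recent_errors.clear()
        # While the concept looks stable, the window grows up to max_size.
        while len(self.items) > self.max_size:
            self.items.popleft()
```

A caller would pass each newly labeled instance together with a flag saying whether the current classifier got it wrong, then retrain on the contents of items.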
Support Vector Machines (SVMs) have been used with a dynamic window size in which the window size is adjusted so that generalization error is minimized. (See Klinkenberg, R., Joachims, T.: Detecting Concept Drift with Support Vector Machines, Proceedings of International Conference on Machine Learning, (2000), pp. 487-494, incorporated herein by reference).
The Incremental On Line Information Network (IOLIN) dynamically adjusts the window size and training frequency based on statistical measures. (See Cohen, L., Avrahami, G., Last, M., Kandel, A., Kipersztok, O.: Incremental Classification of Nonstationary Data Streams, ECML/PKDD-2005 International Workshop on Knowledge Discovery from Data Streams, Portugal, (2005), incorporated herein by reference).
In the systems and methods described above, the classifier generally forgets its past and learns the new distribution based on the changed concepts. It is not clear when such a classifier should begin collecting features or when it should make a final decision.