Computerized categorization of text documents has many real-world applications. One example is email filtering, in which a computer detects the messages that are relevant to the categories of interest to the receiver. Another example is news or message routing, in which a computer routes messages and documents to the recipients who deal with the subject matter of those messages. Other applications are automatic document organization and automatic information retrieval. Search engines can use computerized categorization to parse a query and to find the most closely related responses.
The standard approach for computerized categorization is to build a classifier engine from a large set of documents that is referred to as a training set. The training set contains a collection of documents that were previously categorized, for example by human reviewers. Typically a set of categories is defined and the reviewers determine which category or categories each document belongs to. The categories may be distinct or may be interconnected; for example, the categories may have a hierarchical structure in which categories are subdivided into subcategories. An example of such a training set is the Reuters-21578 collection.
The words in the documents of the training set are analyzed to form a feature vector for each document, which lists the words of the document and the frequency with which each word is used in the document. An induction algorithm is applied to the collection of feature vectors representing all the documents of the training set to produce a categorization engine referred to as a classifier. The classifier's job is to accept as input a document feature vector created for a specific document and to provide as output the category or categories whose feature vectors most closely match the input document vector. Typically, the classifier employs statistical algorithms such as support vector machines (SVM) or k-nearest neighbors (KNN) to determine the closest matching vectors and the required categories.
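The steps in the preceding paragraph can be illustrated with a minimal sketch, given here only as an illustration under stated assumptions. It builds term-frequency feature vectors and applies a simple k-nearest-neighbors vote using cosine similarity. The function names (`feature_vector`, `cosine`, `train`, `classify`), the tokenization rule, the choice of cosine similarity, and the tiny hand-made training set are all assumptions introduced for this example; they are not taken from any particular classifier implementation.

```python
import math
import re
from collections import Counter

def feature_vector(text):
    """Map a document to a sparse term-frequency vector (word -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train(labeled_docs):
    """For KNN, 'induction' amounts to storing the labeled feature vectors."""
    return [(feature_vector(text), category) for text, category in labeled_docs]

def classify(classifier, text, k=3):
    """Return the majority category among the k most similar training vectors."""
    query = feature_vector(text)
    ranked = sorted(classifier, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, hand-made training set (assumption for illustration only).
TRAINING_SET = [
    ("wheat and corn prices rose on export demand", "grain"),
    ("farmers expect a record wheat harvest", "grain"),
    ("corn exports fell as the harvest was delayed", "grain"),
    ("the central bank raised interest rates", "money"),
    ("interest rates and the dollar moved higher", "money"),
    ("the bank cut its lending rates", "money"),
]

classifier = train(TRAINING_SET)
print(classify(classifier, "bank interest rates"))     # money
print(classify(classifier, "wheat and corn harvest"))  # grain
```

A practical system would typically weight terms (for example by inverse document frequency) and could return several categories per document; the sketch returns a single category to keep the voting step easy to follow.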
In a general application a computer is provided with documents as input and is required to determine the category or categories to which each document is most closely related. The documents are handled by the computer as a bag of words (BOW) without any specific order. The computer analyzes the words appearing in the received documents and produces a document vector representing the content of each document. The document vector is provided to the classifier produced from the training set to determine which category or categories have feature vectors with the closest match to the feature vector of the received document. “Machine Learning in Automated Text Categorization” by Fabrizio Sebastiani, published in ACM Computing Surveys, 34(1):1-47, 2002, is an example of a publication that describes computerized categorization of this kind.
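As a small illustration of what handling a document "without any specific order" implies, the following sketch reduces two documents to word counts and compares the results. The function name `bag_of_words` and the tokenization rule are assumptions made for this example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Reduce a document to word counts; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

doc_a = "Dog bites man"
doc_b = "Man bites dog"

# The two documents differ in meaning but yield identical document vectors,
# so a classifier working only on these vectors cannot tell them apart.
same_vector = bag_of_words(doc_a) == bag_of_words(doc_b)
print(same_vector)  # True
```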
When comparing the categorization results of computerized methods with the desired optimal categorization results, it has been found that the computerized methods have reached a performance barrier due to their lack of world knowledge. Typically a human classifier uses knowledge external to the words in the document to classify the document, for example recognition of the name of a company and what the company deals with, or recognition of the name of a person and the ideas the person presents. Additionally, a human classifier uses the context and usage of a word to resolve multiple meanings (i.e., polysemy), for example to decide whether the word “jaguar” refers to an animal or a car. Similarly, a human classifier uses external knowledge to handle synonymy, for example to determine that two documents belong to the same category although they use different terminology.
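The synonymy and polysemy problems can be made concrete with a short sketch that lists the words two documents have in common, which is all a purely word-based comparison has to work with. The helper names (`words`, `shared_terms`) and the example sentences are hypothetical and chosen only for illustration.

```python
import re

def words(text):
    """Return the set of distinct lowercase words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))

def shared_terms(doc_a, doc_b):
    """Words that appear in both documents."""
    return words(doc_a) & words(doc_b)

# Synonymy: same topic, different terminology, so no word overlap at all.
synonymy = shared_terms(
    "automobile makers report strong sales",
    "car manufacturers announce record revenue",
)

# Polysemy: different topics, yet the only shared word suggests a match.
polysemy = shared_terms(
    "jaguar hunts prey at night",
    "jaguar unveils new sedan",
)

print(synonymy)  # set()
print(polysemy)  # {'jaguar'}
```

In the first pair a word-based method sees no evidence that the documents are related; in the second pair it sees evidence of a relation that a human reader would reject.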
Various attempts have been made to enhance the capabilities of the computerized bag of words method. Sam Scott, in a computer science thesis submitted to the University of Ottawa, Ontario, Canada in 1998, titled “Feature engineering for a symbolic approach to text classification”, describes the use of “WordNet” (an electronic thesaurus from Princeton University) to add synonymous words and hypernyms to the feature vectors of the input documents and to the feature vectors of the training set documents, in order to enhance recognition of words related to the words used to describe a category. In WordNet each word is provided with a list of words that are synonymous with, super-ordinate to, or sub-ordinate to the specific word. It should be noted that some of the words introduced by WordNet may have meanings that fit the context of the document being analyzed, while others may be unrelated to that context. In the thesis Scott explains that the addition of words from WordNet did not provide the expected improvement in classifying documents with the Reuters data set (see, e.g., page 86, “why didn't it work?”).
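The kind of feature vector expansion described above can be sketched as follows. This is not Scott's implementation and does not use WordNet itself: `TOY_THESAURUS` is a hypothetical, hand-made stand-in for a thesaurus lookup, and the decision to give each added word the same count as the word that triggered it is an assumption made for this example.

```python
import re
from collections import Counter

# Hypothetical stand-in for a thesaurus lookup (not actual WordNet data).
# Each word maps to related words across ALL of its senses.
TOY_THESAURUS = {
    "car": ["automobile", "vehicle"],
    "automobile": ["car", "vehicle"],
    "bank": ["institution", "slope"],  # financial sense and riverside sense
}

def bag_of_words(text):
    """Reduce a document to word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def expand(vector, thesaurus):
    """Add related words to a feature vector, ignoring document context."""
    expanded = Counter(vector)
    for word, count in vector.items():
        for related in thesaurus.get(word, ()):
            expanded[related] += count  # weighting choice is an assumption
    return expanded

doc_1 = expand(bag_of_words("car sales rose"), TOY_THESAURUS)
doc_2 = expand(bag_of_words("automobile revenue grew"), TOY_THESAURUS)
shared = set(doc_1) & set(doc_2)

# Expansion is context-blind: a finance document also acquires "slope".
finance_doc = expand(bag_of_words("the bank raised rates"), TOY_THESAURUS)

print(sorted(shared))            # ['automobile', 'car', 'vehicle']
print("slope" in finance_doc)    # True
```

The first result shows the intended benefit: two documents with no words in common become comparable after expansion. The second shows the drawback noted above: words from senses unrelated to the document's context are added as well.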