1. Field of the Invention
This invention generally relates to information storage and retrieval, and more particularly to a method for classifying information in a database using a dictionary and a decision tree model.
2. Description of the Related Art
Advances in computer database technology allow enormous amounts of information to be stored at one location. For all the problems this technology has solved in terms of convenience, it has also created a number of problems, specifically in the area of database information management.
A database is of no use if, when searched, relevant information cannot be retrieved from it. Increases in database capacity, therefore, have necessitated the development of companion technology for the efficient management of database information. Various text mining approaches have been developed for this purpose, one of which is known as text categorization.
Early text categorization systems were human-engineered knowledge-based systems. Because these systems require manual analysis in order to create a set of rules for classifying database information, they have proved to be impractical, especially when applied to large databases. One such system is disclosed in Hayes et al., Adding Value to Financial News by Computer, Proceedings of the First International Conference on Artificial Intelligence Applications on Wall Street, pages 2-8, 1991. See also Hayes et al., TCS: A Shell for Content-Based Text Categorization, Proceedings of the Sixth IEEE CAIA, pages 320-326, 1990.
Automated text categorization methods represent a substantial improvement over their manual counterparts. Typically, these systems use a computer-generated dictionary to form a set of rules for classifying database information. This dictionary is created by extracting patterns from a sample (or training) set of data, which often takes the form of a collection of electronic documents or other descriptive materials.
More specifically, in forming a dictionary, documents in the sample set are transformed into a standard model of features and classes. This is usually performed by encoding the documents in numerical form, which requires a transformation of text to numbers. A uniform set of measurements, or features, is then taken for each document. These features are used to create a dictionary, specifically by checking each document for the presence or absence of specific words or by determining a frequency of occurrence (i.e., count) of such words. These words are then associated with a number of known topics selected, for example, based on subject matter disclosed in the documents. Generalized rules are then formed from these dictionaries, which rules are then used to classify new documents to be added to the database.
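The encoding step described above can be illustrated with a minimal sketch. The tokenizer, the function names (`tokenize`, `build_dictionary`, `encode`), and the two sample sentences are assumptions made for illustration only; they are not taken from any of the cited systems.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def build_dictionary(documents):
    # The dictionary is every distinct word seen in the sample set.
    vocab = set()
    for doc in documents:
        vocab.update(tokenize(doc))
    return sorted(vocab)

def encode(doc, dictionary, binary=True):
    # Text-to-numbers transformation: one feature per dictionary word,
    # either presence/absence (binary) or frequency of occurrence (count).
    counts = Counter(tokenize(doc))
    if binary:
        return [1 if counts[w] > 0 else 0 for w in dictionary]
    return [counts[w] for w in dictionary]

sample = ["Oil prices rose as oil exports fell", "Wheat exports rose"]
dictionary = build_dictionary(sample)
binary_vec = encode(sample[0], dictionary)
count_vec = encode(sample[0], dictionary, binary=False)
```

Every document is measured against the same dictionary, so each one yields a vector of the same length. That is what makes the set of measurements uniform.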
Other methods use feature selection techniques to select a small subset of words that are deemed relevant to a particular topic. For example, the Reuters 21578 collection of newswire articles for the year 1987 contains about 10,000 stemmed words. If feature selection is employed for a given topic, a much smaller subset of words (e.g., a couple of dozen words) may be formed for the topic. Feature selection methods of this type are disclosed in Yang, An Evaluation of Statistical Approaches to Text Categorization, Technical Report CMU-CS-97-127, School of Computer Science CMU, 1997; and in Lewis, Feature Selection and Feature Extraction for Text Categorization, Proceedings of the Speech and Natural Language Workshop, pages 212-217, February 1992.
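As a rough illustration of per-topic feature selection, the sketch below ranks words by how much more often they appear in documents on a topic than in documents off it, and keeps the top k. The scoring rule, the function name `select_features`, and the toy data are assumptions chosen for brevity; they are not the measures evaluated in the cited works.

```python
from collections import Counter

def select_features(docs, labels, k):
    # docs: list of token lists; labels: True where the document is on-topic.
    on, off = Counter(), Counter()
    n_on = sum(labels)
    n_off = len(labels) - n_on
    for tokens, is_on in zip(docs, labels):
        # Count each word once per document (document frequency).
        (on if is_on else off).update(set(tokens))

    def score(word):
        # Assumed score: on-topic document rate minus off-topic document rate.
        p_on = on[word] / n_on if n_on else 0.0
        p_off = off[word] / n_off if n_off else 0.0
        return p_on - p_off

    vocab = set(on) | set(off)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(vocab, key=lambda w: (-score(w), w))[:k]

docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "fall"],
    ["wheat", "export", "rise"],
    ["wheat", "price", "fall"],
]
labels = [True, True, False, False]  # the first two documents are on-topic
topic_words = select_features(docs, labels, 2)
```

In this toy set only "oil" separates the topic from the rest. The remaining words score zero, which is the kind of word a selection step is meant to discard.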
Dictionaries used to perform automated text categorization may be stemmed or unstemmed. The words in a stemmed dictionary are mapped to a common root, e.g., the word "cars" will be mapped to the root "car". No such mapping occurs for words in an unstemmed dictionary.
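The difference between the two kinds of dictionary can be shown with a deliberately naive sketch. The suffix rule in `naive_stem` is an assumption for illustration only (it strips a single trailing "s"); practical stemmers apply far more elaborate rules.

```python
def naive_stem(word):
    # Hypothetical stemmer: drop one trailing "s" from words longer than
    # three letters, leaving words that end in "ss" untouched.
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

def stemmed_dictionary(words):
    # Words sharing a root collapse into a single entry.
    return sorted({naive_stem(w) for w in words})

def unstemmed_dictionary(words):
    # No mapping to a root: each surface form keeps its own entry.
    return sorted({w.lower() for w in words})

words = ["car", "cars", "Cars", "glass"]
stemmed = stemmed_dictionary(words)
unstemmed = unstemmed_dictionary(words)
```

Here the stemmed dictionary holds one entry for "car" and "cars", while the unstemmed dictionary keeps both.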
Dictionaries may also be universal or local. A universal dictionary is one consisting of, for example, all stemmed words in the complete collection of documents in a database. Universal dictionaries have been the most widely used type of dictionary for performing text categorization. They have, however, in many instances, proven to be inaccurate and thus undesirable for purposes of classifying documents, especially for methods which develop rules based on decision trees.
Once a dictionary of words has been developed, a model may be formed for classifying new documents to be added to the database. To improve the accuracy of conventional text categorization methods, decision trees have been employed. Use of decision trees, however, has been shown to have at least one significant drawback, namely that these trees tend to overfit classifications to the documents in the sample set. Various techniques have been developed to limit overfitting, such as tree pruning, but empirical experimentation has shown that the effectiveness of these techniques wanes as the dictionary grows larger. Neural networks and so-called nearest-neighbor methods used to reduce the effects of overfitting have also proved less than satisfactory, particularly because of large computational costs.
Most recently, improvements have been made to text categorization systems which use universal dictionaries. For example, new training variations have emerged that can find a simple scoring solution with a large universal dictionary. See Dagan et al., Mistake-Driven Learning in Text Categorization, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997. Nearest-neighbor methods have been applied with a universal dictionary and a variation that learns thresholds to classify multiple topics in parallel. See Yang, A Comparative Study on Feature Selection in Text Categorization, Proceedings of the International Machine Learning Conference, 1997. In addition, a new method called support vectors has been developed which fits linear, polynomial or radial basis functions. See Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Technical Report, University of Dortmund, 1997.
While each of these improved methods has outperformed earlier text categorization methods to a degree, each also has a number of drawbacks, not the least of which is that it operates in very high dimensions at possibly huge computational expense. Further, many of them apply unusual optimization techniques to their training documents. Perhaps most significantly, however, like earlier-generation text classification methods, their effectiveness for purposes of classifying database information is substantially diminished by their use of universal dictionaries.
A need therefore exists for a text categorization method which classifies database information with greater accuracy and less computational cost, and further which does so without using a universal dictionary and by using a decision tree model which is substantially less susceptible to overfitting effects.