1. Field of the Invention
The invention generally relates to a text classifier for classifying a given text into one or more of a set of predetermined categories and, more specifically, to a method and system for generating and training (or optimizing) parameters for use in such a text classifier.
2. Description of the Prior Art
Text data stored in computer-based systems are increasing in amount and variety day by day. Such stored natural language text data include academic theses, patent documents, news articles, etc. In order for the stored text data to be effectively utilized as information, it is necessary to classify each item of the stored text data into an appropriate category or categories. For this purpose, various types of text classifiers have been proposed so far.
The present invention relates to a text classification technique, inter alia, of the type that uses a vector space. Vector space-based text classification techniques are disclosed in, for example:
U.S. Pat. No. 5,671,333 issued Sep. 23, 1997 to J. A. Catlett et al., entitled “Training apparatus and methods”;
U.S. Pat. No. 6,192,360 issued Feb. 20, 2001 to S. T. Dumais et al., entitled “Methods and apparatus for classifying text and for building a text”, which introduces a variety of classification techniques including the theory and operation of Support Vector Machines;
Japanese patent unexamined publication No. 11-053394 (1999), by N. Nomura, entitled “Device and method for document processing and storage medium storing”; and
Japanese patent unexamined publication No. 2000-194723 (2000), by K. Mitobe et al., entitled “Similarity display device, storage medium stored with similarity display program, document processor, storage medium stored with document processing program and document processing method”.
All of the references cited above are incorporated herein by reference.
In vector space-based text classifiers, an M-dimensional vector space is spanned by a basis comprising a set of vectors V1, V2, . . . , VM corresponding to M words W1, W2, . . . , WM constituting a dictionary. An object or text to be classified is expressed as a point in the vector space. That is, a text or document to be classified is expressed as a feature vector (or document vector) which is a linear combination of the basis (V1, V2, . . . , VM). Each of the components of a feature vector of a given text is expressed by using the frequency of occurrences, in the given text, of the word associated with the component. Each of the categories in a category set into which an object text is classified is expressed by a reference vector defined for the category. Again, each reference vector is expressed as a linear combination of the basis (V1, V2, . . . , VM). The degree of closeness of a given text to a class or category is calculated by finding an inner product of the feature vector of the given text and the reference vector for the category, or by finding a distance between the two vectors. Whether the given text belongs to the category or not is determined on the basis of the calculated degree of closeness.
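By way of illustration only, the vector space representation described above may be sketched as follows. The three-word dictionary, the sample text, the reference vector and the function names are hypothetical and are not taken from the cited references; whitespace tokenization is likewise an assumption.

```python
import math
from collections import Counter

def feature_vector(text, dictionary):
    """Build a feature vector whose i-th component is the number of
    occurrences in the text of the i-th dictionary word."""
    counts = Counter(text.lower().split())  # assumed: whitespace tokenization
    return [counts[word] for word in dictionary]

def inner_product(u, v):
    """Degree of closeness measured as an inner product."""
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    """Degree of closeness measured as a distance (smaller is closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical dictionary W1..W3 and reference vector for one category.
dictionary = ["patent", "vector", "news"]
reference = [1.0, 1.0, 0.0]

doc = feature_vector("vector space patent vector", dictionary)
closeness = inner_product(doc, reference)
distance = euclidean_distance(doc, reference)
```

Either measure may then be compared with a threshold to decide whether the text belongs to the category.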
The dimension of the feature vectors may be reduced by applying a lower rank approximation through singular value decomposition to a document-word matrix obtained by arranging the feature vectors of the documents in a set of documents to be classified. Each component of such a dimension-reduced feature vector for an object document reflects not the frequency of a word itself but the extent to which the object document relates to a set of (weighted) words. In this case, mathematical operations such as distance calculations, inner product calculations and so on are possible in the same manner as in the case of the original vector space.
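For illustration, the lower rank approximation may be written as follows. The symbols are assumptions introduced here, not notation from the cited references: A is the N-by-M document-word matrix whose rows are the feature vectors of N documents, k is the reduced dimension (k < M), U_k, Σ_k and V_k hold the k largest singular values and the corresponding singular vectors, and d is the M-dimensional feature vector of one document written as a row vector.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}},
\qquad
\hat{d} = d\, V_k ,
\qquad
\hat{d}_i \cdot \hat{d}_j = d_i\, V_k V_k^{\mathsf{T}}\, d_j^{\mathsf{T}}
```

Each column of V_k is a set of word weights, so each of the k components of the reduced vector measures how strongly the document relates to one such weighted set of words, and inner products between reduced vectors are computed exactly as in the original space.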
In a vector space-based classifier, the decision on whether a document belongs to a particular category depends on the reference vectors associated with the respective categories and on the threshold of the degree of closeness within which the document is classified into the particular category. The components of the reference vectors and the threshold values of the degrees of closeness for all the categories of a set of categories are called “classification parameters”. In order to achieve accurate classification, the classification parameters have to be properly determined or optimized.
In conventional parameter training, samples (i.e., documents selected for training) are classified by using a classifier with roughly determined initial classification parameters. The classification result is reviewed, and the classification parameters are modified accordingly. This trial-and-error process is iterated until satisfactory classification is reached. The modification of classification parameters is achieved either by an operator directly modifying the parameters himself or herself, or by an operator correcting the classification results and the classifier recalculating the parameters through machine learning based on the operator's corrections.
However, in direct modification schemes, it is difficult for the operator to know which of a large number of parameters to modify and how to modify the one or more parameters selected for modification. Also, in classification result correcting schemes, it is difficult for the operator to know which of a large number of classification results to correct. These difficulties make the classification parameter modification a time-consuming task, which does not necessarily yield desirable classification parameters.
The present invention has been made to overcome the above and other problems in the art.
What is needed is a classification parameter generating method and system for enabling the operator to train the classification parameters interactively and effectively through various data analysis and selection tools.
What is needed is a classification parameter generating method and system that can be used for the case where each of the reference vectors for the categories is considered to point to statistically distributed points instead of a fixed point.
What is needed is a classification parameter generating method and system capable of calculating hitting rates for the samples that have been reviewed. The hitting rate for a category Cr is the ratio of the number of documents whose calculated degree of membership (CDOM) and evaluated CDOM equal each other for the category Cr to the number of documents whose CDOM for the category Cr has been evaluated.
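As a minimal sketch of the hitting rate defined above, assume that each document is represented for one category Cr by a pair consisting of its CDOM and its evaluated CDOM, with None standing for a document that has not yet been evaluated. The function name and this data layout are hypothetical.

```python
def hitting_rate(records):
    """Return the hitting rate for one category.

    records: list of (cdom, evaluated_cdom) pairs; evaluated_cdom is None
    when the operator has not evaluated the document (assumed convention).
    Returns None when no document has been evaluated.
    """
    evaluated = [(c, e) for c, e in records if e is not None]
    if not evaluated:
        return None
    hits = sum(1 for c, e in evaluated if c == e)
    return hits / len(evaluated)

# Four documents, three evaluated, two of which agree with the calculation.
sample = [(1, 1), (0, 1), (1, None), (0, 0)]
rate = hitting_rate(sample)
```

Unevaluated documents are excluded from both the numerator and the denominator, in line with the definition.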
What is needed is a classification parameter generating method and system with sample set generating and expanding capabilities. What is needed is a text classifier that uses a plurality of sets of classification parameters.
What is needed is a text classifier for determining whether a given text belongs to a specified category.
According to the principles of the invention, a method of and system for generating a set of parameters for use in determining whether a given document belongs to a specified one of a plurality of predetermined categories is provided. The system comprises a set of documents, each document having an identifier (ID); a document data set containing a record for each document, which record contains a document ID of the document and a feature vector representing features of the document in a predefined vector space; and a category data set containing a record for each category, which record contains a category ID of the category, a category name and the set of parameters. The parameters include a reference vector representing features of the category in the predefined vector space and a threshold value determined for the category. In this system, a membership score indicative of whether the document belongs to the specified category is calculated for each document by using the feature vector of the document, the reference vector of the specified category and the threshold value of the specified category. An evaluation sample selection screen enables an operator to interactively enter various command parameters for selecting documents for which the calculated membership scores are to be evaluated. In response to an input of one of the command parameters, information useful for the selection of documents is visually presented to the operator. An evaluation value input screen shows the selected documents and permits the operator to enter an evaluation value for each of the displayed documents. The entered evaluation values are then reflected in the reference vector of the specified category.
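The category record, the membership score and the reflection of an evaluation value may be sketched as follows. This summary does not give the scoring formula or the update rule, so both are assumptions made for illustration: the score is taken as cosine similarity minus the threshold, and the update is a simple Rocchio-style adjustment. All names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Category:
    category_id: str
    name: str
    reference: list   # reference vector in the predefined vector space
    threshold: float  # threshold value determined for the category

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def membership_score(feature, category):
    """Assumed form: similarity minus threshold; a value >= 0 is read as
    'the document belongs to the category'."""
    return cosine(feature, category.reference) - category.threshold

def reflect_evaluation(category, feature, belongs, rate=0.1):
    """Assumed Rocchio-style update: move the reference vector toward a
    document the operator marks as a member, away from a non-member."""
    sign = 1.0 if belongs else -1.0
    category.reference = [
        r + sign * rate * f for r, f in zip(category.reference, feature)
    ]

cat = Category("C1", "patents", [1.0, 0.0], 0.5)
doc = [1.0, 1.0]
before = membership_score(doc, cat)
reflect_evaluation(cat, doc, belongs=True)
after = membership_score(doc, cat)
```

Under these assumptions, a positive evaluation raises the score of the evaluated document, and of documents similar to it, at the next calculation.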
The command parameters include a specification of one of a plurality of selection criteria and the range of the selection criterion.
The evaluation samples can be selected by weighting the document distribution with a desired one of predetermined probability distribution functions.
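One possible reading of such weighted selection is sketched below. It assumes that the documents are distributed over their membership scores and that a Gaussian centered on a score of zero is the chosen distribution function, so that borderline documents are favored. The function names, the Gaussian parameters and the sampling procedure are hypothetical.

```python
import math
import random

def gaussian_weight(score, mean=0.0, sigma=0.1):
    """Unnormalized Gaussian weight; largest for scores near `mean`."""
    return math.exp(-0.5 * ((score - mean) / sigma) ** 2)

def select_samples(scores, k, weight_fn=gaussian_weight, seed=0):
    """Draw up to k distinct document IDs, each draw weighted by weight_fn.

    scores: dict mapping document ID -> membership score.
    """
    rng = random.Random(seed)
    pool = dict(scores)
    chosen = []
    while pool and len(chosen) < k:
        ids = list(pool)
        weights = [weight_fn(pool[d]) for d in ids]
        if sum(weights) <= 0:
            weights = None  # all weights underflowed: fall back to uniform
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]      # sample without replacement
    return chosen

scores = {"d1": 0.02, "d2": -0.4, "d3": 0.05, "d4": 0.9}
picked = select_samples(scores, 2)
```

Passing a different weight function to `select_samples` corresponds to choosing a different one of the predetermined distribution functions.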
A further sample selection is possible based on the selected evaluation samples. The further selection may be made on the basis of the degree of similarity to a user-specified sample. The further selection may also be implemented by extracting key words from the selected evaluation samples and making a search with the key words.
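Both forms of further selection may be sketched as follows. The sketch assumes cosine similarity as the degree of similarity and plain word frequency as the key word extraction method; neither choice is stated in this summary, and the function names and the stop word list are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(sample_id, vectors, n):
    """Return the n document IDs most similar to the user-specified sample."""
    sample = vectors[sample_id]
    ranked = sorted(
        ((cosine(sample, v), d) for d, v in vectors.items() if d != sample_id),
        reverse=True,
    )
    return [d for _, d in ranked[:n]]

def top_keywords(texts, n, stopwords=frozenset({"the", "a", "of", "and"})):
    """Extract the n most frequent non-stop words from the selected samples."""
    counts = Counter(
        w for t in texts for w in t.lower().split() if w not in stopwords
    )
    return [w for w, _ in counts.most_common(n)]

def keyword_search(keywords, corpus):
    """Return IDs of documents that contain at least one key word."""
    return [
        d for d, text in corpus.items()
        if any(k in text.lower().split() for k in keywords)
    ]

vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
similar = most_similar("a", vectors, 1)
keywords = top_keywords(["the vector space", "vector of words"], 1)
found = keyword_search(keywords, {"x": "Vector model", "y": "news item"})
```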
The evaluation sample selection is achieved by a comparison between the previous and current calculation results.
The quality of parameters is checked by the hitting rate in the calculated degree of membership (CDOM).
An inventive parameter training system is further provided with the following features: weighting based on variance analysis of the vector components; and expansion of the document set and/or the category set.
In one embodiment, a different set of documents of a suitable number (say, the same number as the sample set 11) is selected from the actual document set to use for training at each cycle of training. In this case, each of the reference vectors is given as a distribution function. The degree of similarity is given as the probability that the document belongs to an area, within the distribution range of the reference vector for the category, defined by a preset threshold.
A text classifier which uses a set of parameters generated according to the present invention is also disclosed.