1. Technical Field of the Invention
The present invention relates in general to the field of machine learning, and in particular to computer-based supervised classification of digital documents.
2. Description of the Related Art
In supervised classification for a single-class model, a knowledge base for calculating a relevant score for each category is created by a statistical method, such as the Naïve Bayes method. The statistical method creates the knowledge base by extracting feature words from training documents that a person has categorized in advance. When automatic categorization is performed, a relevant score of each category for an unclassified document is calculated from the knowledge base, and the unclassified document is assigned to the category with the highest score.
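By way of illustration only, the training and categorization steps described above can be sketched as a minimal multinomial Naïve Bayes classifier. The function names (train, score, classify), the whitespace tokenization, the add-one smoothing, and the sample categories are assumptions made for this sketch and are not taken from any particular prior-art system.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build a knowledge base from (category, text) pairs categorized in advance."""
    word_counts = defaultdict(Counter)  # category -> word -> occurrence count
    doc_counts = Counter()              # category -> number of training documents
    vocabulary = set()
    for category, text in labeled_docs:
        words = text.lower().split()    # assumption: plain whitespace tokenization
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def score(knowledge_base, text):
    """Return a log-probability relevant score for each category."""
    word_counts, doc_counts, vocabulary = knowledge_base
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        s = math.log(doc_counts[category] / total_docs)  # category prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training contribute nothing
            # assumption: add-one (Laplace) smoothing
            s += math.log((word_counts[category][word] + 1)
                          / (total_words + len(vocabulary)))
        scores[category] = s
    return scores


def classify(knowledge_base, text):
    """Assign the document to the category with the highest score."""
    scores = score(knowledge_base, text)
    return max(scores, key=scores.get)


# Hypothetical training documents, categorized in advance by a person.
training = [
    ("finance", "bank loan interest rate"),
    ("sports", "team match goal score"),
]
kb = train(training)
predicted = classify(kb, "loan interest")
```

In this sketch the knowledge base is simply a set of per-category word counts, and the unclassified document is assigned to whichever category yields the largest log score.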
With regard to the English language, processing of uneven description, that is, variation among the normal form, the conjugated forms, the singular form and the plural form of a word, is generally performed by lexical analysis, part-of-speech (POS) tagging, or stemming using a word dictionary. Feature words, such as proper nouns, common nouns, and verbs, are extracted, and a relevant score of a category for a document is calculated from these non-function words.
However, if words are extracted without any processing of the uneven description or any specification of the part of speech, the relevance of the feature words is weakened, which lowers the credibility of the calculated relevant score. For example, if a new document includes the word “solutions” when the word “solution” is recognized as an important word in the training documents of a category X, the presence of the word “solutions” in the new document is not reflected in the relevant score of the category X. This is because “solutions” and “solution” are not recognized as the same word.
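The “solutions”/“solution” mismatch described above can be reproduced with a short sketch. The normalize function below is a deliberately crude, hypothetical plural-stripping rule written for this illustration; it is not a real stemming algorithm, and count_matches is likewise an assumed helper.

```python
def normalize(word):
    """Crude illustrative normalizer: lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def count_matches(feature_words, document, normalizer=None):
    """Count document tokens that match a feature word of the category."""
    tokens = document.split()
    if normalizer is not None:
        # Apply the same normalization to both sides before comparing.
        tokens = [normalizer(t) for t in tokens]
        feature_words = {normalizer(w) for w in feature_words}
    return sum(1 for t in tokens if t in feature_words)


feature_words = {"solution"}          # important word learned for category X
new_document = "our solutions scale"  # hypothetical unclassified document

raw_matches = count_matches(feature_words, new_document)
normalized_matches = count_matches(feature_words, new_document, normalize)
```

Without normalization the exact string comparison finds no match, so “solutions” contributes nothing to the score of category X; with even this simple normalization the two forms are treated as the same word.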
If a knowledge base is created with words extracted by such a simple method, the relevant score that is calculated when the automatic categorization is performed becomes imprecise. As a result, the optimal category for a particular document is sometimes the category with the second highest score rather than the category with the highest score.