The present invention relates generally to a method and apparatus for determining whether an object containing textual information belongs to a particular category or categories. In addition, the present invention relates to the construction of a classifier that automatically determines (i.e., learns) appropriate parameters for the classifier.
Text categorization (i.e., classification) concerns the sorting of documents into meaningful groups. When presented with an unclassified document, an electronic text categorizer assigns the document to one or more groups of documents. Text categorization can be applied to documents that are purely textual, as well as to documents that contain both text and other forms of data such as images.
To categorize a document, the document is transformed into a collection of related text features (e.g., words) and frequency values. The frequency of text features is generally only quantified and not qualified (i.e., it is not expressed in linguistically interpretable terms). In addition, current categorization techniques such as neural networks and support vector machines classify the textual content of documents using non-intuitive (i.e., black box) methods. It would therefore be advantageous to provide a text categorizer that qualifies the occurrence of a feature in a document and/or provides an intuitive measure of the manner in which classification occurs.
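By way of illustration only, the transformation of a document into text features and frequency values described above may be sketched as follows. The function name `extract_features`, the tokenization rule, and the use of relative frequencies are assumptions made for this example and are not taken from any particular categorizer.

```python
import re
from collections import Counter


def extract_features(text):
    """Map a document to word features and their relative frequencies.

    Hypothetical helper: lowercase alphabetic runs stand in for "text
    features"; a real categorizer may use other features or weightings.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Each value is a bare number: the feature is quantified, but nothing
    # here says whether that frequency counts as "high" or "low".
    return {word: count / total for word, count in counts.items()}


features = extract_features("The cat sat on the mat")
```

In this sketch each feature is paired only with a number, which reflects the limitation noted above: the frequency is quantified, but no linguistically interpretable label is attached to it.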