Document categorization is the problem of ranking documents with respect to their relevancy to a given category. For example, email documents may be ranked with respect to their relevancy to a particular category in order to provide a user with a list of relevant emails sorted from the most relevant to the least relevant. Furthermore, the email documents having a ranking score that exceeds a given threshold may be labeled as “relevant” to the category. The email documents labeled as “relevant” may then be automatically moved or copied to a specified folder without intervention by the user. In one approach, the available categories are derived from a training set of documents with predefined relevance tags.
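The ranking and thresholding described above can be illustrated with a minimal sketch. The function names, message identifiers, scores, and threshold value below are hypothetical and chosen only for illustration; how the scores are produced is left open here.

```python
from typing import Dict, List, Tuple


def rank_documents(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (document id, score) pairs sorted from most to least relevant."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def label_relevant(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return ids of documents whose score exceeds the threshold, best first."""
    return [doc_id for doc_id, s in rank_documents(scores) if s > threshold]


# Hypothetical relevancy scores for three emails with respect to one category.
email_scores = {"msg-a": 0.91, "msg-b": 0.15, "msg-c": 0.62}

ranked = rank_documents(email_scores)
relevant = label_relevant(email_scores, threshold=0.5)
```

Under these assumed scores, the emails in `relevant` are the ones that could be moved or copied to the category's folder automatically.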
The Large Margin Perceptron Learning Algorithm (LMPLA) is one such approach, which has been known since at least 1973. See R. Duda and P. Hart, “Pattern Classification and Scene Analysis”, Wiley, 1973. An LMPLA is a simple algorithm for learning the parameterization of a linear classifier. That is, an LMPLA determines the weighting parameters $w_i$ for a linear classifier function $f(x) = \sum_i w_i x_i$ that computes the relevancy of the document. The weighting parameters are determined from a training set of documents having known relevancies.
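As a rough illustration of how such weighting parameters might be learned, the sketch below implements a generic perceptron with a margin. It is not necessarily the exact procedure given by Duda and Hart or used in any particular LMPLA. The update rule, the `margin`, `learning_rate`, and `max_epochs` parameters, the +1/-1 label encoding, and the toy training set are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear classifier: the weighted sum of the document's features."""
    return sum(w * x for w, x in zip(weights, features))


def train_margin_perceptron(
    examples: List[Tuple[Sequence[float], int]],
    margin: float = 1.0,
    learning_rate: float = 1.0,
    max_epochs: int = 100,
) -> List[float]:
    """Learn weights from (features, label) pairs; label is +1 or -1.

    A document triggers an update whenever label * score does not exceed
    the margin. Training stops after an epoch with no updates, or after
    max_epochs (assumed cap, since convergence is not guaranteed).
    """
    weights = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        updated = False
        for features, label in examples:
            if label * score(weights, features) <= margin:
                # Move the weights toward (or away from) this document.
                for i, x in enumerate(features):
                    weights[i] += learning_rate * label * x
                updated = True
        if not updated:
            break
    return weights


# Toy training set (assumed): three binary features per document,
# +1 marks a relevant document and -1 a non-relevant one.
training_set = [
    ([1.0, 0.0, 1.0], +1),
    ([1.0, 1.0, 0.0], -1),
    ([0.0, 0.0, 1.0], +1),
    ([0.0, 1.0, 0.0], -1),
]
weights = train_margin_perceptron(training_set)
```

On this small, linearly separable set the loop stops after its second pass. The epoch cap is there because a plain loop of this kind has no guarantee of terminating on training sets that cannot be separated.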
Owing to its simplicity, the LMPLA is faster and less resource-intensive than many more sophisticated text classifiers in use today. However, prior LMPLAs do not perform well for significantly unbalanced datasets (i.e., where the number of relevant documents is orders of magnitude larger than the number of non-relevant documents, or vice versa), which are common in many real-world applications. In addition, prior LMPLAs do not converge for all training sets.