The computer implemented method and system disclosed herein, in general, relates to document classification. More specifically, the computer implemented method and system disclosed herein relates to classifying a document using confidence grades.
A large quantity of information, public or non-public, is typically available to a user to facilitate the performance of a task. The information available to the user often influences the user's decisions and course of action. The information may be in the form of text stored in a file or a document, and may be stored in a computer readable format. Most often, the information is not classified, that is, the user may be uncertain about the content of the information that is available. This uncertainty may require the user to manually access the complete information. In other instances, the accessible information may be in the public domain. Where the information has been classified, the classification may be inaccurate. For example, a text file containing a description of a cricket bat may be classified into a class representing “cricket sport” instead of “sporting goods”. Correcting such a classification may require a manual check of the information, which consumes time unnecessarily. Therefore, classification of such information prior to the access of the information by the user is advantageous. The information may also need to be confidently classified into one of many classes. The classes may represent the type of data, such as sports, science, politics, amusement, etc.
Classifiers based on different techniques may be used for classifying the information. For example, classifiers utilizing the Bayesian theorem, linguistic methods, a vector space model, etc., may be used to classify a document into one of the classes. The accuracy of such classifiers may be low. A Bayesian classifier utilizes the Bayesian theorem for text classification. Conventionally, the Bayesian classifier uses training documents for the classification, each of the training documents being classified into one of the classes. Based on the training documents, an appropriate target class is determined for a document to be classified. Because perfect training documents are unavailable in some cases, the resultant classification given by the Bayesian classifier may be incorrect. The resultant classification may also occur at a lower position in the array of the classes. Moreover, the posterior probability calculated by the Bayesian classifier may lie close to 0 or 1, and therefore may not indicate appropriate relevance. The absence of an appropriate relevance indication may require the user to verify the results manually.
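By way of illustration only, the tendency of the posterior probability to lie close to 0 or 1 can be reproduced with a minimal sketch of a generic multinomial naive Bayes classifier with add-one smoothing. The sketch is not the method disclosed herein. The function names train and posteriors, the two toy training documents, and the test documents are hypothetical and are assumed solely for this example.

```python
import math
from collections import Counter

def train(docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary.

    docs is a list of (class_label, tokens) pairs (hypothetical toy format).
    """
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for label, tokens in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def posteriors(tokens, class_docs, word_counts, vocab):
    """Return P(class | tokens) for every class, using add-one smoothing."""
    total_docs = sum(class_docs.values())
    log_scores = {}
    for label, n_docs in class_docs.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            score += math.log((counts[tok] + 1) / denom)  # log likelihood
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long documents.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

# Assumed toy training set: one document per class.
TRAINING = [
    ("sporting goods", "bat willow handle grip price".split()),
    ("cricket sport", "bat ball wicket innings match".split()),
]
model = train(TRAINING)

# Twelve tokens that each mildly favor "sporting goods" (likelihood ratio 2 per token).
confident = posteriors(["willow", "handle", "grip", "price"] * 3, *model)
# A single token that occurs equally often in both classes.
ambiguous = posteriors(["bat"], *model)

print(confident)
print(ambiguous)
```

In this sketch, each of the twelve tokens of the first test document favors “sporting goods” by a factor of only two, yet the per-token factors multiply, so the posterior probability for that class exceeds 0.999. The posterior probability therefore reflects the length of the document as much as the strength of the evidence, which is one reason such a value may be a poor indication of relevance.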
Hence, there is a long felt but unresolved need for a computer implemented method and system that classifies a document using confidence grades to eliminate manual intervention for classifying the document.