To improve the accuracy of data classification, the data must first be cleansed. Data cleansing involves removing noise from the data set. The noise may be relevant but unusable, irrelevant, or usable. Relevant but unusable noise may comprise phrases or concepts that are relevant to all categories. For example, consider a data set that is to be classified into subsystems. A phrase such as “not working” in the data set may not require analysis, because any subsystem may be in a “not working” mode, so the phrase does not help distinguish one subsystem from another. Therefore, relevant but unusable noise must be removed to improve data classification accuracy.
On the other hand, usable noise comprises phrases that may be specific to certain categories. For example, the word ‘enter’ may be specific to a user access subsystem, although in general ‘enter’ might be considered a stop word. Therefore, retaining usable noise in the data set is important for accurate data classification. Existing methods of data cleansing fail to remove the relevant but unusable noise while retaining the usable noise.
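The distinction between the two kinds of noise can be illustrated with a minimal Python sketch. It is not a method described in this document. It assumes a simple heuristic chosen only for illustration: a token that appears in every category is treated as relevant but unusable, and a generic stop word that appears in exactly one category is treated as usable. The stop-word list, the sample records, and the function names `classify_noise` and `cleanse` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical generic stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "is", "to", "a", "enter"}

# Hypothetical labelled records: (text, subsystem category).
SAMPLES = [
    ("display not working", "display"),
    ("cannot enter password not working", "user_access"),
    ("printer not working", "printing"),
]


def classify_noise(samples, stop_words=STOP_WORDS):
    """Split noisy tokens into (unusable, usable) sets.

    Assumed heuristic: a token seen in every category cannot separate
    categories (relevant but unusable); a stop word seen in exactly one
    category is category-specific and worth keeping (usable).
    """
    categories = {category for _, category in samples}
    coverage = defaultdict(set)  # token -> categories it occurs in
    for text, category in samples:
        for token in set(text.lower().split()):
            coverage[token].add(category)

    unusable, usable = set(), set()
    for token, cats in coverage.items():
        if cats == categories:
            unusable.add(token)
        elif token in stop_words and len(cats) == 1:
            usable.add(token)
    return unusable, usable


def cleanse(text, unusable, usable, stop_words=STOP_WORDS):
    """Drop unusable tokens and ordinary stop words; keep usable noise."""
    kept = []
    for token in text.lower().split():
        if token in unusable:
            continue
        if token in stop_words and token not in usable:
            continue
        kept.append(token)
    return kept


unusable, usable = classify_noise(SAMPLES)
cleaned = cleanse("cannot enter password not working", unusable, usable)
```

With these sample records, "not" and "working" occur in all three categories and are removed, while ‘enter’ survives cleansing even though it is on the stop-word list, because it occurs only in the user access records. A conventional stop-word filter applied to the same text would have discarded ‘enter’ and kept "not working".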