Data cleansing is an important step in data mining, text analysis, and data classification. It is the process of removing noisy, incorrect, improperly formatted, and garbage data to achieve higher accuracy in categorizing data. However, determining whether a word or concept is noise or is important is very difficult because of the scale of the data involved.
A data classifier system may misclassify problem statements because of common words that occur across several categories. For example, if the word "crash" appears in a problem statement to be classified, the system may not be able to classify it properly, since "crash" may refer to a share market crash, a software crash, or an aeroplane crash. Because different categories can share such words, data classification by these systems becomes less precise.
Conventional approaches to data classification may not be accurate, as they are unable to detect such words that are common to different categories.
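To make the overlap problem concrete, the following minimal sketch finds words that occur in more than one category of a small labeled set. The sample data, the category names, and the function `words_shared_across_categories` are hypothetical and chosen only for illustration; this is not the method of any particular classifier, just one simple way to surface ambiguous words such as "crash".

```python
from collections import defaultdict

# Hypothetical labeled problem statements (illustrative data only).
SAMPLES = [
    ("finance", "share market crash wipes out gains"),
    ("software", "application crash on startup"),
    ("aviation", "aeroplane crash investigation report"),
    ("software", "login page fails to load"),
]


def words_shared_across_categories(samples, min_categories=2):
    """Return words seen in at least `min_categories` categories,
    mapped to the sorted list of categories they occur in."""
    categories_by_word = defaultdict(set)
    for category, text in samples:
        # Naive whitespace tokenization; a real system would also
        # handle punctuation, stemming, and stop words.
        for word in text.lower().split():
            categories_by_word[word].add(category)
    return {
        word: sorted(categories)
        for word, categories in categories_by_word.items()
        if len(categories) >= min_categories
    }


shared = words_shared_across_categories(SAMPLES)
print(shared)  # {'crash': ['aviation', 'finance', 'software']}
```

In this toy data set, "crash" is the only word that appears in more than one category, so on its own it says nothing about which category a statement belongs to.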