Data cleansing is an important step in data mining, text analysis, and data classification. It is the process of removing noisy, incorrect, improperly formatted, and garbage data to achieve higher accuracy in categorizing data. However, determining whether a word or concept is noise or is important is very difficult because of the scale of the data involved.
For example, in a system that classifies different types of news items, the word “crashed” could refer to a software crash, an airplane crash, or a building crash. However, if the news sources are all related to software, the meaning of the word is clear.
Conventional approaches to data cleansing may not be accurate, as they are unable to detect such words that are common within different domains.
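
The limitation described above can be illustrated with a minimal sketch. It is not a method taken from this text: the stop-word list, the sample headlines, the `domain_common_words` helper, and the 75% threshold are all hypothetical and chosen only for illustration. The sketch assumes that a "conventional" approach means a fixed, domain-independent stop-word list, and contrasts it with a simple per-corpus document-frequency count.

```python
from collections import Counter

# Hypothetical fixed stop-word list, standing in for a conventional approach.
GENERIC_STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "after"}

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def domain_common_words(docs, min_fraction=0.75):
    """Return non-stop-words that occur in at least min_fraction of the docs.

    The threshold is an illustrative assumption, not a recommended value.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(tokenize(doc)))
    threshold = min_fraction * len(docs)
    return {
        word
        for word, count in doc_freq.items()
        if count >= threshold and word not in GENERIC_STOP_WORDS
    }


# Made-up headlines from a single domain (software news).
software_news = [
    "The server crashed after the update",
    "Browser crashed on startup",
    "The app crashed during login",
    "Database crashed and the backup failed",
]

# Made-up headlines from mixed domains.
mixed_news = [
    "The server crashed after the update",
    "A plane crashed near the coast",
    "Stocks rallied on earnings",
    "The team won the final",
]

print(domain_common_words(software_news))
print(domain_common_words(mixed_news))
```

In this toy data, the fixed list never flags “crashed”, while the per-corpus count singles it out only in the software-only collection. Whether such a word should then be treated as noise or as an important term remains a judgment that depends on the classification task.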