1. Field of the Invention
The present invention relates to the field of data management systems. More specifically, the present invention pertains to a method and system for normalizing dirty text in a document.
2. Prior Art
In today's business environment, the importance of collecting data that reflects a company's business activities in order to achieve competitive advantage is widely recognized. Powerful systems for collecting data and managing it in large databases are in place in most large and mid-range companies and in many small companies. It is estimated that the amount of data stored in the world's databases doubles every twenty months. However, all of this data is useless without a method of filtering and organizing it into useful information.
Data mining is a technology developed to discover hidden patterns in data and to build models that predict future trends. It uses a variety of statistical analysis techniques to group instances of data into classes or patterns that are not readily apparent to the user. Users can, for example, discover demographic attributes of their customers that were not previously known, or predict future behavior based on previous patterns.
In order to analyze data accurately, the data must be standardized when it is entered into the database. Misspelled words can, for example, skew the data set, which in turn alters the outcome of a data mining query. One setting where this can be a serious problem is a customer support center. Here, a customer calls in when they have a problem with a product. Personnel at the support center work with the customer to resolve the problem. The support personnel usually fill out a log that records information about each call.
The support center personnel are often in a hurry to handle the volume of incoming calls and do not have time to edit their logs. Misspellings, typographical errors, ad hoc abbreviations, and joined words (known collectively as “dirty text”) are common problems in these call logs. If a company wants to examine these call logs to identify products with a history of service problems, or to determine what those problems are, it needs a system to clean up dirty text.
Accordingly, the need exists for a method of normalizing dirty text in documents before they are analyzed. Misspelled words and phrases, as well as ad hoc abbreviations, should be identified and replaced with correctly spelled, standardized terms within the documents. It is also desirable that this method be able to normalize documents in cases where standardized terms do not exist a priori and must be inferred from the corpus of documents.
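As a purely illustrative sketch of the need described above, and not the method of the present invention, the following Python fragment infers candidate standard terms from the corpus itself and maps rare look-alike words onto them. Every name in it (normalize_corpus, min_count, cutoff) is hypothetical, and two simplifying assumptions are made: a word that occurs at least min_count times is treated as a standard term, and similarity is measured with the string matcher in Python's standard difflib module.

```python
import re
from collections import Counter
from difflib import get_close_matches

WORD = re.compile(r"[A-Za-z]+")

def normalize_corpus(documents, min_count=3, cutoff=0.8):
    """Replace rare words with a similar frequent word from the same corpus.

    Assumption (not from the source text): frequent words are correct,
    and a rare word that closely resembles one is a dirty variant of it.
    """
    # Count every word in the corpus, ignoring case.
    counts = Counter(
        w.lower() for doc in documents for w in WORD.findall(doc)
    )
    # Words seen often enough are treated as inferred standard terms.
    standard = sorted(w for w, c in counts.items() if c >= min_count)

    # Map each rare word to its closest standard term, if one is close enough.
    mapping = {}
    for word, count in counts.items():
        if count >= min_count:
            continue
        match = get_close_matches(word, standard, n=1, cutoff=cutoff)
        if match:
            mapping[word] = match[0]

    def replace(m):
        # Words without a mapping are left exactly as written.
        return mapping.get(m.group(0).lower(), m.group(0))

    cleaned = [WORD.sub(replace, doc) for doc in documents]
    return cleaned, mapping

if __name__ == "__main__":
    logs = [
        "printer jams on startup",
        "printer will not power on",
        "printer driver missing",
        "printr jams again",
        "pritner offline",
    ]
    cleaned, mapping = normalize_corpus(logs)
    print(mapping)
    print(cleaned)
```

In this toy call log, "printer" is the only word frequent enough to count as a standard term, so the variants "printr" and "pritner" are rewritten to it while unrelated rare words such as "driver" are left alone. A frequency threshold of this kind is a crude stand-in for inferring standard terms from a corpus, and it does not address ad hoc abbreviations or joined words.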