1. Technical Field
The present invention relates generally to database normalization, online analytical processing and data mining. More specifically, it relates to a system and method for applying search and text analytics and text mining technology to automate and optimize data normalization for data mining and business intelligence applications.
2. Related Art
Almost all industries recognize the value of mining data from databases of electronic information. One of the critical challenges in the field of data mining involves providing high quality data. Often, an entity may have data stored in multiple databases, which store data in different formats, under different naming conventions, and for different purposes. Thus, the relevant input data to be mined typically comes from numerous database management systems (DBMSs) that were created over the years by different people and different organizations, using non-uniform values, names, schemas, platforms, database tools, etc. For example, a company that merges with another company to increase its customer base or the services it provides immediately faces the need to extend its business processes across the combined data of both companies. In order to effectively exploit the data in each of the databases, the data needs to be “normalized” into a common format, sanitized to eliminate errors, and disambiguated to eliminate redundancies, anomalies, polysemy (i.e., a single word having two or more meanings that are more or less “related” in accordance with some general principle), etc.
Traditionally, 70-80% of the costs associated with undertaking a large data mining program are associated with the tremendous effort required to sanitize, normalize and disambiguate the data before the mining heuristics can be applied and exploited. The quality of the data being mined (i.e., the input) directly affects the quality of the mining results (e.g., categorizations, predictions, trend analyses, etc.). Accordingly, without effectively “normalizing” the data that is going to be mined, there is little value in mining the data and, without an efficient solution, the costs are prohibitive.
Currently, normalization is accomplished using ETL (extract, transform and load) tools that rely on manually developing point-to-point (database to data warehouse) translations. Developing these translations requires deep insight into, and analysis of, each database being processed, as well as a full understanding of the existing and future applications that process the data. Thus, implementing ETL tools is an expensive and time-consuming process that requires a substantial amount of manual intervention. Frequently, the ETL process must be repeated to support new or modified business processes and applications.
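By way of illustration only, the manual point-to-point approach described above may be sketched as follows. This is a minimal sketch and not the method of any particular ETL tool; the source databases, field names (e.g., cust_nm, ClientName), value mappings and the translate function are all hypothetical and are assumed solely for the purpose of the example.

```python
# Hypothetical hand-written field rules, one table per source database.
# Each maps a source field name to the (assumed) warehouse field name.
CRM_A_RULES = {"cust_nm": "customer_name", "st": "state"}
CRM_B_RULES = {"ClientName": "customer_name", "Region": "state"}

# Hypothetical hand-maintained value mapping that reconciles
# non-uniform spellings of the same state.
STATE_VALUES = {"n.y.": "NY", "new york": "NY", "ny": "NY"}


def translate(record, field_rules):
    """Rename source fields to the warehouse schema and normalize values."""
    out = {}
    for source_field, value in record.items():
        target = field_rules.get(source_field)
        if target is None:
            continue  # no hand-written rule exists, so the field is dropped
        value = value.strip()
        if target == "state":
            # Fall back to the original value when no mapping is known.
            value = STATE_VALUES.get(value.lower(), value)
        out[target] = value
    return out


# The same customer, as stored in two differently designed databases.
row_a = translate({"cust_nm": " Acme Corp ", "st": "N.Y."}, CRM_A_RULES)
row_b = translate({"ClientName": "Acme Corp", "Region": "new york"}, CRM_B_RULES)
```

In this sketch, each additional source database requires a new hand-written rule table, and each previously unseen value requires a new mapping entry, which reflects the manual effort noted above.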
A second major challenge involves the fact that companies are often dealing with very large volumes of data. Many systems today can apply ETL and data mining technology to Gigabyte (10^9) and Terabyte (10^12) databases, but they cannot effectively support Petabyte (10^15) or Exabyte (10^18) systems. As data volumes are growing at extremely high rates, organizations are experiencing difficulties managing their data, let alone providing users with easy access to the data and exploiting (i.e., mining) it effectively and efficiently.
Accordingly, a need exists for an automated system for normalizing data, so that the data can be queried and mined.