The present invention relates in general to data processing systems and in particular to using computers to merge data from multiple sources into a common database.
A data warehouse is a central repository of multiple databases that include the historical data of a company or organization. Data warehouses contain large amounts of data that may be utilized to support management decisions. A data analyst may utilize a data warehouse to perform complex queries and analysis without slowing down other operational systems. A data warehouse is thus optimized for reporting and analysis to minimize query response times. Databases within a data warehouse therefore store data in a consistent, standardized format. Standardization is the process of checking and converting text and/or integer values in a data attribute to a predefined format or a set of predefined values. Before the data of a standardized attribute is stored in a common repository, the value in the attribute is compared against a set of rules that govern how the data must be formatted, and if necessary, the data is converted to fit the format defined by the rules.
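By way of illustration only, the check-and-convert step described above may be sketched as follows. The attribute names (`state`, `phone`), the alias table, the phone format, and the function names are hypothetical and are assumed solely for this example; they do not represent any particular tool's rule set.

```python
import re
from typing import Callable, Dict

# Hypothetical set of predefined values for a "state" attribute.
STATE_ALIASES: Dict[str, str] = {
    "texas": "TX", "tx": "TX",
    "new york": "NY", "ny": "NY",
}


def standardize_state(value: str) -> str:
    """Convert a state name or abbreviation to one of the predefined codes."""
    key = value.strip().lower()
    if key not in STATE_ALIASES:
        raise ValueError(f"no standardization rule matches state {value!r}")
    return STATE_ALIASES[key]


def standardize_phone(value: str) -> str:
    """Convert a phone number to the assumed predefined format NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", value)  # drop every non-digit character
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits in phone {value!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


# Rule set: standardized attribute name -> rule that checks and converts it.
RULES: Dict[str, Callable[[str], str]] = {
    "state": standardize_state,
    "phone": standardize_phone,
}


def standardize_record(record: Dict[str, str]) -> Dict[str, str]:
    """Apply the matching rule to each standardized attribute of one record.

    Attributes without a rule are passed through unchanged.
    """
    return {
        attr: RULES[attr](value) if attr in RULES else value
        for attr, value in record.items()
    }


if __name__ == "__main__":
    raw = {"name": "Ann", "state": " Texas ", "phone": "(512) 555-0100"}
    print(standardize_record(raw))
```

In this sketch, each record is converted before it would be written to the common repository, and a value that matches no rule is rejected rather than stored in a non-standard form.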
Tools and code that perform standardization of values being entered into a database are typically configured to be aware of multiple data attributes and to transform one record at a time to a standardized value based on the dependencies between component values. Because these standardization operations are serialized, their cumulative cost can add significant time to data loading operations. Also, data that is standardized with Extract-Transform-Load (ETL) tools, which are typically located “outside” an application, can create maintenance problems if attempts are made to standardize values at multiple locations, since the standardization checkpoints at each location can get out of sync. Furthermore, it may be problematic to share a standardization “rule set” among multiple applications in a data warehousing environment. When the standardization rules change, older, previously stored data must also be updated. Conventional ETL tools also increase processing overhead by applying standardization rules to all incoming data, regardless of whether the data originated from a “trusted” source.