The present invention relates generally to the generation of a logical data model (LDM) and a physical data model (PDM) of a data base, and more particularly, to a generalized LDM and corresponding tool for maintaining correspondence between the LDM and PDM. The present invention also relates to using a physical database design and data description language created from the PDM to construct a data warehouse.
The manual creation and population, or loading, of a data warehouse with data from varied data sources is known. Also known are automatic routines for loading data into a data warehouse, as described below. Prior approaches required duplication of effort and data to create new LDMs, new PDMs, new extraction routines, and new loading routines for every customer. Upwards of 80% of the time is spent creating and recreating LDMs, PDMs, and extraction and loading routines under prior approaches.
A prior approach for creating and populating a data warehouse using LDMs and PDMs is illustrated by the flow chart, generally designated by reference numeral 100, in FIG. 1. A description of the prior approach as shown in FIG. 1 follows. Beginning at step 102, a subject matter expert (SME) provides business rules to a modeler.
The SME, or domain expert, is a person with intimate knowledge of the particular area or topic for which the data warehouse is being constructed. The modeler is a person knowledgeable about transforming or creating representations of business rules in a data model. Business rules express the constraints on business operations in the real world.
In step 104, the modeler converts the business rules, received from the SME in step 102, into a logical data model using a data modeling tool, such as ERwin available from Platinum Technology, Inc. Other similar data modeling tools include PowerDesigner available from Sybase, Inc. and System Architect available from Popkin Software and Systems. After the modeler has converted the business rules into data structures in the LDM in step 104, the modeler produces a PDM using a data modeling tool, such as ERwin, in step 106.
The PDM of step 106 is produced in accordance with the LDM of step 104, regardless of intervening changes to the LDM occurring between the end of step 104 and the completion of step 106. In other words, changes made to the LDM while step 106 is in progress are not necessarily reflected in the final PDM of step 106.
In step 108, a data information analyst (DIA) determines data sources for a necessary attribute list (NAL). The NAL is developed by the DIA based on the DIA""s domain knowledge and the LDM. The NAL includes those fields required by the system in order to operate properly and answer the business-related questions of the user.
In response to the NAL determination of step 108 and the PDM production of step 106, the DIA produces a cross referenced source-target data list in step 110. The source-target data list produced in step 110 cross references data sources for the NAL, identified in step 108, to fields or objects in the PDM.
After the DIA produces the source-target data list, an extract specialist determines a method of operating on a chosen source system to obtain formatted data in step 112. The source system includes the data to be converted and moved into the data warehouse. The extract specialist causes the data from the source system to be converted to the data format needed by the loading effort.
The loading effort is performed by any one of several available utilities for populating the data warehouse. Some example utilities include FastLoad, MultiLoad, and Tpump available from the NCR Corporation.
The FastLoad utility loads empty tables in the data warehouse with data collected from channel and network attached clients. FastLoad can be executed from any client platform, mainframe, UNIX(copyright) or Windows NT(copyright) system. Data splitting, data conversion, data movement and data loading are all automatic and are all performed in parallel. FastLoad is fully restartable, enabling the data load process to automatically begin loading from the last transient journal checkpoint simply by re-executing the job.
The MultiLoad utility or MLOAD handles loading, updating and deleting data in existing tables as a single, integrated process with the MultiLoad utility. MultiLoad handles extremely large data volumes requiring updates and deletes. MultiLoad enables rapid acquisition of host data with rapid block-level application of data into tables. Large inserts, deletes, updates, and upserts against empty or populated tables and data maintenance on multiple tables are performed simultaneously. MultiLoad executes on a variety of client platforms, in a fail safe mode and is fully recoverable with an inherent checkpoint design.
The TPump utility handles loading near real-time data into the data warehouse. TPump inserts, updates and deletes data in the data warehouse in a continuous fashion. TPump uses row hash locks instead of table level locks thereby enabling users to run queries while TPump is working.
Once the method for obtaining formatted data is determined in step 112, the formatted data is extracted from the source system and communicated to a database management system in step 114, e.g., the Teradata Relational Database Management System (RDBMS) available from the NCR Corporation.
In step 116, the cleansing, transforming, and householding specification is generated by the DIA. The cleansing routines selected by the DIA eliminate or fix bad records from the data, for example eliminating records without values in required fields. The transforming routines deal with changing the data type, math, and semantics of the data. Householding routines perform filtering of records to appropriately group together similar records. For example, householding routines are able to group together many services provided to a single user and identify them in relation to the user. In this manner, using a telephone example, even though a person may use differing telephone services (long distance, calling card, local and toll calls) all of the services are identified as related to the same individual.
Execution of the cleansing, transforming, and householding operations on the data in the database management system occurs in step 118.
In step 120, the data is converted into loadable format for eventual loading into the data warehouse using either the MLOAD, FastLoad, or TPump utilities of the Teradata RDBMS.
Construction of the data warehouse occurs in step 122 using SQL code generated by the ERwin model from step 106. In this step, the tables and fields of the data warehouse corresponding to the PDM are instantiated within the database management system.
Once the data warehouse has been constructed in step 122 and the data converted into loadable format in step 120, the data warehouse is loaded with the data in step 124 using either the MLOAD, FastLoad, or TPump utilities. Since much of the information required for creation and population of the data warehouse is not stored in the database management system, the manual and human intensive steps making up flowchart 100 must be repeated for each and every iteration or modification of the data warehouse. This results in a large amount of duplicated data and work effort for each iteration.
Currently, the information required for creation and population of the data warehouse is stored on paper or in the knowledge-base of the individual worker. This can be problematic if changes are necessitated and either the worker is no longer available or the papers are not available or are disorganized. In either situation, reproducing or modifying the data warehouse is difficult. Therefore, there is a need in the art to achieve reproducibility of the creation and population of a data warehouse.
Accordingly, an object of the present invention is to decrease the amount of effort and duplication of data and work effort during the creation and population of a data warehouse.
Another object of the present invention is to be able to reproduce the creation and population of a data warehouse automatically.
The above and still further objects, features and advantages of the present invention will become apparent upon consideration of the following detailed description of several specific embodiments thereof, especially when taken in conjunction with the accompanying drawings.