1. Field of the Invention
The present invention relates to database systems. More particularly, the present invention pertains to apparatus and methods for supervision of data extraction, transformation, and loading (ETL) procedures in data warehousing and data mart applications.
2. Description of Related Art
A data warehouse, as is known in the art, is a central repository for all or significant parts of the data that an enterprise's various business systems collect. Typically, a data warehouse is housed on an enterprise server computing system. Data from various processing applications and other sources is selectively extracted and organized on the data warehouse database for use by analytical applications and user queries. Data extracted and organized from a data warehouse is often placed in a data mart. A data mart is a repository of data gathered from operational data and other sources such a data warehouse that is designed to serve a particular group of people requiring the information.
It is known in the art that in managing databases, extract, transform, and load (ETL) refers to three separate functions combined into a single program procedure. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data—using rules or lookup tables, or creating combinations with other data—to convert it to the desired state. Finally, the load function is used to write the resulting data (either all of the subset or just the changes) to a target database, which may or may not previously exist.
ETL procedures can be used to acquire a temporary subset of data for reports or other purposes, or a more permanent data set may be acquired for other purposes such as: the population of a data mart or data warehouse; conversion from one database type to another, and the migration of data from one database or platform to another.
The definitions of the data warehouse, data mart and ETL procedures were found 7/3/2002, at www.searchDatabase.com.
Refer now to FIG. 1 for a description of a database processing system of the prior art. Programming applications 10a, 10b, . . . 10n are executed by computer processors to create data that is stored in specific databases in database storage units 15a, 15b, . . . , 15n. The programming applications 10a, 10b, . . . 10n are applications such as On-Line Transaction Programs (OLTP), where customers are able to create data such as purchases or sale order entry. The ETL procedures 25a, 25b, 25c, . . . , 25n are executed by the computer processors to remove data from the databases in the database storage units 15a, 15b, . . . , 15n, transform the data to the desired state and format, and loading of the transformed data to the data warehouse 40.
Historical data generated by older computer systems is maintained on a “legacy database” in a separate database storage unit 20. The legacy database similarly is extracted, transformed, and loaded by the legacy ETL procedure 30 to the data warehouse 40 for integration with the currently generated data from the program applications 10a, 10b, . . . 10n.
In order to prevent the database information within the database storage units from being modified and potentially corrupted, an operational data storage (ODS) staging unit 35 is employed by the ETL procedures 25a, 25b, 25c, . . . , 25n, and 30 as a staging database retention device. The ETL procedures 25a, 25b, 25c, . . . , 25n, and 30 perform the transformations of the data from the database storage units 15a, 15b, . . . , 15n, and 20 prior to placement of the data to the data warehouse 40.
The ETL procedures 25a and 25b illustrate that multiple ETL procedures may be performed on a database from a single database storage unit 15a, 15b, . . . , 15n,. The transformation portion of the ETL procedure performs different conversions to the database for placement in the data warehouse 40. To improve the quality of the database or segment of the database being extracted from each of the database storage units 15a, 15b, . . . , 15n, a cleansing process is executed to eliminate any inconsistencies, redundancies, and corruptions from the data being extracted. The cleansing process may be a separate program executed by the computer processor or a sub-procedure of the ETL procedure 25a, 25b, . . . , 25n. 
The data marts 50a, 50b, . . . 50n are databases within database storage units that are created to serve a particular group requiring an individualized database. The ETL procedures 45a, 45b, . . . , 45n create and maintain of the data marts 50a, 50b, . . . 50n by extracting the necessary data from the data warehouse 40, transforming the data to meet the particular group requirements, and storing the data in databases 50a, 50b, . . . 50n retained in database storage units.
As is apparent, large enterprises may have many databases stored on database storage units 15a, 15b, . . . , 15n, 20, and 50a, 50b, . . . 50n . To create and maintain one or more data warehouses, many ETL procedures 25a, 25b, 25c, . . . , 25n are executed periodically to extract, transform, and load the data from the data storage units 15a, 15b, . . . , 15n to the data warehouse 40. The ETL procedures 45a, 45b, . . . , 45n then create and maintain the data within the data marts 50a, 50b, . . . 50n . The quality, effectiveness, and success of each of the ETL procedures 25a, 25b, 25c, . . ., 25n, 30, 45a, 45b, . . . , and 45n is monitored and the success or failure communicated individually. There is no central repository of the status of the ETL procedures 25a, 25b, 25c, . . . , 25n, 30, 45a, 45b, . . . , and 45n as a whole.
U.S. Pat. No. 6,208,990 (Suresh, et al.) teaches a computer software architecture to automatically optimize the throughput of data extraction/transformation/loading (ETL) process in data warehousing applications. This architecture has a component aspect and a pipeline-based aspect. The component aspect refers to the fact that every transformation used in this architecture is built up with transformation components selected from an extensible set of transformation components. Besides simplifying source code maintenance and adjustment for the data warehouse users, these transformation components also provide these users the building blocks to effectively construct pertinent and functionally sophisticated transformations in a pipelined manner. Within a pipeline, each transformation component automatically stages or streams its data to optimize ETL throughput. Furthermore, each transformation either pushes data to another transformation component, pulls data from another transformation component, or performs a push/pull operation on the data. Thereby, the pipelining; staging/streaming; and pushing/pulling features of the transformation components effectively optimizes the throughput of the ETL procedure.
U.S. Pat. No 6,189,004 (Rassen, et al.) demonstrates a method and apparatus for creating a data mart and for creating a query structure for the data mart. The method automatically defines a query interface for a data mart. The data mart includes fact and dimension tables. The method comprises accessing a schema description and a query interface description for the data mart. The schema description specifies a schema, which in turn, defines the relationships between the fact tables and dimension tables of the data mart. The query interface description specifies the fields, related to the schema description, which can be used in a query and the way in which results are to be presented to the user. The fields correspond to columns and rows in the fact tables. The schema description is used to create a first set of commands to create and populate the fact and dimension tables. Additionally, a second set of commands to create the query interface is created. Some commands of the first set of commands are executed causing the creation and population of the tables. Some commands of the second set of commands are executed causing the creation of a user interface. A query is generated using the user interface. The query is sent to the system for processing. The results of the query are presented to the user according the second set of commands.
U.S. Pat. No. 6,151,608 (Abrams) describes a method and system for migrating data from one or more ASCII files and/or from one or more relational databases to one or more relational database tables without the need to write code. Abrams allows the user to define mapping templates and conditionals to assist in translating and transforming data values. Referential integrity, data dependencies, order of operations, and uniqueness constraints are enforced using a predefined set of migration rules templates that are based on the principles of relational design. The mapping and migration rules templates are used to intelligently generate instructions for updating or populating relational database destination tables. The instructions control the data transfer, data translation, data transformation, data validation, foreign key insertion, and the addition of required codes and flags in the destination tables. A migration engine of the system includes a data map architect and an update processor, which spawns the templates and migrates the data dynamically, utilizing the data definitions for the destination tables. In addition to supporting conversion efforts, Support for performing consolidation, restoration from an archive, migration to new instances, upgrading to a new release, adding additional features and enhancements, and changing business requirements is provided. The audit trail is complete enough that the entire migration process can be reversed without compromising the integrity of either the source or the destination application.
“How to Select an Extraction/Transformation/Loading (ETL) Tool”, Mimno, 101communications LLC, Chatsworth, Calif. 91311, found 6/27/2002 www.101dataintelligencesolutions.com, provides techniques for selecting an appropriate ETL tool. The ETL tool operates at the heart of the data warehouse, extracting data from multiple data sources, transforming the data to make it accessible to business analysis, and loading multiple target databases. Unlike other components of a data warehousing architecture, it's extremely difficult to switch from one ETL tool to another. Because of a lack of standards, data definitions and transformation rules cannot be moved from one tool to another. If the selected ETL tool fails in production, it is the one component of the architecture that isn't easily replaceable.