1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, systems, and products for scheduling for data warehouse ETL (extract, transfer, and load) processing and data mining execution.
2. Description of Related Art
A data warehouse is a collection of data designed to support management decision making. Data warehouses contain a wide variety of data intended to support development of management reports or data mining models depicting business conditions at a particular point in time. Data warehouse systems include systems to extract data from business applications and a warehouse database system to which the extracted data is transferred and loaded in an organized fashion so as to provide business managers useful access to the data. Data warehouses generally combine data extracted from many different databases across an entire enterprise. The data processing requirements of extracting data from many databases, transferring it across an enterprise, and loading it meaningfully into a data warehouse are typically large.
Data warehouse systems typically carry out many ETL (Extract, Transfer and Load) processing steps to populate data warehouse tables. In order to populate the data warehouse tables properly, the execution of these steps is typically linked in a predefined order. In typical data warehouse systems, a warehouse scheduler (“ETL scheduler”) starts execution of a first ETL step, and the remaining steps are then executed according to the predefined sequence.
ETL processing in data warehouse systems is typically implemented as a combination of a fixed schedule of ETL processing steps and external programs for carrying out the ETL processing steps. An ETL scheduler reads the steps from a schedule and calls the external programs in sequence according to the schedule. For most data warehouse systems, the execution of a next step in ETL processing is based on a limited set of predefined conditions, such as on-success, on-failure or on-completion of the previous steps. There is no flexibility in how the execution sequence and execution frequency can be controlled by the characteristics of the external programs. All linked steps must be executed at the same frequency. However, in some cases, some steps may need to be executed less frequently than the overall schedule itself. Some steps may depend on the schedule of other steps, or some steps may need to be executed based upon some other external conditions.
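As a rough illustration of the conventional arrangement described above, the following Python sketch models a fixed schedule whose steps are linked only by on-success, on-failure, or on-completion conditions. It is a minimal sketch under stated assumptions, not a description of any particular scheduler: the names Step, run_schedule, and DAILY_SCHEDULE and the step names are hypothetical, and each external program is replaced by a stand-in function that returns True on success.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One ETL step in a fixed schedule (hypothetical structure)."""
    name: str
    program: Callable[[], bool]          # stand-in for an external program; True = success
    on_success: Optional[str] = None     # next step if the program succeeds
    on_failure: Optional[str] = None     # next step if the program fails
    on_completion: Optional[str] = None  # next step regardless of outcome


def run_schedule(steps: Dict[str, Step], first: str) -> List[str]:
    """Run linked steps starting at `first`; return the names executed, in order."""
    executed: List[str] = []
    current: Optional[str] = first
    while current is not None:
        step = steps[current]
        succeeded = step.program()
        executed.append(step.name)
        # The only control available is the outcome of the previous step.
        if succeeded and step.on_success is not None:
            current = step.on_success
        elif not succeeded and step.on_failure is not None:
            current = step.on_failure
        else:
            current = step.on_completion
    return executed


def always_succeeds() -> bool:
    """Placeholder for an external ETL program (assumption for illustration)."""
    return True


# Hypothetical daily schedule: every linked step runs on every invocation.
DAILY_SCHEDULE: Dict[str, Step] = {
    "extract": Step("extract", always_succeeds, on_success="load_mining_data"),
    "load_mining_data": Step("load_mining_data", always_succeeds, on_success="train_model"),
    "train_model": Step("train_model", always_succeeds, on_success="post_process"),
    "post_process": Step("post_process", always_succeeds),
}

if __name__ == "__main__":
    print(run_schedule(DAILY_SCHEDULE, "extract"))
```

In this sketch, every run of the schedule executes every linked step, because the only inputs to the sequencing decision are the outcomes of the preceding steps. Nothing in the schedule lets an individual step run less often than the schedule as a whole.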
For example, the execution of data mining operations may need a schedule different from a default daily ETL schedule. A step for data mining model training is usually very time consuming. Default daily scheduling of ETL processing for model training can represent a large processing burden with respect to the overall ETL schedule. It can be useful to carry out mining training only weekly or biweekly. An ETL step for loading mining data may not need to be executed if mining training and mining apply are not to be executed. If a model training step is not executed, then a post-processing step for developing a mining model result also may not need to be executed. However, this kind of scheduling control cannot be achieved by typical prior art warehouse ETL scheduling systems.
It would be advantageous to have improved methods of providing flexibility in the scheduling of ETL processing steps for data warehouse systems. In addition, because of the complexity of data warehouse systems, it would also be advantageous if additional flexibility could be provided with a reduced need, or, even better, no need, to modify existing scheduling systems.