Extract-Transform-Load (ETL) tools are special purpose software artifacts used to populate a data warehouse with up-to-date records from one or more sources. To perform this task, a set of operations is applied on the source data. Many existing ETL tools organize such operations as a workflow.
Unlike structured query language (SQL), which is declarative in nature, ETL workflows are procedural and specify the sequence of steps to transform the source tables into the target warehouse. Many existing ETL tools provide support for the design of ETL workflows, but provide no support for optimization of such workflows. The efficiency of the ETL workflow thus depends, to a large extent, on the skill and domain knowledge of the workflow designer. Also, typically, an ETL workflow is designed once and executed periodically to load new data. Further, an ETL workflow can also degrade over time due to the changing nature of the data.
Accordingly, a need exists to optimize and re-order a workflow as needed. Further, there exists a need to develop techniques that can enable cost-based optimization of workflows when input statistics are missing.