Extract, Transform and Load (ETL) refers to a process in which data is extracted from a data source, transformed in accordance with specified criteria (e.g., to fit a specified business need and/or quality level), and then loaded into a target (e.g., a data warehouse). ETL tasks are growing in complexity and therefore require increasing levels of computational support.
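By way of illustration only, the three ETL stages described above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the in-memory CSV source, the numeric-amount quality rule, the `sales` table, and the function names `extract`, `transform`, `load` and `run_etl` are all hypothetical and chosen purely for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; stands in for any data source.
SOURCE_CSV = "id,amount\n1, 10.50\n2,not_a_number\n3,7\n"


def extract(text):
    """Extract: read raw rows from the (assumed) CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: apply an assumed quality rule (amount must be numeric)."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["id"]), float(row["amount"])))
        except ValueError:
            continue  # drop rows that fail the quality rule
    return clean


def load(records, conn):
    """Load: write transformed records into the target (here, SQLite)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages in sequence and return the loaded rows."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()


if __name__ == "__main__":
    # An in-memory database stands in for a data warehouse.
    print(run_etl(SOURCE_CSV, sqlite3.connect(":memory:")))
```

In this sketch the stages run strictly one after another in a single process, which is the simplest arrangement and also the one that scales least well as ETL tasks grow in complexity.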
It is desirable to process the data flows associated with an ETL task in parallel to improve computational efficiency. However, determining how to process those data flows in parallel is difficult.
Accordingly, it would be advantageous to provide a technique for dividing an ETL dataflow task into sub-tasks for execution on distributed resources. Ideally, such a technique would account for contextual information, such as cache resources, inter-process communication requirements and staging requirements.