The present disclosure relates to computer software, and more specifically, to computer software to implement task-based modeling for parallel data integration.
An existing parallel execution environment for data integration may take one of four configurations: symmetric multiprocessing (SMP), massively parallel processing (MPP), cluster, or grid. An SMP environment contains one physical server, while the other parallel execution environments support multiple physical servers. Those servers may be further configured to serve as compute nodes, I/O nodes, or database nodes.
A distributed computing environment can be set up at a much larger scale, with hundreds to thousands of servers. To run parallel jobs in a distributed computing environment, the parallel computing engine must be integrated with the distributed computing engine, as each may have its own specific run model. The parallel engine supports the process-based model, while the distributed engine supports the task-based model. In the task-based model, a data flow consists of subflows, and each subflow is considered a task. A task may be run through one or more processing units determined by the process-based model. When attempting to run parallel jobs in the distributed computing environment, a task-based execution plan based on a process execution plan must be created.
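For illustration only, the relationship between the two run models may be sketched as follows. The sketch assumes that each process in the process execution plan is already labeled with the subflow to which it belongs; the names `Process`, `Task`, and `build_task_plan`, and the sample subflow names, are hypothetical and are not taken from any particular parallel or distributed engine.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Process:
    """One processing unit in the process execution plan (hypothetical)."""
    operator: str   # operator the process runs, e.g. "sort"
    partition: int  # data partition the process handles
    subflow: str    # subflow of the data flow that the operator belongs to


@dataclass
class Task:
    """One task in the task-based plan: a subflow plus its processes."""
    subflow: str
    processes: list[Process] = field(default_factory=list)


def build_task_plan(process_plan: list[Process]) -> list[Task]:
    """Group processes by subflow so that each subflow becomes one task."""
    tasks: dict[str, Task] = {}
    for proc in process_plan:
        # dicts preserve insertion order, so tasks appear in the order
        # in which their subflows are first encountered
        tasks.setdefault(proc.subflow, Task(proc.subflow)).processes.append(proc)
    return list(tasks.values())


# Assumed sample plan: three subflows, two of them running on two partitions.
process_plan = [
    Process("read", 0, "extract"),
    Process("read", 1, "extract"),
    Process("sort", 0, "transform"),
    Process("sort", 1, "transform"),
    Process("write", 0, "load"),
]

task_plan = build_task_plan(process_plan)
for task in task_plan:
    print(task.subflow, len(task.processes))
```

In this sketch, each resulting task carries the one or more processing units that the process-based model assigned to its subflow. A real integration would also need to account for dependencies between tasks and for resource requirements, which are omitted here.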