The present disclosure relates to computer software. More specifically, embodiments disclosed herein relate to deploying parallel data integration applications to distributed computing environments.
An existing parallel execution environment for data integration may take one of four configurations: symmetric multiprocessing (SMP), massively parallel processing (MPP), cluster, or grid. An SMP configuration contains one physical server, while the other parallel execution environments support multiple physical servers. Those servers may be further configured to serve as compute nodes, I/O nodes, or database nodes. With limited available resources, jobs (data integration applications) cannot always run whenever they need to; they must be scheduled via a job scheduler or managed via a workload manager to share system resources so as to prevent the system from being overloaded. The end result is that jobs may have to spend time waiting on resources, which delays their end-to-end execution cycles.
A distributed computing environment can be set up on a much larger scale, with hundreds to thousands of servers. The challenge of running parallel jobs in a distributed computing environment is how to integrate the parallel engine with the distributed engine, as each may have its own specific run model. The parallel engine supports the process-based model, while the distributed engine supports the task-based model. One solution is to develop a high-level abstraction layer which encapsulates the different run mechanisms. This layer is responsible for detecting which execution engine is used at run time. If it is the parallel engine, this layer invokes the parallel run mechanism; otherwise, this layer invokes the distributed run mechanism. The problem with this solution is that it requires maintaining two sets of libraries for the same processing logic: the processing logic of each parallel operator must be re-implemented using the APIs provided by the distributed run mechanism. This requires substantial time and effort for development and maintenance. For example, to implement data integration on a distributed engine, one would have to implement transformation, data aggregation, and joins using the distributed APIs.
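The abstraction-layer approach described above, and the duplication it entails, can be illustrated with a minimal sketch. All names here (`Engine`, `RunLayer`, `sum_by_key_parallel`, `sum_by_key_distributed`) are hypothetical and are not taken from any particular parallel or distributed engine; plain in-process Python stands in for both run mechanisms, and a simple sum-by-key aggregation stands in for an operator's processing logic.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]


class Engine(Enum):
    """Hypothetical identifiers for the two execution engines."""
    PARALLEL = "parallel"
    DISTRIBUTED = "distributed"


def sum_by_key_parallel(rows: Iterable[Row]) -> Dict[str, int]:
    """Aggregation as it might be written for a process-based engine (stand-in)."""
    totals: Dict[str, int] = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def sum_by_key_distributed(rows: Iterable[Row]) -> Dict[str, int]:
    """The same aggregation re-implemented in a task-based, map/reduce style (stand-in)."""
    # "Map" phase: group the values under their keys.
    grouped: Dict[str, List[int]] = {}
    for key, value in rows:
        grouped.setdefault(key, []).append(value)
    # "Reduce" phase: collapse each group to a single total.
    return {key: sum(values) for key, values in grouped.items()}


class RunLayer:
    """Abstraction layer that selects a run mechanism based on the engine in use."""

    def __init__(self, engine: Engine) -> None:
        self.engine = engine

    def sum_by_key(self, rows: Iterable[Row]) -> Dict[str, int]:
        if self.engine is Engine.PARALLEL:
            return sum_by_key_parallel(rows)
        return sum_by_key_distributed(rows)
```

In this sketch the caller sees a single `sum_by_key` entry point, but every operator still exists twice behind it, and the two implementations must be kept consistent by hand. That is the maintenance burden the paragraph above describes.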
Accordingly, it is necessary to find an efficient solution that can be used to deploy parallel data integration applications to a distributed computing environment without re-implementing core data processing logic such as data transformation, aggregation, pivoting, and joining.