The present invention relates to job design for parallel data integration, and more particularly to automatic and dynamic creation of configurations for parallel data integration jobs.
Data integration in a parallel computing environment typically involves computations on large data sets, which may exceed thousands of gigabytes, distributed among computers or partitions. (Partitions may correspond to the computers in some instances and may exist as computer subdivisions in other instances.) Such computations consume both processing and storage resources.
Regarding the storage resources, the data sets include i) one or more originating data sets (also known as “data sources”), which consume disk space that may be associated with respective originating computers, and ii) one or more destination data sets (also known as “data sinks”), which consume computer disk space that may be associated with respective destination computers. Such data sets are for long term storage, which may be referred to as “permanent” storage.
In addition to consuming long term storage for originating and destination data sets, data integration requires temporary storage. That is, while data sets are being processed, some operations require memory, e.g., disk space, to store intermediate results. (This memory to store intermediate results of processing operations may be called “scratch space” or “intermediate memory.”) For example, a sorting operation may require memory for storing intermediate results during the time the sort is executing.
In a parallel computing environment, various operations on data sets can be performed across different computers, and the operational steps for processing the data sets can be intertwined and complicated. Processing steps, i.e., data processing flow, to be performed on one or more data sets may be described by a “data flow graph” (also referred to, more simply, as a “data graph”), which may be a pictorial or string representation of relationships among operations in the data processing flow. A data graph describes data sources, operations to be performed, and data sinks. The data flow graph may be coupled with other information, for example, a configuration file describing which operations are performed on which partitions, and a data schema describing the input and output of each operator.
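By way of illustration only, a data graph and an accompanying configuration might be sketched in Python as follows. The step names, partition names, and helper functions here are hypothetical and are not drawn from any particular data integration product; the sketch merely shows how sources, operators, and sinks can be recovered from the graph and how a valid processing order can be derived.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical data graph: each key is a step, and each value is the set of
# steps whose output that step consumes.
DATA_GRAPH = {
    "customers_src": set(),
    "orders_src": set(),
    "sort_orders": {"orders_src"},
    "join": {"customers_src", "sort_orders"},
    "report_sink": {"join"},
}

# Hypothetical configuration: the partitions on which each operator runs.
CONFIG = {
    "sort_orders": ["node1", "node2"],
    "join": ["node1", "node2"],
}


def classify(graph):
    """Split the steps of a data graph into sources, operators, and sinks."""
    consumed = set().union(*graph.values())
    # A source consumes nothing; a sink is consumed by nothing.
    sources = sorted(n for n, deps in graph.items() if not deps)
    sinks = sorted(n for n in graph if n not in consumed)
    operators = sorted(n for n in graph if n not in sources and n not in sinks)
    return sources, operators, sinks


def execution_order(graph):
    """Return one order in which every step follows the steps it consumes."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    sources, operators, sinks = classify(DATA_GRAPH)
    print("sources:", sources)
    print("operators:", operators)
    print("sinks:", sinks)
    for step in execution_order(DATA_GRAPH):
        print(step, "->", CONFIG.get(step, "not partitioned in this sketch"))
```

In this sketch the graph and the configuration are kept as separate structures, mirroring the description above in which the configuration file is coupled with, rather than embedded in, the data flow graph.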
Based on the above, it should be appreciated that resource utilization may include data source and sink storage space, scratch space, and processor utilization.
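As a rough illustration of these resource categories, per-partition utilization might be tallied as in the following Python sketch. The class name, field names, and figures are assumptions made for this example only and do not reflect measured data or any specific system.

```python
from dataclasses import dataclass


@dataclass
class PartitionUsage:
    """Hypothetical record of the resources one partition consumes."""
    source_bytes: int = 0     # long term storage for data sources
    sink_bytes: int = 0       # long term storage for data sinks
    scratch_bytes: int = 0    # temporary storage for intermediate results
    cpu_seconds: float = 0.0  # processor utilization

    @property
    def disk_bytes(self) -> int:
        # Total disk footprint: permanent storage plus scratch space.
        return self.source_bytes + self.sink_bytes + self.scratch_bytes


def total_usage(per_partition):
    """Sum the usage records of all partitions into a single record."""
    total = PartitionUsage()
    for usage in per_partition.values():
        total.source_bytes += usage.source_bytes
        total.sink_bytes += usage.sink_bytes
        total.scratch_bytes += usage.scratch_bytes
        total.cpu_seconds += usage.cpu_seconds
    return total


if __name__ == "__main__":
    # Illustrative figures only.
    usage = {
        "node1": PartitionUsage(source_bytes=100, scratch_bytes=50, cpu_seconds=2.0),
        "node2": PartitionUsage(sink_bytes=80, scratch_bytes=20, cpu_seconds=1.5),
    }
    job = total_usage(usage)
    print(job.disk_bytes, job.cpu_seconds)
```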