The present invention relates to computer systems, and more specifically to declarative specification of data integration workflows for execution on parallel processing platforms.
MapReduce is an example of a software framework used to define and execute data integration workflows on parallel processing platforms. MapReduce processes large datasets to solve certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster if all nodes use the same hardware or as a grid if the nodes use different hardware. Computational processing occurs on data stored either in a filesystem (unstructured) or within a database (structured). In a map step of a MapReduce framework, a master node receives input, partitions the input into smaller sub-problems, and distributes the sub-problems to slave nodes. In a reduce step, the answers to the sub-problems are combined in some way to produce the output (i.e., the answer to the problem that the framework was originally trying to solve).
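The map and reduce steps described above can be illustrated with a minimal single-machine sketch. It is not drawn from any particular MapReduce implementation: the function names (`partition`, `solve_subproblem`, `combine`, `run`) are hypothetical, a thread pool merely stands in for the slave nodes, and summation is assumed as the example problem.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_parts):
    """Master node: split the input into roughly equal sub-problems."""
    size = max(1, -(-len(data) // num_parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def solve_subproblem(chunk):
    """Worker node: answer one sub-problem (here, a partial sum)."""
    return sum(chunk)


def combine(partial_answers):
    """Reduce step: merge the sub-problem answers into the final output."""
    return sum(partial_answers)


def run(data, num_workers=4):
    chunks = partition(data, num_workers)
    # Threads are an assumption made for illustration; in a real
    # deployment each chunk would be sent to a separate node.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(solve_subproblem, chunks))
    return combine(partials)
```

For instance, `run(list(range(1, 101)))` partitions the integers 1 through 100 into four chunks, sums each chunk independently, and combines the four partial sums.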
An example of a MapReduce framework is Hadoop, which includes a programming model and an associated implementation for processing large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges the set of intermediate values associated with the same intermediate key. An advantage of using a MapReduce framework is that it allows for distributed processing of the map and reduce operations. Mapping operations are independent of each other, and thus all of the map functions can, in principle, be performed in parallel, although in practice this is often limited by the data source and/or the number of central processing units (CPUs). MapReduce is used by very large server farms to sort through petabytes of data in a relatively short period of time. The parallelism supported by MapReduce also allows for recovery from the partial failure of servers or storage during the operation. For example, if one mapper or reducer fails, its work is rescheduled (assuming that the input data is still available).
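The key/value programming model and the rescheduling of failed work can be sketched as follows. This is an illustrative, sequential example under stated assumptions and not Hadoop's actual API: the names `map_fn`, `reduce_fn`, `run_with_retry`, and `map_reduce` are hypothetical, word counting is assumed as the example problem, and a simple retry loop stands in for a framework's task rescheduling.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (key, value) input pair and emit intermediate pairs.
    Here key is a document name (unused) and value is the document text."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one key."""
    return key, sum(values)


def run_with_retry(task, args, max_attempts=3):
    """Rerun a failed task; assumes its input is still available."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt


def map_reduce(inputs, mapper=map_fn, reducer=reduce_fn):
    # Map phase: each call is independent, so a real framework could run
    # the calls in parallel; they run sequentially here for clarity.
    grouped = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in run_with_retry(mapper, (key, value)):
            grouped[ikey].append(ivalue)  # group by intermediate key
    # Reduce phase: one call per distinct intermediate key.
    return dict(run_with_retry(reducer, (k, vs)) for k, vs in grouped.items())
```

Calling `map_reduce` with a list of (document name, text) pairs returns a dictionary that maps each lower-cased word to its total count across all documents. Because each mapper call depends only on its own input pair, a failed call can be repeated without affecting the others.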