Batch processing of large data volumes raises the need for highly efficient processing parallelization. In data pipeline processing, where the output of one stage is the input to the next one, processing parallelization relates to two main aspects: data parallelization, wherein a single task is executed in parallel on multiple data items, and task parallelization, where multiple tasks are executed in parallel on the same data, or on derivatives of the data obtained by preceding processing stages.
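The two forms of parallelization can be illustrated with a minimal sketch of a two-stage pipeline. The stage functions (`normalize`, `total`, `extremes`) are hypothetical examples introduced here for illustration, not part of the original text:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative single task for data parallelization.
def normalize(x):
    return x / 10.0

# Illustrative independent tasks for task parallelization.
def total(xs):
    return sum(xs)

def extremes(xs):
    return (min(xs), max(xs))

def run_pipeline(items):
    with ThreadPoolExecutor() as pool:
        # Data parallelization: one task executed in parallel
        # over multiple data items.
        normalized = list(pool.map(normalize, items))
        # Task parallelization: multiple tasks executed in parallel
        # on the derivative data produced by the preceding stage.
        f_total = pool.submit(total, normalized)
        f_extremes = pool.submit(extremes, normalized)
        return f_total.result(), f_extremes.result()
```

Here the output of the first stage (the normalized items) serves as the input to the second stage, mirroring the pipeline structure described above.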
Prior art solutions rely on hardware-intensive computing, such as mainframe computers or commodity clusters. Such solutions are difficult to set up and manage, generally expensive, and inflexible, as the infrastructure must be owned and maintained even during periods of low or no processing demand.
A practical solution is offered by cloud computing in the form of utility computing. The “function as a service” (FaaS) paradigm is a cloud-based computing service that allows customers to develop, run, and manage application functionalities without the burden of building and maintaining the infrastructure. In such a “serverless” paradigm, the cloud provider acts as the server, dynamically managing the allocation of machine resources. Accordingly, pricing is based on the actual amount of resources consumed by an application, rather than on pre-purchased units of capacity.
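The pay-per-use pricing described above can be sketched as a simple cost model in which each invocation is billed from the resources it actually consumed, namely its duration and allocated memory. The rate constants below are illustrative placeholders, not any provider's actual pricing:

```python
# Hypothetical per-unit rates, for illustration only.
PRICE_PER_GB_SECOND = 0.0000166
PRICE_PER_REQUEST = 0.0000002

def invocation_cost(duration_s, memory_gb):
    # Cost of a single function invocation: a fixed per-request fee
    # plus a charge proportional to the resources actually consumed.
    return PRICE_PER_REQUEST + duration_s * memory_gb * PRICE_PER_GB_SECOND

def batch_cost(invocations):
    # invocations: iterable of (duration_s, memory_gb) pairs.
    # No cost accrues when no invocations run, in contrast to
    # pre-purchased units of capacity.
    return sum(invocation_cost(d, m) for d, m in invocations)
```

The key contrast with the hardware-intensive solutions above is that `batch_cost([])` is zero: idle periods incur no charge.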
However, the FaaS paradigm introduces requirements for scheduling tasks, in particular tasks associated with batch processing, and for operating under various constraints. The requirements and constraints may relate to enhancing some or all of the following factors, and possibly additional ones: efficiency; makespan, i.e., the time from the beginning of the batch until all data is processed; cost; fault tolerance; rapid recovery in cases of failure without having to repeat significant parts of the job; data throughput, given database and network limitations; or the like.
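The makespan factor admits a direct formulation. A minimal sketch, assuming each task reports its finish time relative to the start of the batch:

```python
def makespan(batch_start, finish_times):
    # Makespan: the time from the beginning of the batch until all
    # data is processed, i.e. until the last task finishes.
    return max(finish_times) - batch_start
```

A scheduler optimizing for this factor seeks to minimize the latest finish time; this objective can trade off against the cost and throughput factors listed above.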