The invention relates in general to the field of computer-implemented methods for pre-processing work items, in a context where work items are being queued for pre-processing by a receiver, which uses a blocking interval to build blocks of work items, which are then passed to a scheduler for subsequent processing.
Cluster computing frameworks are also known. For example, “Apache Spark” is an open source cluster computing framework comprising multiple components. Its core component provides distributed task dispatching, scheduling, and basic I/O functionalities. It fundamentally relies on so-called Resilient Distributed Datasets (RDDs), i.e., logical collections of elements partitioned across machines (nodes) of a cluster, which can be operated on in parallel. Amongst other components, the “Spark Streaming” component (an extension of the core Spark component) enables scalable, high-throughput, fault-tolerant processing of live data streams, as well as streaming analytics. Spark Streaming receives input data streams and divides the data into batches, which are then processed to generate a stream of results, also in batches.
Optimizations are available to minimize the processing time of each batch. Besides the batch period, another parameter to consider is the receiver's blocking interval, which is determined by the configuration parameter spark.streaming.blockInterval and needs to be set beforehand. That is, received data are coalesced into blocks of data before being processed. The number of blocks in each batch determines the number of tasks that will be used to process the received data. The number of tasks per receiver per batch is approximately equal to the batch interval divided by the block interval. For example, a block interval of 100 ms results in 10 tasks per 1-second batch. If the number of tasks is too low (i.e., less than the number of cores per machine), subsequent processing is inefficient, as not all the available cores are used to process the data.
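By way of illustration only, the relation between batch interval, block interval, and number of tasks may be sketched as follows. This is a minimal sketch that does not call Spark itself; the function names tasks_per_batch and uses_all_cores, and the figure of 8 cores per machine, are hypothetical and chosen for this example.

```python
def tasks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of tasks per receiver per batch.

    Each block produced during a batch interval yields one task, so the
    count is roughly the batch interval divided by the block interval.
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    return batch_interval_ms // block_interval_ms


def uses_all_cores(batch_interval_ms: int, block_interval_ms: int,
                   cores_per_machine: int) -> bool:
    """True if there are at least as many tasks as cores per machine."""
    return tasks_per_batch(batch_interval_ms, block_interval_ms) >= cores_per_machine


# Example from the text: 100 ms blocks in a 1-second batch.
print(tasks_per_batch(1000, 100))    # 10

# Assumed machine with 8 cores (hypothetical value).
print(uses_all_cores(1000, 100, 8))  # True: 10 tasks for 8 cores
print(uses_all_cores(1000, 500, 8))  # False: only 2 tasks for 8 cores
```

In the last case, a block interval of 500 ms yields only 2 tasks per 1-second batch, so most of the 8 assumed cores would remain idle.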
As an alternative to using multiple input streams or receivers, one may repartition the input data stream, so as to distribute received batches of data across a specified number of machines in the cluster before further processing.
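The effect of such a repartitioning may be pictured with the following toy sketch. It does not use Spark and is not Spark's actual implementation; it merely assumes a round-robin assignment of records to partitions, and the function name repartition as well as the sample batch are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def repartition(records: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Spread the records of one batch over num_partitions partitions.

    Round-robin assignment is an assumption made for illustration; it keeps
    partition sizes within one record of each other.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    partitions: List[List[T]] = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


# A received batch of 10 records, distributed over 4 partitions.
batch = list(range(10))
parts = repartition(batch, 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]
```

Each resulting partition can then be processed by a separate task, irrespective of how many blocks the receiver originally produced.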