This disclosure relates to data processing and, in particular, to adaptive resource scheduling for data stream processing.
Stream processing systems are computing systems designed to support ingestion and processing of continuous streams of data. One well-known stream processing system is Spark™ (SPARK is a trademark of the Apache Software Foundation), which is an open source cluster computing framework.
Spark Core is the principal component of SPARK and provides distributed task dispatching, scheduling, and input/output (I/O) functionalities. These functions are accessed through an application programming interface (API) that employs a Resilient Distributed Dataset (RDD) abstraction of a fault-tolerant dataset. The API serves as a “driver” that invokes parallel operations on an RDD by passing a function to Spark Core, which then schedules the function's execution in parallel on the cluster.
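The driver pattern described above, in which a function is handed to a scheduler that runs it in parallel over the pieces of a dataset, can be illustrated with a minimal single-machine sketch. The sketch does not use the SPARK API: the names `split_into_partitions` and `parallel_map` are hypothetical, a thread pool stands in for the cluster, and fault tolerance is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def split_into_partitions(data: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Split data into num_partitions contiguous slices of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions: List[List[T]] = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(list(data[start:end]))
        start = end
    return partitions


def parallel_map(partitions: List[List[T]], fn: Callable[[T], U],
                 max_workers: int = 4) -> List[List[U]]:
    """Apply fn to every element, processing each partition as one task."""
    def run_partition(part: List[T]) -> List[U]:
        return [fn(x) for x in part]

    # The pool plays the role of the scheduler: it dispatches one task per
    # partition and returns the results in partition order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_partition, partitions))


# The "driver" passes a function; the pool decides where each task runs.
parts = split_into_partitions(list(range(10)), 3)
squares = parallel_map(parts, lambda x: x * x)
```

In this toy model the caller never addresses an individual worker; it supplies only the data partitions and the function, which mirrors the division of labor between the driver and Spark Core described above.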
Spark Streaming is an additional SPARK component that utilizes the scheduling provided by Spark Core to repetitively perform streaming analytics on a continuous data stream. Spark Streaming receives a DStream (discretized stream), which is a sequence of micro-batches of RDD data (of any of a variety of supported formats), and performs desired RDD transformation(s) on the DStream.
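The discretization of a continuous stream into micro-batches can be sketched as follows. This is an illustrative model only, not the Spark Streaming implementation: the `(arrival_time, payload)` record shape, the `discretize` and `transform_stream` names, and the one-second batch interval are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical record shape: (arrival time in seconds, payload).
Record = Tuple[float, str]


def discretize(records: Iterable[Record], batch_interval: float) -> Iterator[List[str]]:
    """Group time-ordered records into consecutive fixed-duration micro-batches."""
    batch: List[str] = []
    window_end = batch_interval
    for arrival, payload in records:
        # Close out windows (emitting empty batches if needed) until the
        # record falls inside the current window.
        while arrival >= window_end:
            yield batch
            batch = []
            window_end += batch_interval
        batch.append(payload)
    yield batch  # last, possibly partial, batch


def transform_stream(batches: Iterable[List[str]],
                     fn: Callable[[str], str]) -> List[List[str]]:
    """Apply the same per-record transformation to every micro-batch."""
    return [[fn(x) for x in batch] for batch in batches]


records = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (3.4, "d")]
batches = list(discretize(records, batch_interval=1.0))
upper = transform_stream(batches, str.upper)
```

With these inputs the four one-second windows hold 2, 1, 0, and 1 records respectively, so the amount of work per micro-batch varies even though the batch interval is fixed.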
Because SPARK tasks are assigned dynamically within a cluster based, for example, on data locality and the availability of computational and storage resources, SPARK can provide significant benefits, such as improved load balancing and fault recovery. As appreciated by the present disclosure, discretizing the data records of the input data stream into micro-batches of data also presents a significant challenge with regard to resource allocation and utilization.