Distributed data streaming systems are information processing systems that receive large volumes of data, typically from external data sources such as, by way of example, sensor networks, stock trading or other financial networks, web traffic sources, network monitoring sources, gaming systems, Internet of Things (IoT) networks, etc. The data generated by such data sources are typically unbounded sequences of messages that are received over long periods of time or even perpetually in some cases. Since large volumes of data are being ingested, the distributed data streaming system attempts to process the data using multiple compute nodes in a scalable and near real-time manner. Various data analytics may typically be performed on the data. Examples of such distributed data streaming systems include, but are not limited to, Apache Storm™, Apache Flink®, Apache Apex™, and Apache Spark™ (The Apache Software Foundation).
To cope with unexpected failures in such long-running or even perpetual systems, many existing data streaming systems support generation of state checkpoints (or simply, checkpoints) at pre-defined fixed intervals (e.g., every 10 seconds) or upon fixed events (e.g., every 100 messages). A checkpoint operation is an operation that involves saving a snapshot (copy or image) of the state of the system (e.g., data and system parameters) in permanent (persistent) shared storage, so that, following a failure, the data streaming system can restart by reloading the most recent snapshot from the permanent shared storage and resuming processing from the point at which that snapshot was taken.
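By way of illustration only, the fixed-event checkpointing described above may be sketched as follows. This is a minimal single-process sketch and not the implementation of any of the systems named above. The names CheckpointedCounter, every_n_messages, checkpoint_path, and demo are hypothetical, a local JSON file stands in for the permanent shared storage, and the message source is assumed to be replayable from a saved offset.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Hypothetical stream operator: counts keys, checkpoints every N messages."""

    def __init__(self, checkpoint_path, every_n_messages=100):
        self.checkpoint_path = checkpoint_path    # stand-in for shared storage
        self.every_n_messages = every_n_messages  # fixed-event checkpoint trigger
        self.offset = 0                           # messages processed so far
        self.counts = {}                          # operator state: key -> count

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1
        if self.offset % self.every_n_messages == 0:
            self.save_checkpoint()

    def save_checkpoint(self):
        snapshot = {"offset": self.offset, "counts": self.counts}
        directory = os.path.dirname(os.path.abspath(self.checkpoint_path))
        # Write to a temporary file, then rename, so that a failure during
        # the write cannot leave a partially written snapshot in place.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.checkpoint_path)

    def restore(self):
        """Reload the most recent snapshot; return False if none exists."""
        if not os.path.exists(self.checkpoint_path):
            return False
        with open(self.checkpoint_path) as f:
            snapshot = json.load(f)
        self.offset = snapshot["offset"]
        self.counts = snapshot["counts"]
        return True


def demo():
    """Simulate a failure after five messages, then recover and finish."""
    stream = ["a", "b", "a", "c", "a", "b", "a"]
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "checkpoint.json")
        operator = CheckpointedCounter(path, every_n_messages=3)
        for key in stream[:5]:  # the operator "fails" after this loop
            operator.process(key)

        recovered = CheckpointedCounter(path, every_n_messages=3)
        recovered.restore()     # state as of the checkpoint at message 3
        for key in stream[recovered.offset:]:  # replay from the saved offset
            recovered.process(key)
        return recovered.offset, recovered.counts
```

In this sketch, the two messages processed after the last checkpoint and before the simulated failure are lost with the failed operator and are processed again on replay. The work repeated after a failure is therefore bounded by the checkpoint interval, which is the trade-off a fixed interval imposes: a shorter interval reduces reprocessing at the cost of more frequent writes to storage.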