Streaming query deployments make up a vital part of cloud-oriented applications, such as online advertising, online analytics, and Internet of Things scenarios. Streaming query deployments vary widely in their data, logic, and statefulness, and are typically executed in multi-tenant distributed environments with varying uptime service level agreements (SLAs). In order to achieve a specified SLA, any number of resiliency strategies may be employed to protect against failure and data loss. However, choosing a particular resiliency strategy for a particular streaming query deployment is scenario-dependent and not straightforward.
For example, a first streaming query deployment may have a small to moderate event rate (e.g., tens of thousands of events per second) and a very large number of states (e.g., one for each browsing session), each of which is active for a short period of time and then typically expires after a long holding period. Rather than redundantly storing states in RAM, states may be cached in the streaming nodes for a period, then be sent to a key-value store after some time, where they are written redundantly to cheap storage, and typically expire unaccessed. As a result, the streaming nodes in this scenario need little RAM, and their in-memory state may be checkpointed and recovered cheaply.
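The caching arrangement described above can be illustrated with a minimal sketch. It is not taken from any particular streaming system: the class name `TieredStateCache`, the `idle_seconds` threshold, and the use of a plain dictionary as a stand-in for the redundant key-value store are all assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


class TieredStateCache:
    """Hypothetical sketch: keep recently used state in RAM, spill idle state to a store."""

    def __init__(self, kv_store: Dict[str, Any], idle_seconds: float) -> None:
        self.kv_store = kv_store          # assumed stand-in for a redundant key-value store
        self.idle_seconds = idle_seconds  # assumed idle time before state leaves RAM
        self._hot: Dict[str, Tuple[Any, float]] = {}  # key -> (state, last access time)

    def put(self, key: str, state: Any, now: float) -> None:
        """Create or update state in the in-memory tier."""
        self._hot[key] = (state, now)

    def get(self, key: str, now: float) -> Any:
        """Return state, pulling it back into RAM if it was spilled.

        Raises KeyError if the key is in neither tier.
        """
        if key in self._hot:
            state, _ = self._hot[key]
        else:
            state = self.kv_store.pop(key)
        self._hot[key] = (state, now)
        return state

    def spill_idle(self, now: float) -> int:
        """Move state idle for at least idle_seconds to the store; return the count moved."""
        idle = [k for k, (_, t) in self._hot.items() if now - t >= self.idle_seconds]
        for k in idle:
            state, _ = self._hot.pop(k)
            self.kv_store[k] = state
        return len(idle)

    def hot_count(self) -> int:
        """Number of states currently held in RAM."""
        return len(self._hot)
```

Under these assumptions, only the entries in the in-memory tier would need to be checkpointed, which is why the per-node recovery cost stays small when most sessions are idle.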
In contrast, an online gaming system with a high event rate (e.g., millions of events per second), a large number of active users, and little locality for a cache to leverage would require a stronger resiliency strategy to maintain a similar SLA. That is, the tolerance for recovery latency in an online gaming system is very low, which makes it very difficult to recover a failed node quickly enough to meet the SLA.
While various resiliency strategies exist, and different resiliency strategies are better suited for different streaming query deployments, there are no reliable rules or guidelines for selecting a particular resiliency strategy for a particular streaming query deployment.