Technical Field
The present teaching relates to methods, systems, and programming for stream processing. Particularly, the present teaching is directed to methods, systems, and programming for event state management in stream processing.
Discussion of Technical Background
Near-real-time stream processing is an important computation paradigm. Some use cases include quick response to external events (e.g., stock ticker), early trend detection (e.g., buzzing news or search queries), and dynamic user modeling and personalization (e.g., view/click processing). Currently, known applications use different (mostly ad hoc) streaming technologies and provide different levels of service with regard to system reliability, fault-tolerance, persistence of data, and atomicity and consistency of state management. In contrast, almost all offline bulk processing has either moved or is being moved to the HADOOP platform, which provides a standard functional programming model, reliability and repeatability guarantees, and redundant HDFS storage.
A large class of near-real-time stream processing applications needs to manipulate state in some form. This state could take the form of a key-value database store, windows of time-series data, aggregations of past data, or any combination thereof. Typically, enterprise-level applications require the state to be maintained with classical ACID (atomicity, consistency, isolation, and durability) qualities. However, most consumer applications can operate under a simpler ACID2.0 (associative, commutative, idempotent, and distributed) model, which is much more cost-effective and scalable for large data. The challenge is to create an appropriate execution platform that implements this model in a manner that is easy to use and operate. Also, many streaming applications need to provide a consistent bulk view of the state data so that it can be seeded, processed, enriched, and accessed in bulk using a large-scale, fault-tolerant, offline grid processing framework such as MapReduce. It is much more efficient if the same infrastructure can handle both requirements. Additionally, the system needs to provide a consistent and repeatable view of the state with respect to the stream processing, even in the face of failures.
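As an illustration only, the ACID2.0 properties named above can be sketched with a small state-merge operation. The sketch below is not taken from the present teaching; the state shape (a key mapped to a set of event records), the names `merge`, `apply_event`, and `total`, and the use of an event identifier to make replays harmless are all assumptions chosen for clarity. Per-key set union is used because it is associative, commutative, and idempotent, so partial states built on different nodes can be combined in any order, and a replayed event does not change the result.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical state shape (an assumption, not from the source):
# key -> set of (event_id, amount) records observed for that key.
State = Dict[str, FrozenSet[Tuple[str, int]]]


def merge(a: State, b: State) -> State:
    """Combine two partial states by per-key set union.

    Set union is associative, commutative, and idempotent, so the
    result does not depend on merge order or on repeated merges.
    """
    keys = a.keys() | b.keys()
    return {k: a.get(k, frozenset()) | b.get(k, frozenset()) for k in keys}


def apply_event(state: State, key: str, event_id: str, amount: int) -> State:
    """Fold one event into the state; replaying the same event is a no-op."""
    return merge(state, {key: frozenset({(event_id, amount)})})


def total(state: State, key: str) -> int:
    """Derive an aggregate (sum of amounts) from the records kept for a key."""
    return sum(amount for _, amount in state.get(key, frozenset()))


if __name__ == "__main__":
    # Two partial states, as if produced by two different workers.
    s1 = apply_event({}, "user1", "e1", 2)
    s2 = apply_event({}, "user1", "e2", 3)
    print(merge(s1, s2) == merge(s2, s1))   # order of merging does not matter
    print(merge(s1, s1) == s1)              # merging a state twice changes nothing
    print(total(merge(s1, s2), "user1"))    # aggregate over both events
```

Keeping full event records is the simplest way to obtain idempotence in a sketch; a practical system would likely use a more compact representation, which this example does not attempt to show.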
Known stream processing frameworks address some aspects of high-throughput, low-latency stream processing. But in most systems, the stream processing is not directly integrated with a state management store that is also available for offline processing. For example, some known stream processing frameworks lack any persistent state storage and read-write synchronization of their own. In other words, those known solutions give little weight to the consistency and reliability of the state generated by online systems. Other known solutions perform read-write synchronization at the event level, which unduly increases the overhead of state management. Therefore, there is a need to provide an improved solution for event state management in stream processing to solve the above-mentioned problems.