1. Field of the Invention
The present invention relates generally to stream processing and, in particular, to a computer implemented method for processing data streams. Still more particularly, the present invention relates to a computer implemented method, apparatus, and computer usable program code for applying stochastic control optimization to determine lazy versus eager message propagation in distributed stateful messaging systems.
2. Description of the Related Art
Stream processing computing applications are applications in which the data comes into the system in the form of an unbounded sequence or stream of messages, sometimes called “events.” Note that the volume of data being processed may be too large to be stored, and intermediate results are typically required before all input messages have arrived. Therefore, the information stream must be processed on the fly. Examples of stream processing computing applications include video processing, audio processing, streaming databases, and sensor networks.
In stream processing systems, producers, also called publishers, deliver streams of events. Consumers, also called subscribers, request continuous updates to results of computations on data from one or more streams. Results are expressions, such as “average trading price of the stocks having the top ten total volume traded.” Subscribers define the desired results via a specification, sometimes called a “query.” For example, the specification may consist of a continuous query using relational operators, such as join, select, project, aggregation, and top-K and may be expressed in a language, such as structured query language (SQL). Computations on event streams that require data to be retained between messages, such as to compute a running average or sum, are called “stateful computations”, and queries requiring stateful computations are called stateful queries.
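As an illustration of a stateful computation, the following minimal sketch computes a running average over a stream of prices. The class name `RunningAverage` and the method `on_event` are hypothetical names chosen for this example and are not taken from any particular stream processing system.

```python
class RunningAverage:
    """Stateful computation: the count and sum are retained between messages."""

    def __init__(self):
        self.count = 0      # number of events seen so far
        self.total = 0.0    # sum of all prices seen so far

    def on_event(self, price):
        """Fold one event into the retained state and return the updated result."""
        self.count += 1
        self.total += price
        return self.total / self.count


avg = RunningAverage()
# Each incoming event yields an intermediate result before the stream has ended.
results = [avg.on_event(p) for p in (10.0, 20.0, 30.0)]
```

A stateless computation, by contrast, could be written as a plain function of the current message alone, with nothing retained between calls.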
The stream processing system implements the function of receiving events and computing and propagating changes to the subscribed state by means of a delivery plan, also called a “query execution plan.” The delivery plan is implemented as a data flow network of transforms. For example, the network of transforms may be a collection of Java® objects. Each transform accepts messages representing changes to an input to the transform operator, updates a local state, and produces messages representing changes to the result of the transform operator. The changes are then propagated “downstream” towards other transforms in the flow or towards the ultimate consumers. The transforms are deployed on a distributed network of machines called servers or message brokers. When the data flow is distributed over multiple servers, some of the message traffic between one transform and another will flow over a physical connection, such as a TCP/IP connection.
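The transform behavior described above (accept a change, update local state, emit a change downstream) can be sketched as follows. This is a simplified, single-process illustration; the classes `Transform`, `TotalVolume`, and `Collector` and the message tuples are assumptions made for the example, not part of any described system, and a real deployment would place transforms on separate servers.

```python
class Transform:
    """Node in a data flow network; forwards result changes downstream."""

    def __init__(self):
        self.downstream = []

    def connect(self, other):
        """Link this transform to a downstream transform and return it."""
        self.downstream.append(other)
        return other

    def emit(self, change):
        for transform in self.downstream:
            transform.on_change(change)

    def on_change(self, change):
        raise NotImplementedError


class TotalVolume(Transform):
    """Keeps per-symbol total volume as local state."""

    def __init__(self):
        super().__init__()
        self.totals = {}

    def on_change(self, change):
        symbol, volume = change
        old = self.totals.get(symbol, 0)
        new = old + volume
        self.totals[symbol] = new            # update local state
        self.emit((symbol, old, new))        # propagate the change to the result


class Collector(Transform):
    """Stands in for an ultimate consumer; records what it receives."""

    def __init__(self):
        super().__init__()
        self.received = []

    def on_change(self, change):
        self.received.append(change)


source = TotalVolume()
sink = source.connect(Collector())
for trade in [("IBM", 100), ("XYZ", 50), ("IBM", 25)]:
    source.on_change(trade)
```

After the three trades, `sink.received` holds one change message per trade, each carrying the old and new total for the affected symbol.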
In many stream processing systems, unnecessary messages may be delivered from one transform to the next. A message is unnecessary when, for example, a transform in a server A sends it to a downstream transform in a server B, only to have that message discarded or ignored. For example, a change to the stock price of issue one may be sent from A to B, but B then ignores it because issue one is not one of the top ten trading stocks. Sending the ignored messages wastes bandwidth, processing power, and memory. Conversely, if messages that later turn out to be needed are suppressed, the downstream server may have to request additional messages by sending explicit requests to the upstream servers, resulting in delays.
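The tradeoff between sending every message and suppressing some of them can be made concrete with a toy cost count. The sketch below is only an illustrative model under stated assumptions: the functions `eager` and `lazy`, the set `needed` (symbols server B actually uses), and the set `forwarded` (symbols server A chooses to send) are hypothetical, and the model ignores timing and assumes one explicit request per missing symbol.

```python
def eager(changes, needed):
    """Server A sends every change; server B discards those it does not need.

    Returns (messages_sent, messages_wasted).
    """
    sent = len(changes)
    wasted = sum(1 for symbol, _ in changes if symbol not in needed)
    return sent, wasted


def lazy(changes, forwarded, needed):
    """Server A sends only symbols in `forwarded`; B requests the rest it needs.

    Returns (messages_sent, explicit_requests), assuming one request
    per needed symbol that was suppressed.
    """
    sent = sum(1 for symbol, _ in changes if symbol in forwarded)
    missing = {s for s, _ in changes if s in needed and s not in forwarded}
    return sent, len(missing)


# Price changes for three issues; B only needs AAA and CCC.
changes = [("AAA", 10.0), ("BBB", 5.0), ("AAA", 10.5), ("CCC", 7.0)]
needed = {"AAA", "CCC"}

eager_cost = eager(changes, needed)           # every message sent, one ignored
lazy_cost = lazy(changes, {"AAA"}, needed)    # fewer sent, one request for CCC
```

In this toy run, eager delivery sends four messages of which one is wasted, while the lazy choice sends two messages but incurs one explicit request and its associated delay.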