High-performance stream processing is critical in many sense-and-respond application domains. In many cases, the number of distinct sub-streams on which one needs to perform group-independent aggregation and join operations is not known a priori (for example, environmental sensors may come online or be turned off, and securities may be added to or removed from the stock market), and the logical sub-streams carrying state updates or individual enterprise transactions might be multiplexed in a single physical stream feed. Consequently, expressing queries on this data using existing relational stream processing algebra is often either not possible or very costly, in particular when an application processes streams with very high data rates, as is common for stock market applications, environmental sensors, etc.
Large-scale sense-and-respond systems continuously receive external signals in the form of one or more streams from multiple sources and employ analytics aimed at detecting critical conditions, so that the system can react to them, potentially in a proactive fashion. Examples of such systems include SCADA (Supervisory Control And Data Acquisition) systems deployed for monitoring and controlling manufacturing, power distribution, and telecommunication networks; environmental monitoring systems; and algorithmic trading platforms. Sense-and-respond systems share the need to calculate baselines over multiple samples of incoming signals (for example, instantaneous electricity production levels, the fair price of a security, etc.) and to correlate the computed value for one signal with other signals (for example, instantaneous electricity consumption levels, the ask (or offer) price of a security, etc.). The computation of baselines is typically performed by aggregating multiple samples based on a group-by aggregation predicate. Such an aggregation can, for example, be executed in different ways and at different granularities by establishing a window over the incoming data.
This aggregation step can be referred to as the sensing portion of a system. The correlation operation, on the other hand, is typically the result of a join operation, where two signals are paired, generally using a window over the incoming data streams, and the result is used to drive an automated response whereby, for example, a request for the generation of extra power is made or a block of securities is sold or bought. This operation corresponds to the responding portion of a sense-and-respond system. In many situations, the number of signals to be independently aggregated and correlated is very high. For example, stock market feeds can contain trading information for thousands of different securities. A financial firm processing and acting on information gleaned from the US equity market, for example, must track more than 3000 different stocks and an even larger number of options on these stocks. Similarly, there are around 3000 power plants in the United States and millions of consumers. Streaming sense-and-respond systems must be able to cope with such a large influx of data.
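The sensing and responding steps described above can be sketched for a single pair of signals as follows. This is only an illustrative sketch, not a description of any particular system: the names `SlidingAverage` and `sense_and_respond`, the count-based window, the pairing of samples by position, and the `request_extra_power` action label are all assumptions made for the example.

```python
from collections import deque


class SlidingAverage:
    """Count-based sliding window over the last `size` samples (assumed window type)."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # oldest sample is evicted automatically

    def update(self, value):
        """Add a sample and return the current baseline (window average)."""
        self.window.append(value)
        return sum(self.window) / len(self.window)


def sense_and_respond(production, consumption, size=3, margin=0.0):
    """Hypothetical example: pair production and consumption samples by position
    (a simple join), compare their windowed baselines, and record a response
    whenever consumption exceeds production by more than `margin`."""
    prod_avg = SlidingAverage(size)
    cons_avg = SlidingAverage(size)
    actions = []
    for tick, (p, c) in enumerate(zip(production, consumption)):
        baseline_p = prod_avg.update(p)   # sensing: aggregate each signal
        baseline_c = cons_avg.update(c)
        if baseline_c > baseline_p + margin:  # responding: act on the correlation
            actions.append((tick, "request_extra_power"))
    return actions


if __name__ == "__main__":
    print(sense_and_respond([100, 100, 100, 100], [90, 95, 120, 130], size=2))
```

A real deployment would repeat this pattern for every signal pair, which is where the scale discussed above becomes a concern.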
In both examples, one can argue that the underlying architectural pattern of these sense-and-respond streaming systems includes a large number of window-based aggregation operations coupled in some fashion with a large number of window-based join operations, all operating on a collection of distinct sub-streams. Moreover, in many cases, the number of distinct sub-streams might not even be known a priori (for example, securities may be added to and/or removed from the market), and the logical sub-streams might be multiplexed in a single physical stream feed (for example, a Reuters Stock Market Data Feed). Consequently, expressing such queries in relational stream processing algebra is often not practical, or is very costly, due to the overhead created by the large number of resulting independent queries, as well as the need to update the set of queries as sub-streams dynamically arrive and depart.
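To make the contrast with a one-query-per-sub-stream approach concrete, the following sketch shows a single operator that keeps independent window state per key over a multiplexed feed, creating state when a new sub-stream first appears and discarding it when the sub-stream departs. It is a minimal illustration under stated assumptions: the class `KeyedWindowAverage`, its method names, the tuple layout `(key, value)`, and the ticker symbols are all hypothetical.

```python
from collections import deque


class KeyedWindowAverage:
    """Hypothetical operator: one sliding-window average per logical sub-stream,
    all fed from a single multiplexed physical stream."""

    def __init__(self, size):
        self.size = size
        self.windows = {}  # key -> deque holding that sub-stream's recent samples

    def on_tuple(self, key, value):
        """Route a tuple to its sub-stream's window, creating the window on first sight."""
        window = self.windows.get(key)
        if window is None:
            # A previously unseen sub-stream arrived; no query needs to be redeployed.
            window = deque(maxlen=self.size)
            self.windows[key] = window
        window.append(value)
        return sum(window) / len(window)

    def on_departure(self, key):
        """Discard state for a sub-stream that has left (e.g. a delisted security)."""
        self.windows.pop(key, None)


if __name__ == "__main__":
    op = KeyedWindowAverage(size=2)
    feed = [("AAA", 10.0), ("BBB", 5.0), ("AAA", 20.0), ("AAA", 30.0)]
    for symbol, price in feed:
        print(symbol, op.on_tuple(symbol, price))
    op.on_departure("BBB")
```

The design choice illustrated here is that the set of keys is discovered from the data itself, so arrivals and departures change only the operator's internal state rather than the set of deployed queries.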