Many businesses actively monitor data streams and application messages using existing systems to detect business events or situations and take time-critical actions. These existing systems include databases and sensor networks, along with systems for publication-subscription, integration, monitoring, and business intelligence. However, these existing systems are neither scalable nor efficient enough to handle complex event processing of the kind required by high-volume continuous data streams.
Some existing systems are centered around a pull model in which data is first stored and then queried. In such systems, all incoming data is first written to storage (e.g., local trace files on disk), which reduces performance (e.g., only a few thousand events per second may be processed). Such existing systems make no provision for data from different sources arriving with different latencies.
Other existing systems use stateless filters to process streams of data. In such systems, the filters operate on each message from the data stream but do not retain any information from one message to the next. As such, these systems cannot be used to draw conclusions about a particular sequence of received messages.
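The limitation of stateless filtering can be illustrated with a minimal sketch. The message format, the function names (`stateless_filter`, `detect_consecutive_failures`), and the "three failures in a row" condition are all hypothetical and chosen only for illustration; they are not drawn from any particular existing system.

```python
from typing import Callable, Dict, Iterable, Iterator

Message = Dict[str, str]  # hypothetical message shape: {"status": "ok" | "fail"}


def stateless_filter(
    messages: Iterable[Message], predicate: Callable[[Message], bool]
) -> Iterator[Message]:
    """Pass through each message that satisfies the predicate.

    Nothing is remembered between messages, so the decision for one
    message can never depend on the messages that preceded it.
    """
    for message in messages:
        if predicate(message):
            yield message


def detect_consecutive_failures(
    messages: Iterable[Message], threshold: int
) -> Iterator[int]:
    """Yield the index at which a run of failures reaches `threshold`.

    This requires state (the current run length) carried from one
    message to the next, which a stateless filter does not keep.
    """
    run_length = 0
    for index, message in enumerate(messages):
        run_length = run_length + 1 if message["status"] == "fail" else 0
        if run_length == threshold:
            yield index


statuses = ["ok", "fail", "fail", "ok", "fail", "fail", "fail"]
stream = [{"status": status} for status in statuses]

# The stateless filter can only report individual failing messages.
failures = list(stateless_filter(stream, lambda m: m["status"] == "fail"))

# The stateful detector can report a condition on the sequence itself.
alerts = list(detect_consecutive_failures(stream, threshold=3))
```

In this sketch the stateless filter returns every failing message but cannot tell whether the failures were consecutive, whereas the detector that carries a counter between messages can.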
Some existing systems rely on relational algebra to mathematically describe data manipulation. The Structured Query Language (SQL), for example, is a higher-level form of relational algebra implemented in a pull model. In such a model, a query is translated into relational algebra operators. A SQL optimizer may re-order the operators using different permutations to identify a semantically equivalent expression that produces the desired result with the least processing. This expression is referred to as a query execution plan. Incoming data is processed according to this query execution plan.
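The re-ordering of operators described above can be sketched with two toy relational operators. This is an illustrative sketch only: the tables, the `select` and `join` helpers, and the use of a comparison count as the cost measure are assumptions made for the example and do not reflect how any particular SQL optimizer is implemented.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical stored tables (pull model: the data is stored before it is queried).
orders: List[Row] = [
    {"order_id": 1, "customer_id": 10, "amount": 250},
    {"order_id": 2, "customer_id": 11, "amount": 40},
    {"order_id": 3, "customer_id": 12, "amount": 75},
    {"order_id": 4, "customer_id": 10, "amount": 900},
]
customers: List[Row] = [
    {"customer_id": 10, "name": "Ada"},
    {"customer_id": 11, "name": "Grace"},
    {"customer_id": 12, "name": "Alan"},
]


def select(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def join(left: List[Row], right: List[Row], key: str) -> Tuple[List[Row], int]:
    """Nested-loop equi-join; also returns the number of row comparisons made."""
    result: List[Row] = []
    comparisons = 0
    for left_row in left:
        for right_row in right:
            comparisons += 1
            if left_row[key] == right_row[key]:
                result.append({**left_row, **right_row})
    return result, comparisons


def is_large(row: Row) -> bool:
    return row["amount"] > 100


# Plan A: join first, then select.
joined, cost_a = join(orders, customers, "customer_id")
result_a = select(joined, is_large)

# Plan B: select first, then join (a semantically equivalent re-ordering).
result_b, cost_b = join(select(orders, is_large), customers, "customer_id")
```

Both plans produce the same rows, but under this toy cost measure the second plan performs half as many comparisons, which is the kind of difference an optimizer weighs when choosing a query execution plan.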
Functional operators implementing relational algebra, however, are not suited to manipulating high-volume, continuous stream data in real-time. Additionally, the semantics of SQL queries over streaming data are vague. Because all the data is stored first, the processing performance of SQL queries is limited by the disk input/output of the hardware storing the data to be queried. Additionally, SQL installations cannot be cascaded for distributed processing. The result is a system that cannot operate on high-volume data streams in real-time.