Processing a data stream can be quite resource intensive. A data stream is a sequence of tuples, where each tuple is an ordered list of values. Many established and emerging applications can be naturally modeled as data stream applications. In order to monitor a data stream, a user registers continuous queries with a Data Stream Management System (DSMS). These queries continuously update their state and produce new output as new stream tuples arrive. In a typical data stream application, users expect at least quasi-real-time results from their continuous queries, even if the stream has a high arrival rate. Due to these requirements, data stream processing can be very resource intensive.
Examples of data stream applications include monitoring of networks and computing systems, tracking of consumer credit card purchases and of telephone calls dialed by callers, monitoring of sensor networks, and supply chain management and inventory tracking based on RFID tags. Other examples involve measurement data, such as IP traffic at router interfaces, sensor network readings, and road traffic measurements. Even publish-subscribe systems and the filtering and dissemination of RSS feeds (such as for monitoring the “blogosphere”) can be viewed as data stream applications.
One way in which this resource-intensive problem has been addressed is to distribute the processing load over multiple nodes in a network. A fundamental challenge of such a distributed stream processing system, however, is to select the correct criterion for distributing load in the system. Load balancing in traditional distributed and parallel systems is a well-studied problem. In those systems, incoming jobs (queries) have to be assigned to processing nodes such that throughput is maximized or latency (response time) is minimized. This is usually achieved by an assignment strategy that takes into account the availability of input data at the processing nodes and the communication costs for moving data between nodes. These techniques do not carry over to data stream processing, because load balancing decisions on a per-tuple basis are too costly.
In a data stream processing system, the roles of queries and data are reversed from traditional distributed systems. Namely, queries are continuously active while new data tuples are streaming in at a high rate. This creates new challenges for a data stream processing system compared to traditional distributed systems. In a data stream processing system the individual input tuples are small. It is therefore too costly to decide for each tuple individually to which processing node it should be routed. Furthermore, for operators with state (such as sliding window joins), re-routing tuples would also require migrating operator state to the new processing nodes.
In order to amortize the optimization cost, tuple routing decisions should be made such that they benefit many stream tuples. This is achieved by assigning operators to processing nodes. These operators take real-time data from sources such as network monitoring sensors or stock market feeds and perform some form of processing on the data (such as ranking, correlating, or filtering the data). The data is not stored somewhere and then processed; instead, the data is processed in real time as it arrives at the operator. In other words, each operator inputs a stream of events, processes the stream of events, and outputs a processed stream of events.
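The operator model described above can be illustrated with a minimal sketch, assuming Python generators stand in for streaming operators. The tuple layout (a sensor identifier and a reading) and the operator names filter_op and scale_op are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical tuple layout: (sensor id, reading).
Event = Tuple[str, float]


def filter_op(stream: Iterable[Event], threshold: float) -> Iterator[Event]:
    """Streaming filter: forward only events whose reading exceeds threshold."""
    for sensor, value in stream:
        if value > threshold:
            yield (sensor, value)


def scale_op(stream: Iterable[Event], factor: float) -> Iterator[Event]:
    """Streaming map: multiply each reading by factor as the event arrives."""
    for sensor, value in stream:
        yield (sensor, value * factor)


def run_pipeline(events: Iterable[Event]) -> list:
    """Chain the two operators; each event flows through without being stored first."""
    return list(scale_op(filter_op(events, 10.0), 2.0))
```

Each operator consumes one stream of events and produces another, so operators can be chained, and each one could in principle be placed on a different processing node.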
Given a set of these streaming operators running on a data processing system, and given a collection of computing devices or processors connected together, the goal is to determine how best to assign the streaming operators to the processors. This is called operator placement. The operator placement, and hence the routing pattern for the tuples, is used for a large number of input tuples and is changed only when system statistics change significantly. The task of finding such a placement can be called the Distributed Operator Placement (DOP) problem.
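As a rough illustration of what an operator placement is, the sketch below represents a placement as a mapping from operator name to node index and computes the resulting load on each node. The greedy rule shown (heaviest operator first, onto the currently least-loaded node) is a generic load-balancing heuristic assumed here for illustration only; it is not a method taken from this document, and the operator names and load values are hypothetical.

```python
from typing import Dict, List


def node_loads(placement: Dict[str, int],
               op_load: Dict[str, float],
               num_nodes: int) -> List[float]:
    """Sum the load of the operators assigned to each node."""
    loads = [0.0] * num_nodes
    for op, node in placement.items():
        loads[node] += op_load[op]
    return loads


def greedy_placement(op_load: Dict[str, float], num_nodes: int) -> Dict[str, int]:
    """Illustrative heuristic: place the heaviest operators first,
    each onto the node that currently carries the least load."""
    loads = [0.0] * num_nodes
    placement: Dict[str, int] = {}
    for op in sorted(op_load, key=op_load.get, reverse=True):
        node = loads.index(min(loads))  # least-loaded node so far
        placement[op] = node
        loads[node] += op_load[op]
    return placement


# Hypothetical per-operator loads (arbitrary units).
example_load = {"filter": 4.0, "join": 6.0, "rank": 3.0, "agg": 5.0}
example_placement = greedy_placement(example_load, num_nodes=2)
```

Once computed, the placement fixes the route of every tuple until it is recomputed, which is how the cost of the decision is spread over many tuples.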
Several techniques have been proposed for placing operators in a distributed streaming system for the purpose of balancing load and improving query latency. These techniques are based on some type of operator placement strategy. One obvious solution to the DOP problem is to assign operators to nodes such that system load is balanced for a “typical” case. However, optimizing for the “typical” load is not sufficient, because data streams in practice tend to be “bursty”, meaning that data arrives in large waves at one moment and as a trickle the next. This bursty nature makes it virtually impossible to react to short-duration load bursts with any kind of load re-balancing. While the system is busy adapting to a burst, the load situation might already have changed significantly enough to require another adaptation.
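The gap between “typical” and bursty load can be made concrete with a small sketch. The load series and the node capacity of 1.0 below are invented for illustration; the helper overloaded_steps is likewise hypothetical.

```python
from typing import List, Sequence


def overloaded_steps(load_series: Sequence[Sequence[float]],
                     capacity: float) -> List[int]:
    """Return the time steps at which the summed load of all operators
    on one node exceeds the node's capacity."""
    totals = [sum(step) for step in zip(*load_series)]
    return [t for t, load in enumerate(totals) if load > capacity]


# Hypothetical per-time-step loads of two operators placed on the same node.
op_a = [0.2, 0.2, 0.9, 0.2]
op_b = [0.3, 0.3, 0.8, 0.3]

# Average combined load is below the capacity of 1.0 ...
average_load = sum(op_a) / len(op_a) + sum(op_b) / len(op_b)

# ... but the simultaneous burst at time step 2 still overloads the node.
bursts = overloaded_steps([op_a, op_b], capacity=1.0)
```

A placement judged only by average_load looks acceptable here, even though the node is overloaded whenever both operators burst together.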
To address this problem, some techniques use resilient operator placements, where the system can handle a wide variety of load situations without any node being over-loaded. A related idea for distributed stream processing is to prevent load spikes by placing operators with uncorrelated load behavior onto the same node and to maximize load correlations between different nodes. Other techniques for distributing load for data stream processing include distributing the load of a single operator. Queuing theory has provided valuable insights into scheduling decisions in multi-operator and multi-resource queuing systems, but results are usually limited by high computational cost and strong assumptions about underlying data and processing cost distributions.
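The intuition behind co-locating operators whose loads do not rise together can be sketched as follows. The example uses perfectly correlated and perfectly anti-correlated load series as extreme, invented cases; the function names and data are assumptions for illustration and do not reproduce any particular published technique.

```python
from statistics import mean, pvariance
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length load series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def combined_variance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Variance of the total node load if both operators share one node."""
    return pvariance([x + y for x, y in zip(xs, ys)])


# Hypothetical load series: b moves with a, c moves against a.
a = [1.0, 3.0, 1.0, 3.0]
b = [1.0, 3.0, 1.0, 3.0]
c = [3.0, 1.0, 3.0, 1.0]

var_with_correlated = combined_variance(a, b)      # spikes add up
var_with_anticorrelated = combined_variance(a, c)  # spikes cancel out
```

With these series, pairing a with b doubles every spike on the shared node, while pairing a with c yields a flat total load, which is why the correlation between operator loads is a useful input to a placement decision.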
The problem with all of these operator placement strategies is that they are heuristics that have merely been found to achieve good overall results. In other words, these approaches take some heuristic that researchers believe results in a good placement of the streaming operators, without any underlying mathematical structure, and then apply whichever placement is observed to perform well. The point is that these existing techniques do not provide a solid mathematical or optimization foundation for the DOP problem. While an assignment of streaming operators to processors is made, questions about the quality of the assignment, and even about how to measure that quality in a principled, precise way, are left unanswered. Thus, these placement strategies are based on trial and error, and these heuristic solutions are not designed to directly optimize a specific, application-oriented optimization goal.