In an increasingly information-centric world, people and organizations rely on time-critical tasks that require accessing data from highly dynamic information sources and generating responses derived from on-line processing of data in near real-time. In many application domains, these information sources can take the form of data streams that are time-ordered series of events or sensor readings.
Due to the large and growing number of users, jobs, and information sources, as well as the high aggregate rate of data streams distributed across remote sources, performance and scalability are key challenges in stream processing systems (SPSs). In some programming models, a stream processing application may be made up of a group of operators, which may be small pieces of code that carry out functions such as generic data transformations, filtering, annotation, classification, de-multiplexing, splitting, or other domain-specific operations. Operators may interact through streams, each of which can carry a potentially infinite sequence of tuples. A challenge in building distributed stream processing applications is to find an effective and flexible way of mapping the logical graph of operators into a physical graph that can be deployed on a set of distributed nodes. This involves deciding how best to assign operators to the computing nodes that execute them. Put differently, building high-performance distributed stream processing applications requires finding the right level of granularity at which operators are mapped to processes deployed on a set of distributed compute nodes. Creating deployable flow graphs out of user-specified operator-level flow graphs therefore has both flexibility and performance aspects.
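The granularity question described above can be illustrated with a minimal sketch. It is not taken from any particular SPS: the operator names, the process labels (`pe_0`, `pe_1`), and the helper `physical_graph` are all hypothetical. The sketch collapses an operator-level flow graph into a process-level one under a given placement, so that a fine-grained mapping (one operator per process) can be compared with a coarser one in terms of how many streams must cross process boundaries.

```python
from collections import defaultdict

# Hypothetical logical flow graph: each operator maps to its downstream operators.
LOGICAL_GRAPH = {
    "source": ["filter"],
    "filter": ["annotate"],
    "annotate": ["classify"],
    "classify": ["sink"],
    "sink": [],
}


def physical_graph(logical, placement):
    """Collapse an operator-level graph into a process-level graph.

    `placement` maps each operator name to a (hypothetical) process label.
    Returns the operators grouped per process and the set of
    inter-process edges, i.e. streams that cross a process boundary.
    """
    members = defaultdict(list)
    for op in logical:
        members[placement[op]].append(op)

    edges = set()
    for op, downstream in logical.items():
        for dst in downstream:
            # Streams between operators in the same process stay internal.
            if placement[op] != placement[dst]:
                edges.add((placement[op], placement[dst]))
    return dict(members), edges


# Fine granularity: every operator runs in its own process.
fine = {op: f"pe_{op}" for op in LOGICAL_GRAPH}

# Coarser granularity: the first three operators share one process.
coarse = {
    "source": "pe_0",
    "filter": "pe_0",
    "annotate": "pe_0",
    "classify": "pe_1",
    "sink": "pe_1",
}

fine_members, fine_edges = physical_graph(LOGICAL_GRAPH, fine)
coarse_members, coarse_edges = physical_graph(LOGICAL_GRAPH, coarse)

print(len(fine_members), "processes,", len(fine_edges), "inter-process streams")
print(len(coarse_members), "processes,", len(coarse_edges), "inter-process streams")
```

In this toy pipeline the fine-grained placement leaves every stream crossing a process boundary, while the coarser placement keeps most streams internal at the cost of fewer independently deployable units. The sketch only counts boundary crossings; it does not model the actual performance or flexibility of either mapping.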