The growth of real-time digital systems, wireless sensor networks, and distributed processing has led to the emergence of a class of applications that focus on the continuous analysis of real-time data. This data takes numerous forms (for example, audio, video, images, text, and sensor data) and is typically subjected to complex processing. Stream processing is a programming paradigm that has been shown to be suitable for processing massive amounts of continuous data in real time. It is based on the application of a series of processing operators to each element in a continuous data set (or stream) rather than the sequential processing of each data item.
While stream processing need not be implemented in a distributed fashion, it is in this domain that it is most often considered. The flow of data in a stream processing system can be represented as a dataflow graph in which each node models a processing task, or role, that may produce output streams of continuous data from its input streams. Each processing role is then deployed to one of a set of (possibly distributed) processors. The existing literature on stream processing has mainly focused on a centralized application manager that allocates each processing role to a suitable processor. A centralized determination of the role allocation is effective in many scenarios and can facilitate a global optimization of the allocation.
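As an illustration of the dataflow-graph view and of centralized role allocation, the following minimal Python sketch may help. Everything in it is an assumption made for the example: the role names, the per-role costs, the processor names, and the greedy least-loaded heuristic, which merely stands in for whatever optimization a real application manager would perform.

```python
# Hypothetical dataflow graph: each key is a processing role, and each value
# lists the roles that consume its output stream.
dataflow = {
    "capture": ["filter"],
    "filter": ["detect", "archive"],
    "detect": ["alert"],
    "archive": [],
    "alert": [],
}

# Assumed processing cost of each role, in arbitrary units.
role_cost = {"capture": 2, "filter": 3, "detect": 5, "archive": 1, "alert": 1}


def allocate_roles(dataflow, role_cost, processors):
    """Centralized greedy allocation (illustrative only).

    Roles are placed in order of decreasing cost, each on the processor
    that currently carries the lowest total load. Ties are broken by name
    so that the result is deterministic.
    """
    load = {p: 0 for p in processors}
    allocation = {}
    for role in sorted(dataflow, key=lambda r: (-role_cost[r], r)):
        target = min(processors, key=lambda p: (load[p], p))
        allocation[role] = target
        load[target] += role_cost[role]
    return allocation, load


allocation, load = allocate_roles(dataflow, role_cost, ["p0", "p1"])
for role, processor in sorted(allocation.items()):
    print(f"{role} -> {processor}")
```

Because the manager sees every role and every processor at once, it can balance the load globally. It also has to recompute or repair the allocation whenever the graph or the processor set changes, since all decisions pass through this single function.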
However, as the volume of data, the number of processors, and the complexity of the dataflow graph all increase, a centralized approach may suffer from scalability issues. This problem is amplified if the number of processing nodes is constantly changing, which may be the case if the nodes are drawn from a volatile resource pool. Typical scenarios in which this may arise are stream processing systems that leverage the processing capability of a large network of mobile phones or wireless sensors, or the spare capacity within a network of fixed computers (as with the Berkeley Open Infrastructure for Network Computing, BOINC, architecture).