The present invention relates generally to data stream processing performed by a plurality of software-implemented tasks running on multiple nodes. More particularly, it relates to load distribution that aims to reduce energy costs by concentrating tasks on a subset of the possible nodes, called target nodes, that meet predetermined criteria for load distribution quality (such as anticipated end-to-end throughput time), so that the remaining nodes, which host no tasks, can be quiesced as an economy measure.
With the proliferation of Internet connections and network-connected sensor devices comes an increasing rate of digital information available from a large number of online sources. These online sources continually generate and provide data (e.g., news items, financial data, sensor readings, Internet transaction records, and the like) to a network in the form of data streams. Data stream processing units are typically implemented in a network to receive or monitor these data streams and process them to produce results in a usable format. For example, a data stream processing unit may be implemented to perform a join operation in which related data items from two or more data streams (e.g., from two or more news sources) are culled and then aggregated or evaluated, for example to produce a list of results or to corroborate each other.
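As a purely illustrative sketch of the join operation mentioned above, the following shows one common way such a join can be organized, a symmetric hash join, in which each arriving item is stored and immediately matched against items already seen on the other stream. The class name `SymmetricHashJoin`, the method `on_item`, and the use of a plain string as the join key are assumptions made for this example and are not taken from any particular data stream processing unit.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Hypothetical two-stream join: pairs items that share a join key."""

    def __init__(self):
        # One hash table per input stream, indexed by join key.
        self.tables = (defaultdict(list), defaultdict(list))

    def on_item(self, side, key, item):
        """Handle one item arriving on stream `side` (0 or 1).

        Returns the list of new (stream-0 item, stream-1 item) pairs
        produced by this arrival.
        """
        own, other = self.tables[side], self.tables[1 - side]
        own[key].append(item)                 # remember the item for later matches
        partners = other.get(key, [])         # items already seen on the other stream
        if side == 0:
            return [(item, p) for p in partners]
        return [(p, item) for p in partners]
```

Note that the operator must retain every item it has seen in order to match later arrivals, so it accumulates state as it runs. A practical version would typically bound this state, for example with a time window, which is omitted here for brevity.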
However, the input rates of typical data streams present a challenge. Because data stream processing units have no control over the sometimes sporadic and unpredictable rates at which data streams are input, it is not uncommon for a data stream processing unit to become loaded beyond its capacity, especially during rate spikes. Typical data stream processing units deal with such loading problems by arbitrarily dropping data streams (e.g., declining to receive the data streams). While this does reduce loading, the arbitrary nature of the strategy tends to result in unpredictable and sub-optimal data processing results, because data streams containing useful data may unknowingly be dropped while data streams containing irrelevant data are retained and processed.
The majority of known solutions for load distribution in event-driven systems assume that event processing components are stateless. Very few known solutions target stateful operators, because migrating stateful operators for load distribution purposes is challenging and expensive. To migrate a stateful operator, all data stream processing has to be stopped, all necessary state has to be migrated, and all event routing paths have to be updated accordingly. Furthermore, most of these solutions are centralized.
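The three migration steps described above (stop processing, move the state, update the routing paths) can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not a description of any known system: the `Operator` class, the `migrate` function, and the `routing` and `nodes` dictionaries are hypothetical names, and plain dictionaries stand in for networked nodes. Buffering of events that arrive while the operator is paused is not shown.

```python
class Operator:
    """Hypothetical stateful operator that keeps a running sum per key."""

    def __init__(self, name):
        self.name = name
        self.state = {}        # state accumulated from past events
        self.paused = False

    def process(self, key, value):
        if self.paused:
            raise RuntimeError("operator is paused for migration")
        self.state[key] = self.state.get(key, 0) + value


def migrate(operator_id, routing, nodes, source, target):
    """Move one operator from node `source` to node `target`.

    routing: dict mapping operator id -> name of the node hosting it
    nodes:   dict mapping node name -> dict of operator id -> Operator
    """
    old = nodes[source][operator_id]
    old.paused = True                      # step 1: stop processing
    new = Operator(old.name)
    new.state = dict(old.state)            # step 2: copy the state across
    nodes[target][operator_id] = new
    routing[operator_id] = target          # step 3: redirect event routing
    del nodes[source][operator_id]         # source node no longer hosts the operator
    return new
```

Even in this simplified form, the operator is unavailable between the first and third steps, and the cost of the second step grows with the size of the state, which illustrates why such migrations are considered expensive.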