The stream processing paradigm has always played a key role in time-critical systems. Traditional examples include digital signal processing systems, large-scale simulation platforms, multimedia clients and servers, and high resolution rendering farms as described in Microsoft DirectX version 9.0 software development toolkit. http://msdn.microsoft.com/directx/directxSDK/default.aspx; Aravind Arasu, Brian Babcock, Mayur Datar, Keith Ito, Itaru Nishizawa, Justin Rosenstein, and Jennifer Widom. STREAM: The Stanford stream data manager (demonstration description). In Proceedings of the 2003 ACM International Conference on Management of Data (SIGMOD 2003), San Diego, Calif., June 2003; J. T. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt. Ptolemy: a platform for heterogeneous simulation and prototyping. In Proceedings of the 1991 European Simulation Conference, Copenhagen, Denmark, June 1991; Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, and Mehul Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the 2003 Conference on Innovative Data Systems Research (CIDR 2003), Asilomar, Calif., 2003; P. D. Hoang and J. M. Rabaey. Scheduling of DSP programs onto multiprocessors for maximum throughput. IEEE Transactions on Signal Processing, 41(6):2225-2235, June 1993; Greg Humphreys, Mike Houston, Ren Ng, Randall Frank, Sean Ahern, Peter D. Kirchner, and James T. Klosowski. Chromium: A stream-processing framework for interactive rendering on clusters. 2002; Rainer Koster, Andrew Black, Jie Huang, Jonathan Walpole, and Calton Pu. Infopipes for composing distributed information flows. In Proceedings of the 2001 ACM Multimedia Workshop on Multimedia Middleware, Ottawa, Canada, October 2001; Stan Zdonik, Michael Stonebraker, Mitch Cherniack, Ugur Cetintemel, Magdalena Balazinska, and Hari Balakrishnan. The Aurora and Medusa projects. Bulletin of the IEEE Technical Committee on Data Engineering, March 2003, which are hereby incorporated by reference in their entirety. More recently, distributed stream processing systems are being developed for high performance transaction processing, continuous queries over sensor data, and enterprise-wide complex event processing.
In today's distributed stream data processing systems, massive numbers of real-time streams enter the system through a subset of processing nodes. Processing nodes may be co-located, for example within a single cluster, or geographically distributed over wide areas. Applications are deployed on processing nodes as a network of operators, or processing elements, as depicted in FIG. 1. Each data stream is comprised of a sequence of Stream Data Objects (SDOs), the fundamental information unit of the data stream. Each processing element performs some computation on the SDOs received from its input data stream, e.g., filter, aggregate, correlate, classify, or transform.
The output of this computation could alter the state of the processing element, and/or produce an output SDO with the summarization of the relevant information derived from (possibly multiple) input SDOs and the current state of the processing element. In order to carry out the computation, the processing element uses computational resources of the processing node on which it resides. The available computational resources on a node are finite, and are divided among the (possibly multiple) processing elements residing on the node either through time-sharing of the processor, or a parallel processing mechanism.
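By way of non-limiting illustration, the behavior of a processing element described above may be sketched as follows. The class names, the fields of the SDO, and the choice of a windowed mean as the aggregate are assumptions made only for this example and are not drawn from any particular system.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class SDO:
    """Stream Data Object; the two fields are hypothetical."""
    key: str
    value: float


@dataclass
class WindowAveragePE:
    """Hypothetical aggregating processing element.

    Every input SDO alters the element's state. One output SDO is
    produced per `window` inputs, summarizing them as their mean.
    """
    window: int
    _total: float = 0.0
    _count: int = 0

    def process(self, sdo: SDO) -> Optional[SDO]:
        # State update happens for every input SDO.
        self._total += sdo.value
        self._count += 1
        if self._count < self.window:
            return None  # state changed, no output yet
        out = SDO(key="mean", value=self._total / self._count)
        self._total, self._count = 0.0, 0
        return out


def run(pe: WindowAveragePE, stream: Iterable[SDO]) -> Iterator[SDO]:
    """Feed an input stream through one processing element."""
    for sdo in stream:
        out = pe.process(sdo)
        if out is not None:
            yield out


inputs = [SDO("x", v) for v in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]]
outputs = list(run(WindowAveragePE(window=3), inputs))
```

In this sketch seven input SDOs yield two output SDOs, and the seventh input remains reflected only in the state of the processing element.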
In a distributed stream processing system, both network and processor resources are constrained. Thus, efficient use of resources, low delay, and stable system operation are the critical resource management challenges. While these goals are typical for resource schedulers, properties of the distributed stream processing system complicate matters. For example, each processing element's resource utilization is constrained by processing elements that are upstream and downstream of the processing element in the processing graph. Further, a processing element's resource consumption may be state dependent, resulting in bursty processor and network utilization throughout the system. Even developing an appropriate measure of effectiveness is difficult because the units of work (input packets) and operations (processing element computations) are unequally weighted, and therefore monitoring resource utilization alone is insufficient.
Stream processing jobs are relatively long running and as new work is introduced into the system, the relative weights or priorities of the various jobs may change. The task of assigning weights or priorities to jobs may be performed by a human, or it may be performed by a “meta scheduler”. The goal of meta schedulers generally is to assign time-averaged allocation targets based on relative importance of work submitted to a system. In comparison, the goal of a resource scheduler is to enforce these long-term allocation targets. In traditional shared processor environments, resource schedulers are responsible for selecting a waiting process from the ready queue (queue of processes waiting) and allocating the resource (CPU) to it. Priority-based or proportional share schedulers allow a system administrator to configure the system such that when a job is submitted, a weight or priority may be assigned. This weight or priority information may then be used by the scheduler in the decision process for selecting a waiting process from the ready queue.
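As one possible illustration of how a proportional share scheduler may use such weights when selecting from the ready queue, the following sketch picks the ready job with the smallest weight-normalized processor usage. The function names, the one-quantum time step, and the assumption that every job is always ready are simplifications introduced for this example only.

```python
def pick_next(ready, weights, cpu_used):
    """Select the ready job with the least CPU received per unit of weight.

    Ties are broken by job name so the choice is deterministic.
    """
    return min(ready, key=lambda job: (cpu_used[job] / weights[job], job))


def simulate(weights, quanta):
    """Run `quanta` scheduling decisions, assuming all jobs stay ready."""
    cpu_used = {job: 0 for job in weights}
    for _ in range(quanta):
        job = pick_next(list(weights), weights, cpu_used)
        cpu_used[job] += 1  # the selected job runs for one quantum
    return cpu_used


# Job A is assigned three times the weight of job B.
shares = simulate({"A": 3, "B": 1}, quanta=400)
```

Over a long run the quanta received by each job approach the ratio of the assigned weights, which is the time-averaged target such a scheduler is meant to enforce.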
Current scheduling/meta-scheduling technology does not adequately address stream processing environments. Examples of scheduling/meta-scheduling technology are described in U.S. Pat. No. 4,814,978 entitled “Dataflow processing element, multiprocessor, and processes”; U.S. Pat. No. 5,241,677 entitled “Multiprocessor system and a method of load balancing thereof”; U.S. Pat. No. 5,742,821 entitled “Multiprocessor scheduling and execution”; U.S. Pat. No. 6,167,029 entitled “System and method for integrated data flow control”; U.S. Pat. No. 6,415,410 entitled “Sliding-window data flow control using an adjustable window size”; U.S. Pat. No. 6,426,944 entitled “Method and apparatus for controlling data messages across a fast packet network”; U.S. Pat. No. 6,694,345 entitled “External job scheduling within a distributed processing system having a local job control system”; U.S. Pat. No. 6,795,870 entitled “Method and system for network processor scheduler”; and U.S. Pat. No. 6,795,442 entitled “System and method for scheduling message transmission and processing in a digital network”, which are hereby incorporated by reference in their entirety. In a stream processing environment, the entities to be scheduled (processing elements) are interconnected such that the input (e.g., data packets) of one processing element is some or all of the output of one or more other processing elements. A problem arises when either the rate of data packets arriving at a processing element is bursty or the resources required to process a data packet are bursty.
Today's resource schedulers typically take one of three approaches: strict enforcement, guarantee-limit enforcement and velocity enforcement. One problem with strict enforcement is that if the resource scheduler attempts to strictly enforce the long-term allocation target provided by the meta-scheduler, the input buffer of the processing element may overflow when a burst of data arrives. Additionally, consider the case when two processing elements (PE A and PE B) are executing in a single processing node. During some time intervals, the input rate of PE A may temporarily require less than its long-term allocation, while the input rate of PE B may temporarily require more than its long-term allocation. If the resource scheduler strictly adheres to the allocation of the meta-scheduler, the buffers of PE B overflow, even though resources are not fully utilized. Strict enforcement is further described in Saowanee Saewong and Ragunathan (Raj) Rajkumar. Cooperative scheduling of multiple resources. In RTSS '99: Proceedings of the 20th IEEE Real-Time Systems Symposium, page 90, Washington, D.C., USA, 1999. IEEE Computer Society, which is hereby incorporated by reference in its entirety.
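The PE A and PE B scenario above may be illustrated, under stated assumptions, by the following small simulation. The tick-based model, the arrival rates, the allocation of five SDOs per tick to each processing element, and the buffer size of ten SDOs are all hypothetical values chosen for this example.

```python
def strict_enforcement(arrivals, allocation, buffer_size):
    """Simulate strict per-PE allocations over a sequence of ticks.

    arrivals:    list of {pe: SDOs arriving this tick}
    allocation:  {pe: SDOs the PE may process per tick}, never exceeded
    buffer_size: input buffer capacity of each PE, in SDOs
    Returns (dropped SDOs per PE, total unused processing capacity).
    """
    backlog = {pe: 0 for pe in allocation}
    dropped = {pe: 0 for pe in allocation}
    idle = 0
    for tick in arrivals:
        for pe, n in tick.items():
            accepted = min(n, buffer_size - backlog[pe])
            dropped[pe] += n - accepted  # buffer overflow
            backlog[pe] += accepted
        for pe, share in allocation.items():
            done = min(share, backlog[pe])
            backlog[pe] -= done
            idle += share - done  # capacity left unused by this PE
    return dropped, idle


# PE A needs less than its allocation; PE B receives a sustained burst.
arrivals = [{"A": 2, "B": 8}] * 10
dropped, idle = strict_enforcement(arrivals, {"A": 5, "B": 5}, buffer_size=10)
```

With these assumed values, PE B drops 25 SDOs over the ten ticks while 30 units of processing capacity allocated to PE A go unused, which is the inefficiency described above.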
Under guarantee-limit enforcement, the inputs to the resource scheduler are a minimum guaranteed allocation and a limit on the maximum allocation for each job. This solution would enable PE B (from the previous example) to utilize additional resources during periods of low activity for PE A. However, since the scheduler does not take the processing element's instantaneous buffer occupancy and input data rate into account, it does not increase the processing element's short-term processing allocation in the event of a burst of input data, thereby increasing the likelihood of a buffer overflow at the processing element. Guarantee-limit enforcement is further described in Shailabh Nagar, Rik van Riel, Hubertus Franke, Chandra Seetharaman, Vivek Kashyap, and Haoqiang Zheng. Improving Linux resource control using CKRM. In Proceedings of the 2004 Ottawa Linux Symposium, Ottawa, Canada, July 2004; Dionisio de Niz, Luca Abeni, Saowanee Saewong, and Ragunathan (Raj) Rajkumar. Resource sharing in reservation-based systems. In RTSS '01: Proceedings of the 22nd IEEE Real-Time Systems Symposium (RTSS '01), page 171, Washington, D.C., USA, 2001. IEEE Computer Society; Abhishek Chandra, Micah Adler, Pawan Goyal, and Prashant Shenoy. Surplus fair scheduling: A Proportional-Share CPU scheduling algorithm for symmetric multiprocessors. Pages 45-58, which are hereby incorporated by reference in their entirety.
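A minimal sketch of a guarantee-limit allocation for a single scheduling interval is given below. It is one possible reading of the approach and not the algorithm of any cited system; the function name, the two-pass structure, and the numeric values are assumptions, and the sketch further assumes that the guarantees sum to no more than the node capacity.

```python
def guarantee_limit_allocate(capacity, demand, guarantee, limit):
    """Allocate processing capacity for one interval.

    Pass 1 serves each PE up to its guaranteed minimum.
    Pass 2 hands spare capacity to PEs with unmet demand,
    but never beyond the statically configured limit.
    """
    alloc = {pe: min(demand[pe], guarantee[pe]) for pe in demand}
    spare = capacity - sum(alloc.values())
    for pe in sorted(demand):
        extra = min(spare, demand[pe] - alloc[pe], limit[pe] - alloc[pe])
        extra = max(extra, 0)
        alloc[pe] += extra
        spare -= extra
    return alloc


# PE A is lightly loaded; PE B has a backlog of 12 SDOs after a burst.
alloc = guarantee_limit_allocate(
    capacity=10,
    demand={"A": 2, "B": 12},
    guarantee={"A": 5, "B": 5},
    limit={"A": 7, "B": 7},
)
unused = 10 - sum(alloc.values())
```

In this example PE B is raised from its guarantee of 5 to its limit of 7, yet one unit of capacity remains unused while PE B still has a backlog, because the static limit does not respond to the burst.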
Under velocity enforcement, each processing element is assigned a weight; the higher the weight, the less the processing element should have to wait for a resource when being selected from the ready queue. Thus, the resource scheduler bases its selection from the ready queue on the weight (velocity) assigned to the processing element, and the amount of time the processing element has had to wait for resources in the current epoch. Consider the scenario where the input data rate into a PE is bursty. At a given instant of time the input buffer of the PE is empty, i.e., the PE is idle. Subsequently, the PE receives a burst of data. A velocity-based scheduler would process one SDO in the PE's input buffer and then wait until the PE's wait time exceeds the velocity value of the PE before processing the subsequent SDOs. Owing to the burst, it is possible for the processing element's input buffer to overflow with data while it is in the wait state. Velocity enforcement is further described in P. Bari, C. Covill, K. Majewski, C. Perzel, M. Radford, K. Satoh, D. Tonelli, and L. Winkelbauer. IBM enterprise workload manager, which is hereby incorporated by reference in its entirety.
Thus, traditional scheduling approaches are not directly applicable to stream processing systems. This is primarily because the requirements of such systems go beyond traditional processor sharing; for example, stream processing systems challenge the practice of statically assigning priorities to processing elements. Furthermore, resource management specifically for distributed stream processing systems has focused on effective placement of processing elements and load management. In dynamic placement techniques, the operator (PE) placement can be modified during execution to adapt to changes in resource availability, based on maximizing some objective function on a time-averaged basis. Dynamic placement is further described in Peter Pietzuch, Jonathan Ledlie, Jeffrey Shneidman, Mema Roussopoulos, Matt Welsh, and Margo Seltzer. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), April 2006. Load shedding was proposed as a means to intelligently drop tuples (SDOs) from input queues, based on thresholds and potentially packet content. Load shedding is further described in Magdalena Balazinska, Hari Balakrishnan, and Michael Stonebraker. Load management and high availability in the Medusa distributed stream processing system. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 929-930, New York, N.Y., USA, 2004. ACM Press, which are hereby incorporated by reference in their entirety.
Both the dynamic placement work and the load shedding work target environments where the system must adjust to the available underlying resource allocations (either by moving operators or by shedding load). However, these techniques ultimately require over-provisioning to deal with the unpredictable nature of stream processing.
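For illustration only, threshold-based load shedding of the kind described above may be sketched as follows. The class name, the boolean `important` flag standing in for a content-based decision, and the particular threshold and capacity values are assumptions of this example and do not reproduce the policy of any cited system.

```python
from collections import deque


class SheddingQueue:
    """Hypothetical input queue that sheds SDOs under load.

    Below `threshold` every SDO is accepted. At or above `threshold`
    only SDOs flagged as important are accepted. At `capacity`
    every SDO is shed.
    """

    def __init__(self, threshold, capacity):
        self.threshold = threshold
        self.capacity = capacity
        self.items = deque()
        self.shed = 0

    def offer(self, sdo, important):
        """Try to enqueue an SDO; return True if it was accepted."""
        full = len(self.items) >= self.capacity
        loaded = len(self.items) >= self.threshold
        if full or (loaded and not important):
            self.shed += 1
            return False
        self.items.append(sdo)
        return True


# A burst of eight SDOs arrives; even-numbered SDOs are treated as important.
queue = SheddingQueue(threshold=3, capacity=5)
for i in range(8):
    queue.offer(i, important=(i % 2 == 0))
```

In this sketch the queue retains SDOs 0, 1, 2, 4 and 6 and sheds three SDOs. The shed data is lost, which is one reason such techniques are paired with over-provisioning in practice.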
Therefore, a need exists to overcome the problems with the prior art as discussed above.