Stream processing refers to systems and methods that analyze continuous streams of data, i.e., streaming data, such as audio streams, video streams, instant messaging or chat data streams, voice over internet protocol streams and e-mail. The analysis serves a variety of functions, from monitoring customer satisfaction to detecting fraud in the financial services industry. Analyzing data as it streams, rather than storing it and applying data mining techniques afterward, offers the promise of more timely analysis and allows more data to be processed with fewer resources.
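The resource advantage of stream processing can be illustrated with a minimal sketch: an incremental aggregate that consumes each item as it arrives and keeps only constant state, instead of storing the full history for later mining. The class name and interface here are purely illustrative, not drawn from any system cited above.

```python
class StreamingMean:
    """Incremental average over a data stream (illustrative sketch).

    Uses constant memory per stream: only a running count and total are
    kept, so no history needs to be stored for later batch analysis.
    """

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, x):
        """Fold one arriving item into the aggregate and return the
        up-to-date mean, available immediately rather than after a
        store-then-mine pass."""
        self.count += 1
        self.total += x
        return self.total / self.count
```

Each call to `update` yields a current result, which is what makes the analysis timelier than a store-and-mine approach over the same data.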
Systems for processing streams of data utilize the continuous streams of data as inputs, process these data in accordance with prescribed processes and produce ongoing results. Commonly used data processing stream structures perform traditional database operations on the input streams. Examples of these commonly used applications are described in Daniel J. Abadi et al., The Design of the Borealis Stream Processing Engine, CIDR 2005—Second Biennial Conference on Innovative Data Systems Research (2005), Sirish Chandrasekaran et al., Continuous Dataflow Processing for an Uncertain World, Conference on Innovative Data Systems Research (2003) and The STREAM Group, STREAM: The Stanford Stream Data Manager, IEEE Data Engineering Bulletin, 26(1), (2003). In general, such systems utilize traditional database structures and operations, because structures and operations for customized applications are substantially more complicated than the database paradigm. The reasons for this comparison are illustrated, for example, in Michael Stonebraker, Uğur Çetintemel, and Stanley B. Zdonik, The 8 Requirements of Real-Time Stream Processing, SIGMOD Record, 34(4):42-47, (2005).
In distributed systems where a job scheduler is the central point of control, self-healing characteristics are typically mandatory. Job schedulers control distributed lifecycle operations including job submission, scheduling, dispatching, re-dispatching, suspension, resumption, cancellation and completion. Failure of a job scheduler can crash the distributed system as a whole, compromising processing that is occurring on hundreds or even thousands of nodes.
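The lifecycle operations listed above can be viewed as a state machine over job states. The following sketch is a hypothetical rendering of such a lifecycle, assuming one plausible set of legal transitions; real schedulers differ in their exact state sets and transition rules.

```python
from enum import Enum, auto


class JobState(Enum):
    SUBMITTED = auto()
    SCHEDULED = auto()
    DISPATCHED = auto()
    SUSPENDED = auto()
    CANCELED = auto()
    COMPLETED = auto()


# Illustrative transition table: which lifecycle operations are legal
# from each state (an assumption, not any cited scheduler's actual rules).
TRANSITIONS = {
    JobState.SUBMITTED: {JobState.SCHEDULED, JobState.CANCELED},
    JobState.SCHEDULED: {JobState.DISPATCHED, JobState.CANCELED},
    # Re-dispatching is modeled as going back to SCHEDULED.
    JobState.DISPATCHED: {JobState.SUSPENDED, JobState.SCHEDULED,
                          JobState.CANCELED, JobState.COMPLETED},
    # Resumption returns a suspended job to DISPATCHED.
    JobState.SUSPENDED: {JobState.DISPATCHED, JobState.CANCELED},
    JobState.CANCELED: set(),
    JobState.COMPLETED: set(),
}


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = JobState.SUBMITTED

    def transition(self, new_state):
        """Apply a lifecycle operation, rejecting illegal transitions."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
```

Centralizing this table in one scheduler process is precisely what makes the scheduler a single point of failure for the lifecycle of every job it manages.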
One type of job scheduler is the batch scheduler, examples of which are described in Y. Etsion, D. Tsafrir, A Short Survey on Commercial Cluster Batch Schedulers, Technical Report 2005-13, Hebrew University, May 2005. These existing batch schedulers include Moab/Maui; LoadLeveler, whose ancestor is Condor, M. J. Litzkow, M. Livny, M. W. Mutka, Condor—a Hunter of Idle Workstations, Proc. of 8th International Conference on Distributed Computing Systems, pp. 104-111, 1988; Load Sharing Facility; Portable Batch System; Sun Grid Engine; and OSCAR. The high-level goal of these systems is to efficiently distribute and manage work across resources. While some of these systems can scale to a large number of nodes, such schedulers do not consider interconnected tasks. In addition, it is not feasible to apply their recovery techniques directly.
Because systems currently in use are expected to scale to thousands of nodes, and because of the possibly interconnected nature of their jobs, these systems need to embody autonomic capabilities, J. O. Kephart, D. M. Chess, The Vision of Autonomic Computing, IEEE Computer, 36(1): 41-50, January 2003, to make management of the system feasible. For example, hardware failures are inevitable, and the system must be able to adapt by reconstructing applications on other nodes to keep the analysis running. A related scheme for automatic management and process migration, the "Laundromat Model", J. G. Hansen, E. Christiansen, E. Jul, The Laundromat Model for Autonomic Cluster Computing, Proc. of International Conference on Autonomic Computing 2006, pp. 114-123, June 2006, works by making the unit of control a virtual machine. This technique allows processes to be migrated, but does not consider the relationships between them.
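The adaptation described above, reconstructing work on surviving nodes after a hardware failure, can be sketched as a simple reassignment routine. The function name and the least-loaded placement policy are assumptions made for illustration only; they do not describe any cited system's actual recovery mechanism.

```python
def reassign_on_failure(assignments, failed_node, live_nodes):
    """Move every job off a failed node onto surviving nodes (sketch).

    assignments: dict mapping node name -> list of job ids; mutated
    in place. Each orphaned job is placed on the currently
    least-loaded live node (a hypothetical placement policy).
    """
    orphaned = assignments.pop(failed_node, [])
    for job in orphaned:
        target = min(live_nodes, key=lambda n: len(assignments[n]))
        assignments[target].append(job)
    return assignments
```

A real autonomic system would also have to restore each job's state and, for interconnected jobs, rewire the stream connections between them, which is exactly the aspect the per-process migration schemes above do not address.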
Another related management infrastructure for stream processing, which is described in B. F. Cooper, K. Schwan, Distributed Stream Management using Utility-driven Self-adaptive Middleware, Proceedings of the Second International Conference on Autonomic Computing, pp. 3-14, 2005, uses resource constraints and utility functions of business value to optimize application resource utilization. The Borealis system, M. Balazinska, H. Balakrishnan, S. Madden, M. Stonebraker, Fault-Tolerance in the Borealis Distributed Stream Processing System, Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 13-24, New York, N.Y., 2005, lets users trade off between availability and consistency in a stream processing system by setting a simple threshold. Investigation of software failure and recovery in an autonomically managed environment has also been undertaken in the context of operating systems, A. Bohra, I. Neamtiu, F. Sultan, Remote Repair of Operating System State Using Backdoors, Proceedings of the First International Conference on Autonomic Computing, pp. 256-263, 2004, and intrusion detection and recovery, H.-H. S. Lee, G. Gu, T. N. Mudge, An Intrusion-Tolerant and Self-Recoverable Network Service System Using a Security Enhanced Chip Multiprocessor, Proceedings of the Second International Conference on Autonomic Computing, pp. 263-273, 2005.
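The threshold-based availability/consistency tradeoff mentioned above can be sketched in a few lines: below a user-set waiting threshold the operator blocks for consistent input, and past it the operator proceeds and marks its output tentative. This is a simplified rendering of the idea, not Borealis's actual protocol; the function and label names are invented for illustration.

```python
def handle_input(item, waited_ms, threshold_ms):
    """Decide how to proceed when upstream input may be missing (sketch).

    item: the arrived input, or None if input is still missing.
    Returns a (label, payload) pair where the label is "stable",
    "block", or "tentative" (hypothetical labels).
    """
    if item is not None:
        # Input arrived: emit a normal, consistent result.
        return ("stable", item)
    if waited_ms < threshold_ms:
        # Favor consistency: keep waiting for the missing input.
        return ("block", None)
    # Threshold exceeded: favor availability and emit tentative output,
    # to be corrected later if the missing input eventually arrives.
    return ("tentative", None)
```

Raising the threshold favors consistency at the cost of latency; lowering it favors availability at the cost of emitting more tentative results.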