Data-driven stream processing has applications in many fields, including stock market prediction, intrusion detection, and disaster recovery. Such processing is data-driven in the sense that the analysis system continuously reacts to the streaming data. This distinguishes it from traditional database-oriented systems, which operate on static stored data.
Stream processing systems take continuous streams of data as inputs, process these data in accordance with prescribed processes, and produce ongoing results. Commonly used stream processing systems perform traditional database operations on the input streams. Examples of such systems are described in Daniel J. Abadi et al., The Design of the Borealis Stream Processing Engine, CIDR 2005, Second Biennial Conference on Innovative Data Systems Research (2005); Sirish Chandrasekaran et al., Continuous Dataflow Processing for an Uncertain World, Conference on Innovative Data Systems Research (2003); and The STREAM Group, STREAM: The Stanford Stream Data Manager, IEEE Data Engineering Bulletin, 26(1), (2003). In general, these systems rely on traditional database structures and operations, because structures and operations for customized applications are substantially more complicated than the database paradigm. The reasons for this are illustrated, for example, in Michael Stonebraker, Ugur Çetintemel, and Stanley B. Zdonik, The 8 Requirements of Real-Time Stream Processing, SIGMOD Record, 34(4):42-47, (2005).
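By way of illustration only, the following minimal sketch shows a database-style operation (a selection followed by a sliding-window average) applied to a continuous input stream, with a result emitted for each qualifying input. The function name `windowed_average`, the tuple layout, and the sample values are assumptions made for this example and are not taken from any of the cited systems.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def windowed_average(
    stream: Iterable[Tuple[str, float]], symbol: str, window: int = 3
) -> Iterator[float]:
    """Continuously emit the average of the last `window` prices for `symbol`.

    Hypothetical example: selection (filter on symbol) plus a sliding-window
    aggregate, evaluated incrementally as each tuple arrives.
    """
    recent = deque(maxlen=window)  # holds only the most recent prices
    for sym, price in stream:
        if sym != symbol:  # selection: drop tuples for other symbols
            continue
        recent.append(price)
        yield sum(recent) / len(recent)  # ongoing result per input tuple


# Illustrative input; a real stream would be unbounded.
ticks = [("IBM", 10.0), ("XYZ", 5.0), ("IBM", 20.0), ("IBM", 30.0), ("IBM", 40.0)]
results = list(windowed_average(ticks, "IBM"))
```

Unlike a query over static stored data, the generator never sees the whole input at once; it keeps only the bounded window state and produces a new result whenever a matching tuple arrives.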
These systems typically operate independently and work only with the processing resources contained within a single system to analyze streams of data that are either produced by or directly accessible by the single site. Although multiple sites can be used, these sites operate independently and do not share resources or data.
One area of concern in the use of systems to process data streams is recovery from failures in software, hardware and applications within the system. Previous work on failure recovery and high availability includes M. Balazinska, H. Balakrishnan, S. Madden, and M. Stonebraker, Fault-Tolerance in the Borealis Distributed Stream Processing System, SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 13-24. ACM Press, New York, N.Y., USA, 2005. ISBN 1-59593-060, in which fault tolerance was achieved by employing “process-pairs.” That approach incurs a high overhead, which was acceptable in the under-utilized system described there but is not consistent with the design philosophy of some large scale distributed systems. Other approaches are described in J. Hwang, Y. Xing, U. Cetintemel, and S. Zdonik, A Cooperative, Self-Configuring High-Availability Solution for Stream Processing, ICDE '07, and in J.-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik, High-Availability Algorithms for Distributed Stream Processing, ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE '05), pages 779-790. IEEE Computer Society, Washington, D.C., USA, (2005) ISBN 0-7695-2285-8. However, these approaches have several shortcomings, including an inability to work with heterogeneous systems or to support multiple independent failures. In general, previous work either focused only on offering high availability or was applicable only to a relatively homogeneous environment.
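The overhead associated with a process-pair arrangement can be seen in the following simplified sketch, in which every input is processed by both a primary and a backup replica so that the backup can take over when the primary fails. This is a generic illustration under stated assumptions, not the protocol of the cited Borealis work; the class names `Replica` and `ProcessPair`, the running-sum operator, and the `work_done` counter are hypothetical.

```python
class Replica:
    """Hypothetical replica that keeps a running sum of its inputs."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.total = 0
        self.work_done = 0  # number of tuples this replica has processed

    def process(self, value: int) -> int:
        self.total += value
        self.work_done += 1
        return self.total


class ProcessPair:
    """Feeds every tuple to both replicas; output comes from the first live one."""

    def __init__(self) -> None:
        self.primary = Replica("primary")
        self.backup = Replica("backup")

    def fail_primary(self) -> None:
        self.primary.alive = False  # simulated failure

    def process(self, value: int) -> int:
        outputs = [
            r.process(value) for r in (self.primary, self.backup) if r.alive
        ]
        if not outputs:
            raise RuntimeError("both replicas have failed")
        return outputs[0]


pair = ProcessPair()
outputs = [pair.process(v) for v in (1, 2)]
pair.fail_primary()              # the backup already holds identical state
outputs.append(pair.process(3))  # results continue without interruption
```

In this sketch three input tuples cost five units of processing across the pair, which illustrates why duplicating work in this way is affordable mainly when the system has spare capacity.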