1. Field of the Invention
The present invention relates generally to replication of information in data processing environments and, more particularly, to systems and methodologies for analyzing, filtering, enhancing and processing streams of data.
2. Description of the Background Art
Stream processing is concerned with filtering, enriching and enhancing continuous data streams. In order to detect opportunities and threats as early as possible, stream processing systems often need to analyze complex, fast moving, heterogeneous streams of data in real-time. In many cases, they also need to be able to rapidly process historical streams. For example, the ability to rapidly process and analyze historical streams of data is useful in refining trading strategies in the financial services sector. Stream processing software and systems need to run continuously, providing analytics, statistics, filtering, inference, deduction, connection, pattern matching, tracking and tracing.
Stream processing is complementary to databases, data warehousing, data mining, and search engines. The emphasis in stream processing is on continuous time-based information, on time-critical analysis, and often on situations where most of the data is irrelevant to most people most of the time. In stream processing, one is often looking to pinpoint a rare and important opportunity or threat, without drowning in the relentless flow of data that is typically received by most users and organizations.
Processing of real-time and historical data streams is a critical component of information technology solutions in many application areas, including the following:
Applications in which continuous live data streams from people, sensors, systems and networks are automatically monitored, filtered, analyzed, enhanced and enriched in real-time.
Systems where continuous analytics on massive volumes of real-time data enable businesses and financial services organizations to intelligently discover and immediately respond to opportunities and threats, manage risk, ensure compliance, and deliver the best possible personalized customer experience at all times.
Solutions wherein continuous inference on semantic graphs allows intelligent real-time discovery of deep and important connections within live data streams.
Applications involving continuous real-time analysis of streams of data from networked wireless sensors, such as those that will enable a new intelligent, secure, optimized and highly energy-efficient infrastructure—a new generation of smart buildings, homes, factories, utilities, energy networks, IT systems, data centers, networks and other equipment, each with always-on energy saving, intrusion prevention, and predictive maintenance capabilities.
Solutions utilizing continuous real-time intelligent tracking and tracing (GPS, RFID) enabling powerful new location-based services to be launched, theft and counterfeiting to be reduced, and transportation and distribution to be optimized.
The following is a list of specific application areas that may involve stream processing, although it is by no means a complete list:
Business (Dynamic Pricing, Mobile Advertising, Customer Experience Management, Supply Chain, Logistics, Marketing Intelligence, Risk Management, Compliance, Counterfeit Prevention).
Web and Telecommunications (Marketplaces, Online Games, Social Networks, Personalized Newsfeeds, Semantic Web, Virtual Worlds, Location-Based Services, Fraud Prevention).
Government (Homeland Security, Intelligence, Defense, Compliance).
Financial Services, Banking and Insurance (Algorithmic Trading, Risk Management, Compliance Tracking, Live Oversight, Fraud Prevention, News Services).
Machine-to-Machine Computing (Sensors, Smart Buildings, Remote Monitoring, Predictive Maintenance, Intrusion Prevention, Location Tracking, RFID, Process Control and Optimization, System and Network Monitoring).
In areas such as financial services and government applications, a number of products have been developed which permit the application of rule-based systems (expert systems) to event streams. U.S. Pat. No. 6,782,381 (Method and apparatus for evaluating queries against received event information, Nelson, Giles John et al.), the description of which is hereby incorporated by reference, describes one such approach to rule-based stream processing. United States Patent Application 2004/0220921 (Time series monitoring system, Billock, Joseph Greg et al.), the description of which is hereby incorporated by reference, describes another rule-based approach to event processing. The various rule-based event processing products aim to provide complete applications for complex event processing by coordinating analysis across multiple simultaneous event streams, external databases and spreadsheets, by merging streams, triggering the addition of new compound events and so on. The general technical approach involves using open-ended rule-based expert systems that check against patterns for matches and applying logic such as “when pattern matches, do_action_a, do_action_b, etc.”. However, these rule-based expert systems have a number of limitations in processing of data streams. For example, with such systems it can be complex to express even rather simple stream processing applications where the aim is to filter and enrich a stream by partitioning it into a large number of substreams, computing substream properties across various time-based windows, then interleaving the resulting substreams to produce a filtered and enriched output stream. For applications of this kind, it would be desirable to have a solution providing the functionality and structuring of a carefully designed domain-specific programming language, rather than an open-ended rule-based system.
Another limitation of rule-based expert systems concerns parallelization. Handling large-scale stream processing applications requires the ability to easily and automatically parallelize stream processing algorithms and programs. In contrast to domain-specific programming languages, rule-based expert systems do not provide a framework in which automatic and efficient parallelization for large-scale implementation can be easily achieved.
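By way of illustration only, the kind of application described above (partitioning a stream into substreams, computing a property of each substream over a time-based window, and interleaving the results into a filtered and enriched output stream) might be sketched as follows. This sketch is not taken from any of the cited systems. The event layout, the names Event, Enriched and filter_and_enrich, the choice of a sliding-window mean as the substream property, and the min_ratio filter are all assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    """Hypothetical input record: one reading on one substream."""
    timestamp: float  # seconds, assumed non-decreasing across the input
    key: str          # substream identifier (e.g., a sensor or ticker name)
    value: float


class Enriched(NamedTuple):
    """Hypothetical output record: the event plus its substream's window mean."""
    timestamp: float
    key: str
    value: float
    window_mean: float


def filter_and_enrich(
    events: Iterable[Event], window_seconds: float, min_ratio: float
) -> Iterator[Enriched]:
    """Partition by key, compute a sliding-window mean per substream, and
    emit only events whose value is at least min_ratio times that mean."""
    windows: Dict[str, Deque[Event]] = defaultdict(deque)
    for event in events:
        # Partition: each key has its own window of recent events.
        window = windows[event.key]
        window.append(event)
        # Time-based window: drop events that are window_seconds old or older.
        while window[0].timestamp <= event.timestamp - window_seconds:
            window.popleft()
        # Substream property: mean of the values currently in the window.
        mean = sum(e.value for e in window) / len(window)
        # Filter and enrich. Because input is consumed in arrival order,
        # the outputs of all substreams are interleaved in that same order.
        if event.value >= min_ratio * mean:
            yield Enriched(event.timestamp, event.key, event.value, mean)


sample = [
    Event(0.0, "A", 10.0),
    Event(1.0, "B", 100.0),
    Event(2.0, "A", 10.0),
    Event(3.0, "A", 40.0),
    Event(4.0, "B", 100.0),
    Event(20.0, "A", 10.0),
]
spikes = list(filter_and_enrich(sample, window_seconds=10.0, min_ratio=1.5))
```

With the sample input, only the event at time 3.0 on substream "A" is emitted, since its value of 40.0 is at least 1.5 times the window mean of 20.0. The sketch is sequential; because each substream keeps independent state, substreams could in principle be assigned to separate workers, though nothing here does so.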
In recent years, a number of database research teams in universities have been adapting database techniques in general, and the SQL query language in particular, in order to perform certain types of stream processing (e.g., merge, join, sort, select) across multiple streams of structured data. For a recent example, see, e.g., Yijian Bai et al. “A Data Stream Language and System Designed for Power and Flexibility” in Proceedings of the 15th ACM Conference on Information and Knowledge Management (November 2006), the disclosure of which is hereby incorporated by reference. Several of these research projects have led to commercial products. However, these commercial solutions rely on the use of a database query language (e.g., SQL), and provide for adapting the database query language for use in stream processing. This results in some tradeoffs and inefficiencies in the use of a language that was designed for a different purpose than stream processing.
Keyword-based publish-subscribe systems have also been used for many years to disseminate real-time news to users based on preset subscriptions. Recently, various academic projects have worked to extend the traditional publish-subscribe model in various ways, to give it more expressive power. For example, the Cayuga project combines publish-subscribe automata with a SQL style query language for stream processing on multiple streams (for further description of Cayuga, see e.g., Alan Demers et al. “Cayuga: A General Purpose Event Monitoring System” in Proceedings of the Third Biennial Conference on Innovative Data Systems Research, Asilomar, Calif. (January 2007), the disclosure of which is hereby incorporated by reference). In a similar manner, the SASE project combines database techniques and a rule-based approach to produce an event processing architecture (for further description of the SASE project, see e.g., Eugene Wu et al. “High-Performance Complex Event Processing over Streams”, in Proceedings of the ACM SIGMOD Conference, pages 407-418 (June 2006), the disclosure of which is hereby incorporated by reference).
Since the 1970s, general-purpose dataflow programming languages have been used for stream processing of various kinds. Moreover, some dataflow languages are currently being used for certain kinds of real-time stream processing applications. For example, the real-time dataflow language Lustre (the kernel language of the SCADE (formerly SAO+/SAGA) industrial environment) from Esterel Technologies can be used for critical control software in aircraft, helicopters, and nuclear power plants (for further description of Lustre, see e.g., N. Halbwachs et al. “The synchronous dataflow programming language Lustre”, in Proceedings of the IEEE, pages 1305-1320 (1991), the disclosure of which is hereby incorporated by reference). To achieve the strict real-time requirements of such applications, the design of Lustre programs can be quite complex, involving detailed analysis of clock synchronization amongst multiple streams.
Batch stream processing architectures such as Hancock have been developed to provide offline (non-real-time) processing of huge repositories of historical stream data, such as phone call records (for further description of Hancock see, e.g., Corinna Cortes et al. “Hancock: A Language for Analyzing Transactional Data Streams” in ACM Transactions on Programming Languages and Systems, Vol. 26, No. 2, pages 301-338 (March 2004), the disclosure of which is hereby incorporated by reference).
IBM's System S Research Project is developing a prototype aimed at providing the “middleware” required to coordinate a wide range of distributed stream processing applications. The System S research project aims to produce a stream processing framework that is general-purpose. System S assumes that there are many user-developed stream processing components in use across the Internet, and that the main goal of the System S Stream Processing Core is to provide middleware coordination software that can tie these numerous components together in useful ways. However, System S does not provide a system and method that can be used to build a scalable stream processing architecture.
GigaSpaces of New York, N.Y. offers a product allowing users to offload data from Microsoft Excel to the GigaSpaces in-memory data grid, in order to be able to handle larger amounts of data within Excel spreadsheets. It also provides a means of offloading parts of an Excel spreadsheet calculation to an external grid of servers, increasing the computation power available to perform standard spreadsheet calculations. While useful in terms of increasing the computation power and storage available for spreadsheet computations, these products do not offer any capability to perform real-time stream processing on data streams. The Excel spreadsheet model does not provide that capability, nor does the GigaSpaces parallel implementation of Excel.
Mashup editors, such as Yahoo Pipes, allow users to combine web data from various sources using sorting, filtering and translation modules, in a way that is similar to the way in which Unix pipes are used to connect primitive Unix library functions. Yahoo Pipes is a web application (available from Yahoo!, Inc. of Sunnyvale, Calif.) that provides a composition tool to aggregate, manipulate, and mashup content by letting users “pipe” information from different sources (e.g., web feeds, web pages, and other services) and set up rules for how that content should be manipulated (e.g., filtering). Like Unix pipes, Yahoo Pipes permits simple commands to be combined together to create output meeting user needs. However, mashup editors such as Yahoo Pipes are not well suited for stream processing, in that they do not handle continuous data streams. The Yahoo Pipes user interface is typical of most visual programming tools, in that much of the complexity of the interface stems from the need to show the often complex mesh of wires connecting the building blocks. This visual complexity presents a major obstacle to the scalability of the approach for complex applications. It would be preferable to have a solution that was “wire-less”, offering a powerful and highly scalable interface for complex stream processing applications.
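For illustration, the pipe-style composition of simple modules described above can be approximated in a few lines of Python. This is only a sketch of the general idea and does not reproduce the Yahoo Pipes interface or any of its modules. The item layout (dictionaries with "title" and "link" fields) and the names keep_if_title_contains, rename_field and pipe are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator

Item = Dict[str, str]
Stage = Callable[[Iterable[Item]], Iterator[Item]]


def keep_if_title_contains(word: str) -> Stage:
    """Filtering module: pass through items whose title mentions word."""
    def stage(items: Iterable[Item]) -> Iterator[Item]:
        return (item for item in items if word.lower() in item["title"].lower())
    return stage


def rename_field(old: str, new: str) -> Stage:
    """Translation module: copy each item, renaming one field."""
    def stage(items: Iterable[Item]) -> Iterator[Item]:
        for item in items:
            renamed = dict(item)
            renamed[new] = renamed.pop(old)
            yield renamed
    return stage


def pipe(*stages: Stage) -> Stage:
    """Connect stages left to right, feeding each one's output to the next."""
    def run(items: Iterable[Item]) -> Iterator[Item]:
        return reduce(lambda stream, stage: stage(stream), stages, iter(items))
    return run


feed = [
    {"title": "Stream processing news", "link": "a"},
    {"title": "Weather", "link": "b"},
]
mashup = pipe(keep_if_title_contains("stream"), rename_field("link", "url"))
result = list(mashup(feed))
```

Here the composition is expressed by listing the stages in order rather than by drawing connections between them. The sketch operates on a finite list, so it does not address the continuous-stream limitation noted above.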
Many modern software systems require mechanisms for the transformation and querying of XML streams. In recent years, a large number of software tools have been developed for the processing of XML streams, based on languages such as XPath, XSLT and XQuery. United States Patent Application 2004/0205082 (System and method for Querying XML streams, Fontoura, Marcus et al.) describes one such approach to XML processing. Open source software tools such as the SAXON XSLT processor (authored by Michael Kay and available from Saxonia.com of Reading, United Kingdom) provide extensive support for handling XML data. However, they do not provide a general-purpose solution for handling data that is not in XML format.
What is needed is a solution that is specifically designed for general-purpose stream processing. The solution should have the ability to handle all kinds of streaming data, including tables, text, feeds and graphs. Ideally, the solution should facilitate real-time processing applications requiring complex combinations of analysis, statistical monitoring, filtering, tracking and tracing, inference, deduction and pattern matching. The present invention provides a solution for these and other needs.