1. Technical Field
The present invention relates to systems and methods for processing data streams, and more particularly to mechanisms that enable the dynamic and automatic composition of applications from processing elements (PEs) operating on multiple continuous streams of data. The dynamic composition may comprise two tasks: (i) determining whether a PE already running in the system can be reused for a new application, and (ii) determining streaming connections based on a novel way of specifying streams and flows for PE ports. This specification allows the application writer to tap into streams that are being produced by other applications or that will become available in the future. In illustrative examples, these applications are built to satisfy inquiries for information submitted by data analysts.
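The two composition tasks described above may be illustrated by the following minimal sketch. All names here (PEDescriptor, find_reusable_pe, streaming_connections) are hypothetical illustrations, not the claimed implementation: a PE descriptor records the operator a PE applies and the streams it consumes and produces; a running PE is reused when a new request names the same operator over the same input stream, and connections are determined by matching output streams to input streams.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PEDescriptor:
    operator: str       # processing applied, e.g. "filter:ticker==XYZ"
    input_stream: str   # stream the PE consumes ("" for a source PE)
    output_stream: str  # stream the PE produces

def find_reusable_pe(running_pes, requested):
    """Task (i): a running PE can be reused when it applies the same
    operator to the same input stream as the new request."""
    for pe in running_pes:
        if (pe.operator == requested.operator
                and pe.input_stream == requested.input_stream):
            return pe
    return None

def streaming_connections(pes):
    """Task (ii): connect each PE's input port to every PE whose
    output stream matches it, including PEs deployed by other
    applications."""
    return [(producer, consumer)
            for producer in pes
            for consumer in pes
            if producer.output_stream == consumer.input_stream]

# A request matching an already-running PE is reused, not redeployed.
running = [PEDescriptor("filter:ticker==XYZ", "trades", "xyz_trades")]
request = PEDescriptor("filter:ticker==XYZ", "trades", "xyz_out2")
reused = find_reusable_pe(running, request)

# Connections follow from matching stream names on PE ports.
pes = [
    PEDescriptor("source:trades", "", "trades"),
    PEDescriptor("filter:ticker==XYZ", "trades", "xyz_trades"),
]
conns = streaming_connections(pes)
```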
2. Description of the Related Art
Data is increasingly being generated digitally by sources such as sensors, satellites, audio and video channels, and stock feeds. This data arrives as streams, which are continuous and dynamic in nature. There is a growing need to extract information from these streams on a continuous basis to look for abnormal activity and other interesting phenomena.
Traditional information processing techniques, however, are static in nature in two respects. First, in many cases, data from sources is stored and analyzed periodically. This store-and-analyze technique is not suitable for continuous monitoring or for obtaining real-time results, because in many cases it is not possible to store all incoming data, and the cost of reprocessing old data can hinder application performance considerably.
Second, applications are static in terms of the computation or processing applied to the data. In other words, the computation or analysis does not adapt to new and additional stream data sources being incorporated into the system.
Recently, there have been advances in the area of stream processing. Applications are being developed that track and analyze data from numerous streams, monitor them for signs of abnormal activity, identify new trends and patterns, and process them for purposes of filtering, aggregation, reduction, and correlation. These can be viewed as stream-oriented operators.
A stream-processing system is a network of streams and stream-oriented operators that service a set of continuous inquiries for information. These operators can perform standard filtering and mapping operations, as well as more advanced information-mining operations, on various data types such as text, audio, and video, and extract information to answer inquiries about relationships and correlations present in the data.
These systems have important shortcomings in terms of providing a systematic specification of streams and data flows, as well as methods for the dynamic composition of a stream processing graph. For example, the publish-subscribe (pub-sub) paradigm can be used for stream processing. In conventional pub-sub systems, subscriptions are specified either in terms of logical expressions over attributes and their associated values/ranges, identifying the messages that the subscriber requires, or in terms of logical names (also known as topics or channels) assigned to a stream. Publishers publish objects with attributes, and the pub-sub system matches these objects against the subscriptions and routes them to the interested subscribers. This enables the construction of one or more applications, each comprising a network of publishers and subscribers with flows among them.
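The conventional attribute-based pub-sub matching described above can be sketched as follows. This is an illustrative sketch only (the broker and subscription names are hypothetical): subscriptions are predicates over attribute values, published objects carry attribute sets, and the broker routes each object to every subscriber whose predicate it satisfies.

```python
class PubSubBroker:
    """Minimal attribute-based publish-subscribe matcher."""

    def __init__(self):
        # Each subscription pairs a logical expression over attributes
        # (here, a predicate function) with the subscriber's callback.
        self.subscriptions = []

    def subscribe(self, predicate, callback):
        self.subscriptions.append((predicate, callback))

    def publish(self, obj):
        # Match the published object's attributes against every
        # subscription and route it to the interested subscribers.
        for predicate, callback in self.subscriptions:
            if predicate(obj):
                callback(obj)

# Usage: a subscriber interested in trades of ticker "XYZ" above 100.
received = []
broker = PubSubBroker()
broker.subscribe(
    lambda o: o.get("ticker") == "XYZ" and o.get("price", 0) > 100,
    received.append,
)
broker.publish({"ticker": "XYZ", "price": 105.0})
broker.publish({"ticker": "ABC", "price": 300.0})
# Only the first object satisfies the subscription's expression.
```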
This does not, however, address the requirement of applications that are to be incrementally composed and/or dynamically reassembled from data-processing building blocks in response to changes in the data flowing into the system or new inquiries from analysts over time. Topics and channels are not functionally described to the degree that permits the dynamic rearrangement of stages in a computation pipeline, which is needed in the stream-processing context where reacting to changes in the data stream is important.
In other stream processing systems, a graph may be composed per submitted inquiry without consideration of other inquiries currently running. Also, the streaming connections are determined when the query is submitted rather than at run time. Finally, the stream processing graph is directed and acyclic, implying that there is no provision for controlled feedback stream connections.
Likewise, StreamIt™, a programming language and compilation infrastructure specifically engineered to facilitate the programming of large streaming applications as well as their efficient mapping to a wide variety of target architectures, has an organization of processing operators that is hierarchical and left to the programmer to specify statically.
In stream processing systems, data streams are processed by a pipeline of operators (which may or may not have feedback flows affecting earlier stages of the computation). If systems that support stream descriptions only in terms of stream/topic names and attributes were used for automated, incremental composition, the burden would fall on all application writers to agree to append attributes (from a known set) describing the operations performed on the stream. Without a means to enforce this declaration, content routing becomes ambiguous, and the dynamic and transparent composition of operators cannot be achieved.
Another body of work related to application composition across multiple queries is the field of multi-query optimization techniques used in databases. Queries are represented using relational algebra, and when multiple queries are optimized together, the results of one or more queries may be reused to obtain the results of others. Because these techniques deal with persistent and static data, they are not adequate for describing flows of streams in a stream-processing context in which individual stages of a computation pipeline specify their inputs and outputs in terms of attributes and operators.
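The result-reuse idea behind multi-query optimization can be illustrated by the following minimal sketch over a static relation (the relation and its contents are hypothetical): two queries share a common selection, which is evaluated once and its result reused by both, in contrast to the continuous, per-tuple flows of a stream pipeline.

```python
# A small, persistent (static) relation, as assumed by database
# multi-query optimization techniques.
employees = [
    {"name": "a", "dept": "eng", "salary": 120},
    {"name": "b", "dept": "eng", "salary": 90},
    {"name": "c", "dept": "ops", "salary": 100},
]

# Common subexpression shared by both queries: the selection
# sigma_{dept = 'eng'}(employees), evaluated only once.
eng = [row for row in employees if row["dept"] == "eng"]

# Query 1 reuses the shared result: engineers earning over 100.
q1 = [row for row in eng if row["salary"] > 100]

# Query 2 reuses the same shared result: names of all engineers.
q2 = [row["name"] for row in eng]
```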