1. Field of the Invention
The present invention pertains to query processors that continuously receive data from one or more publishers of data streams and continuously push query results, as data streams, to one or more subscribers.
2. Description of the Related Art
There are many real-time information systems today that receive data continuously from multiple information sources. Examples include financial systems tracking traded instrument transactions, geographically distributed weather information systems, industrial systems collecting information from distributed devices, public train systems, credit card systems, real-time military information systems, as well as many others. Each of these systems not only receives data continuously from multiple sources, but also processes the data in order to detect specific conditions and/or merge, sort, transform, and derive meaningful results.
One aspect of continuous processing of multiple heterogeneous data streams that has received little attention and standardization in the real-time information industry is a general-purpose, organized approach to that processing. In the last few years, the Stanford STREAM project has researched and published many papers describing a Continuous Query Language (CQL) that can be used for general-purpose query processing on data streams. Berkeley's TelegraphCQ project has also researched and published papers describing a continuous query system that employs modifications to the open-source PostgreSQL database, which is conventionally known. As such, TelegraphCQ utilizes tables in PostgreSQL to perform the continuous queries. We would expect reduced performance from this approach, since PostgreSQL's tables are not meant for continuous processing and because tables naturally reside on disk. Brown University's Aurora Data Stream Management System project processes query results immediately as each new data stream element arrives, but it does not use SQL-like queries to process information, and it lacks publish and subscribe sources of data stream information. The Aurora implementation also does not provide support for different incoming and outgoing formats and representations. Celequest's Streaming DataFlow Engine is based on Stanford's STREAM research. Stanford's STREAM Data Stream Management System, however, employs an implementation that does not process data streams directly. It also lacks the ability to accept and generate data streams of information employing different formats and representations.
Data stream data in STREAM is processed according to a description in CQL that uses three types of operators: Stream-to-Relation, Relation-to-Relation, and Relation-to-Stream. All three operator types must be employed in a CQL expression in order to produce a resulting output stream. CQL uses SQL constructs to express Relation-to-Relation operators, and most of the data manipulation in CQL is expressed through these constructs. Stream-to-Relation operators in CQL are primarily based on sliding windows. This allows CQL to consider recent data from a data stream (as determined by a sliding window based on time or on amount of data) to be in a “Relation”, after which the Relation-to-Relation operators may be applied. One way of thinking about the operational mechanics of this processing is to consider all of the data in the sliding window to be in a virtual table representing the Relation. Finally, an output stream may be produced via a Relation-to-Stream operator, which may be one of IStream, DStream, or RStream, representing “insert stream”, “delete stream”, and “relation stream” respectively. These operators can be understood in terms of the virtual table created by the Stream-to-Relation operator. The IStream represents the data that gets inserted into the virtual table. The DStream represents the data that gets removed from the virtual table due to the limited sliding window size, while the RStream represents the data in the virtual table. Note that this implies that an RStream is a Stream in which each datum is a Relation, and each Relation, in turn, contains one or more data. Each query in the STREAM Data Stream Management System therefore has the form:
    Select [Relation-to-Stream Operator] [Stream-to-Relation Operator] [Relation-to-Relation Operator]
As in the example:
    Select IStream(*) From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A=S2.A And S1.A>10
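The interplay of the three operator types described above can be illustrated with a minimal Python sketch. It is not the STREAM implementation: the function names (rows_window, where, to_streams, run_query) are hypothetical, only a single stream with a row-based sliding window is modeled (no time-based window and no join), and the insert and delete streams are assumed to be the multiset differences between consecutive snapshots of the virtual table.

```python
from collections import Counter, deque


def rows_window(stream, size):
    """Stream-to-Relation: after each arrival, yield the last `size` items."""
    window = deque(maxlen=size)  # the oldest item is dropped automatically
    for item in stream:
        window.append(item)
        yield list(window)  # snapshot of the virtual table


def where(relations, predicate):
    """Relation-to-Relation: filter every snapshot of the virtual table."""
    for relation in relations:
        yield [row for row in relation if predicate(row)]


def to_streams(relations):
    """Relation-to-Stream: yield (istream, dstream, rstream) per snapshot."""
    previous = Counter()
    for relation in relations:
        current = Counter(relation)
        # Assumption: inserts/deletes are bag differences between snapshots.
        inserted = sorted((current - previous).elements())
        deleted = sorted((previous - current).elements())
        yield inserted, deleted, list(relation)
        previous = current


def run_query(stream, size, predicate):
    """Compose the three operator types over a finite sample stream."""
    return list(to_streams(where(rows_window(stream, size), predicate)))


if __name__ == "__main__":
    sample = [5, 12, 20, 7, 15]  # hypothetical values of attribute A
    for step in run_query(sample, size=2, predicate=lambda a: a > 10):
        print(step)
```

With the sample stream 5, 12, 20, 7, 15, a two-row window, and the condition A>10, the fourth arrival produces an empty insert stream, a delete stream containing 12, and a relation containing only 20. Even this small sketch shows that the query author has to reason about window snapshots rather than about the stream itself.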
While the STREAM Data Stream Management System does offer general-purpose data stream manipulation, its approach incorporates an overly complex object model. Constructing a CQL query requires one to reason about all three operator types. Because operators are not applied to pure data streams, query development is made unnecessarily difficult by the conceptual sophistication required. Supporting three Relation-to-Stream operators provides rich functionality, but the practical use of the DStream and RStream operators is limited.
The Stanford research does not address how to integrate publish-subscribe semantics in order to support publishers of data streams and subscribers to data streams. Publish-subscribe is an important information flow paradigm for continuously updating data sources and targets in enterprise real-time information processing systems, and support for it is therefore needed.
A drawback of the lack of a general-purpose and organized approach for processing multiple data streams continuously is the development cost of each multiple data stream processor. Each time a new set of behaviors and goals must be implemented, a new continuous processor must be constructed. These distinct continuous processors tend to perform much of the same core processing, comprising receiving data, filtering data, joining data, transforming data, detecting data, and computing derived data. However, the code developed for each continuous processor is different and requires time and resources to design, develop, and quality-assure. These software development, maintenance, and enhancement costs are high and unnecessary when compared to those of a general-purpose continuous processor that can be reused across different continuous processing systems. Furthermore, many continuous processing systems merge and process data in ways that can be shown to be equivalent to query-like manipulations. While the Stanford STREAM system provides general-purpose processing, it maintains an overly complex object model and lacks support for the key publish-subscribe paradigm. This results in software development and engineering costs that are higher than they need to be.
Since the Celequest Streaming DataFlow Engine is based on Stanford's STREAM system, it shares the same software engineering cost issues. Celequest's system appears to treat all streamed data as events and is stated to employ XML data between its internal subsystems. While XML has known software engineering advantages, its use for core processing in high-performance systems is generally discouraged.
In light of the above, there is a need for methods and apparatus for a general-purpose continuous query processor with a simplified object model and support for publish-subscribe data stream sources and targets.