It is well known in the art to process queries over continuous streams of data using one or more computers that may be called a data stream management system (DSMS). Such a system may also be called an event processing system (EPS) or a continuous query (CQ) system, although in the following description of the current patent application, the term “data stream management system” or its abbreviation “DSMS” is used. A DSMS typically receives from a user a textual representation of a query (called a “continuous query”) that is to be applied to a stream of data. Data in the stream changes over time, in contrast to static data that is typically found stored in a database. Examples of data streams are: real time stock quotes, real time traffic monitoring on highways, and real time packet monitoring on a computer network such as the Internet.
FIG. 1A illustrates a prior art DSMS built at Stanford University, in which data streams from network monitoring can be processed to detect intrusions and generate online performance metrics, in response to queries (called “continuous queries”) on the data streams. Note that in such a data stream management system (DSMS), each stream can be infinitely long and data can keep arriving indefinitely; hence the amount of data is too large to be persisted by a database management system (DBMS) into a database.
As shown in FIG. 1B a prior art DSMS may include a continuous query compiler that receives a continuous query and builds a physical plan which consists of a tree of natively supported operators. Any number of such physical plans (one plan per query) may be combined together, before DSMS starts normal operation, into a global plan that is to be executed. When the DSMS starts execution, the global plan is used by a query execution engine (also called “runtime engine”) to identify data from one or more incoming stream(s) that matches a query and based on such identified data the engine generates output data, in a streaming fashion.
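The arrangement just described, in which one physical plan per query is a tree of operators and the plans are combined into a global plan that a runtime engine applies to arriving data, can be sketched as follows. This is a minimal illustration only and is not taken from any of the cited systems: the operator classes Source, Select and Project, the function run, and the dictionary used as the global plan are all hypothetical names chosen for the example.

```python
class Source:
    """Leaf operator (hypothetical): stands for one named input stream."""
    def __init__(self, name):
        self.name = name

    def process(self, stream_name, tup):
        # Pass the tuple upward only if it arrived on this operator's stream.
        return [tup] if stream_name == self.name else []


class Select:
    """Keeps the tuples of its child that satisfy a predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def process(self, stream_name, tup):
        return [t for t in self.child.process(stream_name, tup) if self.predicate(t)]


class Project:
    """Keeps only the listed column positions of each tuple from its child."""
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def process(self, stream_name, tup):
        return [tuple(t[c] for c in self.columns)
                for t in self.child.process(stream_name, tup)]


def run(global_plan, arrivals):
    """Push each arriving (stream name, tuple) through every plan root.

    global_plan maps a query name to the root operator of its plan; in this
    sketch it is assumed to be fixed before run() is called.
    """
    for stream_name, tup in arrivals:
        for query_name, root in global_plan.items():
            for out in root.process(stream_name, tup):
                yield query_name, out


global_plan = {
    "Q1": Project(Select(Source("S1"), lambda t: t[0] > 10), [1]),
    "Q2": Select(Source("S2"), lambda t: True),
}
arrivals = [("S1", (5, "a")), ("S1", (20, "b")), ("S2", (1, "c"))]
outputs = list(run(global_plan, arrivals))
```

Because run is a generator, output is produced one element at a time as input arrives, which mirrors the streaming fashion described above. An operator with two children, such as a join, would be needed to make a plan a proper tree rather than a chain.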
As noted above, one such system was built at Stanford University, in a project called the Stanford Stream Data Management (STREAM) Project which is documented at the URL obtained by replacing the ? character with “/” and the % character with “.” in the following: http:??www-db%stanford%edu?stream. For an overview description of such a system, see the article entitled “STREAM: The Stanford Data Stream Management System” by Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom which is to appear in a book on data stream management edited by Garofalakis, Gehrke, and Rastogi. The just-described article is available at the URL obtained by making the above described changes to the following string: http:??dbpubs%stanford%edu?pub?2004-20. This article is incorporated by reference herein in its entirety as background.
For more information on other such systems, see the following articles each of which is incorporated by reference herein in its entirety as background: [a] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, M. Shah, “TelegraphCQ: Continuous Dataflow Processing for an Uncertain World”, Proceedings of CIDR 2003; [b] J. Chen, D. DeWitt, F. Tian, Y. Wang, “NiagaraCQ: A Scalable Continuous Query System for Internet Databases”, PROCEEDINGS OF 2000 ACM SIGMOD, pages 379-390; and [c] D. B. Terry, D. Goldberg, D. Nichols, B. Oki, “Continuous queries over append-only databases”, PROCEEDINGS OF 1992 ACM SIGMOD, pages 321-330.
Continuous queries (also called “persistent” queries) are typically registered in a data stream management system (DSMS) prior to its operation on data streams. The continuous queries are typically expressed in a declarative language that can be parsed by the DSMS. One such language, called “continuous query language” or CQL, has been developed at Stanford University primarily based on the database query language SQL, by adding support for real-time features, e.g. adding data stream S as a new data type based on a series of (possibly infinite) time-stamped tuples, i.e. pairs each consisting of a tuple s and a time stamp t. Each tuple s belongs to a common schema for the entire data stream S, and the time t increases monotonically in the sense that it never decreases from one pair to the next. Note that such a data stream can contain 0, 1 or more pairs each having the same (i.e. common) time stamp.
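The stream data type just described can be illustrated with a short sketch. It is an assumption-laden illustration and not CQL's own definition: the names Element and checked_stream are hypothetical, the common schema is reduced to a check on the number of fields, and time stamps are taken to be integers.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Element:
    """One stream element: a tuple s paired with its time stamp t."""
    s: Tuple[Any, ...]
    t: int


def checked_stream(elements: Iterable[Element], arity: int) -> Iterator[Element]:
    """Yield elements lazily, so the input may be unbounded.

    Enforces the two properties stated in the text: every tuple fits one
    common schema (approximated here by its arity), and time stamps never
    decrease. Equal time stamps on consecutive elements are allowed.
    """
    last_t = None
    for e in elements:
        if len(e.s) != arity:
            raise ValueError(f"tuple {e.s!r} does not match the stream schema")
        if last_t is not None and e.t < last_t:
            raise ValueError(f"time stamp {e.t} is earlier than {last_t}")
        last_t = e.t
        yield e


# Two elements share time stamp 1, which is permitted.
elements = [Element((1, "a"), 1), Element((2, "b"), 1), Element((3, "c"), 2)]
accepted = list(checked_stream(elements, arity=2))
```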
Stanford's CQL supports windows on streams (derived from SQL-99) based on another new data type called “relation”, defined as follows. A relation R is an unordered group of tuples at any time instant t which is denoted as R(t). The CQL relation differs from a relation of a standard relational database accessed using SQL, because traditional SQL's relation is simply a set (or bag) of tuples with no notion of time, whereas the CQL relation (or simply “relation”) is a time-varying group of tuples (e.g. the current number of vehicles in a given stretch of a particular highway). All stream-to-relation operators in Stanford's CQL are based on the concept of a sliding window over a stream: a window that at any point of time contains a historical snapshot of a finite portion of the stream. Syntactically, sliding window operators are specified in CQL using a window specification language, based on SQL-99.
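The notion of a relation R(t) produced by a sliding window over a stream can be sketched as follows. The sketch is illustrative only: the function names time_window and tuple_window are hypothetical, the stream is modeled as a finite list of (tuple, time stamp) pairs in arrival order, the time-based window is assumed to include both of its boundaries, and ties among equal time stamps in the tuple-based window are broken by arrival order, which is a simplification.

```python
from collections import Counter


def time_window(stream, t, range_):
    """R(t) for a time-based window: the bag of tuples whose time stamp
    lies between max(t - range_, 0) and t, both ends included (assumed)."""
    low = max(t - range_, 0)
    return Counter(s for s, ts in stream if low <= ts <= t)


def tuple_window(stream, t, n):
    """R(t) for a tuple-based window: the bag of the n most recent tuples
    among those with time stamp <= t."""
    seen = [s for s, ts in stream if ts <= t]
    recent = seen[-n:] if n > 0 else []
    return Counter(recent)


# Vehicles entering a stretch of highway, as (tuple, time stamp) pairs.
stream = [(("car1",), 1), (("car2",), 2), (("car3",), 5)]

r_at_5 = time_window(stream, t=5, range_=3)   # time stamps 2..5
r_at_2 = tuple_window(stream, t=2, n=1)       # last tuple seen by time 2
```

A Counter is used so that the result is a bag rather than a set, and so that it carries no ordering, in keeping with the description of a relation as an unordered group of tuples at a time instant.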
For more information on Stanford University's CQL, see a paper by A. Arasu, S. Babu, and J. Widom entitled “The CQL Continuous Query Language: Semantic Foundation and Query Execution”, published as Technical Report 2003-67 by Stanford University, 2003 (also published in VLDB Journal, Volume 15, Issue 2, June 2006, at Pages 121-142). See also, another paper by A. Arasu, S. Babu, J. Widom, entitled “An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations” in 9th Intl Workshop on Database programming languages, pages 1-11, September 2003. The two papers described in this paragraph are incorporated by reference herein in their entirety as background.
An example to illustrate continuous queries is shown in FIGS. 1C-1E which are reproduced from the VLDB Journal paper described in the previous paragraph. Specifically, FIG. 1E illustrates a merged STREAM query plan for two continuous queries, Q1 and Q2, over input streams S1 and S2. Query Q1 of FIG. 1E is shown in detail in FIG. 1C expressed in CQL as a windowed-aggregate query: it maintains the maximum value of S1.A for each distinct value of S1.B over a 50,000-tuple sliding window on stream S1. Query Q2 shown in FIG. 1D is expressed in CQL and used to stream the result of a sliding-window join over streams S1 and S2. The window on S1 is a tuple-based window containing the last 40,000 tuples, while the window on S2 is a 10-minute time-based window.
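The behavior of a windowed-aggregate query such as Q1 can be sketched as follows. This is a naive illustration under stated assumptions and is not the STREAM query plan of FIG. 1E: the class name WindowedMax is hypothetical, the window size is a parameter (3 in the example rather than 50,000), and the aggregate is recomputed from the whole window on request instead of being maintained incrementally.

```python
from collections import deque


class WindowedMax:
    """Maximum of attribute A per distinct value of attribute B over a
    tuple-based sliding window holding the most recent `size` tuples."""

    def __init__(self, size):
        self.size = size
        self.window = deque()

    def insert(self, a, b):
        # A new tuple enters the window; the oldest leaves once it is full.
        self.window.append((a, b))
        if len(self.window) > self.size:
            self.window.popleft()

    def result(self):
        # Recompute the per-group maximum over the current window contents.
        best = {}
        for a, b in self.window:
            if b not in best or a > best[b]:
                best[b] = a
        return best


q1 = WindowedMax(size=3)
for a, b in [(5, "x"), (7, "y"), (9, "x"), (1, "x")]:
    q1.insert(a, b)
snapshot = q1.result()
```

After the four insertions the window holds (7, "y"), (9, "x") and (1, "x"), so the tuple (5, "x") no longer contributes to the result. When the tuple carrying a group's maximum slides out, the group's maximum can drop, which is why the aggregate cannot be kept as a single running value per group.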
Several prior art DSMS, such as Stanford University's DSMS, treat queries as fixed entities and treat event data as an unbounded collection of data elements. This approach delivers results as they are computed, in near real time. However, in most continuous query systems this prior art approach does not allow continuous queries to be added dynamically. One reason is that a query plan is computed at the time of registration of all queries, before such a prior art DSMS even begins operations on streams of event data.
Once queries have been registered and such a prior art DSMS begins to process event data, the query plan cannot be changed, in prior art systems known to the current inventors. The current inventors recognize that queries can be added, for example by quiescing Stanford University's DSMS, adding the required queries and starting up the system again. However, the current inventors note that doing so gives rise to indeterminate scenarios, e.g. if a DSMS is being quiesced, there is no defined checkpoint for data in a window for incomplete calls or for data of intermediate computation that has already been performed at the time the DSMS is quiesced.
In one prior art DSMS, even after it begins normal operation by executing a continuous query Q1, it is possible for a human (e.g. network operator) to register an “ad-hoc continuous query” Q2, for example to check on congestion in a network, as described in an article by Shivnath Babu and Jennifer Widom entitled “Continuous Queries over Data Streams” published in SIGMOD Record, September 2001. The just-described paper is incorporated by reference herein in its entirety as background. Such a query Q2 may be written to find a fraction of traffic on a backbone link that is coming from a customer network.
In highly-dynamic environments, a data stream management system (DSMS) is likely to see a constantly changing collection of queries and needs to react quickly to query changes without adversely affecting the processing of incoming time-stamped tuples (e.g. streams). A solution to this problem is proposed in a PhD thesis entitled “Query Processing for Large-Scale XML Message Brokering” by Yanlei Diao published in Fall 2005 by University of California Berkeley, which thesis is hereby incorporated by reference herein in its entirety as background.
The just-described thesis describes a system called YFilter, implemented as a Nondeterministic Finite Automaton (NFA), which allows incremental maintenance of a DSMS upon query updates. The thesis states that it is important to note that, because of the NFA construction, the system uses an incremental process in which new queries can easily be added to an existing DSMS, and that this ease of maintenance is a key benefit of the NFA-based approach.
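The reason an NFA lends itself to incremental addition of queries can be sketched as follows. This is a simplified illustration and not the YFilter implementation: the class PathNFA and its methods are hypothetical names, and only child steps and the “*” wildcard of a path expression are supported. Adding a query follows existing transitions as far as they match and creates states only for the remainder, so states already in use by other queries are left untouched.

```python
class PathNFA:
    """Shared NFA over simple path queries such as "/a/b" or "/a/*"."""

    def __init__(self):
        self.transitions = [{}]  # state id -> {step label: next state id}
        self.accepting = {}      # state id -> set of query ids accepted there

    def add_query(self, query_id, path):
        """Add one query without rebuilding or altering existing states."""
        state = 0
        for step in path.strip("/").split("/"):
            nxt = self.transitions[state].get(step)
            if nxt is None:
                # No shared prefix beyond this point: create a new state.
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][step] = nxt
            state = nxt
        self.accepting.setdefault(state, set()).add(query_id)

    def match(self, element_path):
        """Return the ids of all queries matched by a root-to-node path."""
        active = {0}
        for name in element_path:
            following = set()
            for s in active:
                # Nondeterminism: follow the exact label and the wildcard.
                for label in (name, "*"):
                    if label in self.transitions[s]:
                        following.add(self.transitions[s][label])
            active = following
        matched = set()
        for s in active:
            matched |= self.accepting.get(s, set())
        return matched


nfa = PathNFA()
nfa.add_query("q1", "/a/b")
nfa.add_query("q2", "/a/*")
states_before = len(nfa.transitions)

# A query added later shares the "/a/b" prefix and adds a single state.
nfa.add_query("q3", "/a/b/c")
states_after = len(nfa.transitions)
```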
Another article entitled “Query Processing for High-Volume XML Message Brokering” by Yanlei Diao et al. describes a path matching engine, as an alternative to extending a tree pattern matching approach to support shared processing. This article cites a system called MatchMaker described by L. V. S. Lakshmanan, P. Sailaja in an article entitled “On efficient matching of streaming XML documents and queries” published in EDBT in March 2002. Yanlei Diao et al.'s article also cites an article by C. Chan, P. Felber, et al. entitled “Efficient filtering of XML documents with XPath expressions” published in ICDE in Feb. 2002. The reader is requested to review both these articles, each of which is incorporated by reference herein in its entirety, as background.
Another prior art system is described in a paper entitled “ARGUS: Efficient Scalable Continuous Query Optimization for Large-Volume Data Streams” by Chun Jin and Jaime Carbonell, published at 10th International Database Engineering and Applications Symposium (IDEAS'06), pp. 256-262, which is hereby incorporated by reference herein in its entirety as background. ARGUS is a stream processing system that supports incremental operator evaluations and incremental multi-query plan optimization as new queries arrive. The latter is done to a degree well beyond the previous state-of-the-art via a suite of techniques such as query-algebra canonicalization, indexing, and searching, and topological query network optimization.
ARGUS comprises two components, a Query Network Generator and an Execution Engine. Upon receiving a request to register a new continuous query Q, the Query Network Generator parses Q, searches for and chooses the sharable computations between Q and the existing query network, constructs a shared optimal query evaluation plan, expands the query network to instantiate the plan, records the network changes in the system catalog, and sends the updated execution code of the query network to the engine. The Execution Engine then runs the execution code, and produces new results if newly arrived stream tuples match the queries. Note that ARGUS requires a database management system at the back end to execute queries.
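The registration steps just described (search the existing network for sharable computation, expand the network only where needed, and record the change in a catalog) can be sketched in a much reduced form. This sketch is not the ARGUS algorithm: the class QueryNetwork is a hypothetical name, each query is assumed to be a conjunction of predicate strings over one stream, and the only canonicalization performed is to ignore the order of the conjuncts.

```python
class QueryNetwork:
    """Toy query network in which equivalent selections share one node."""

    def __init__(self):
        self.nodes = {}    # canonical key -> node id
        self.catalog = []  # ordered record of network changes

    @staticmethod
    def canonical(stream, conjuncts):
        # Order-insensitive form, so "p AND q" and "q AND p" share a node.
        return (stream, frozenset(conjuncts))

    def register(self, query_name, stream, conjuncts):
        """Register a query and return the id of the node that serves it."""
        key = self.canonical(stream, conjuncts)
        if key in self.nodes:
            # Sharable computation found: reuse the existing node.
            node = self.nodes[key]
            self.catalog.append((query_name, "reuse", node))
        else:
            # Nothing to share: expand the network with a new node.
            node = len(self.nodes)
            self.nodes[key] = node
            self.catalog.append((query_name, "create", node))
        return node


net = QueryNetwork()
n1 = net.register("Q1", "S1", ["a > 1", "b < 2"])
n2 = net.register("Q2", "S1", ["b < 2", "a > 1"])  # same predicate, reordered
n3 = net.register("Q3", "S1", ["a > 1"])
```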
An article published in the Journal of Universal Computer Science, vol. 12, no. 9 (2006), 1165-1176, on Sep. 28, 2006 entitled “Extension of CQL over Dynamic Databases” by Antal Buza is incorporated by reference herein in its entirety as background. This article describes an extension of CQL to support a query that is explicitly made sensitive to an update of a relation. More specifically, according to this article, a new keyword RETROACTIVE indicates that the continuous query is to be virtually re-started when a relation is updated (the ‘virtual re-start’ means that the system re-reads all relations and processes the stream from the current time onward). In practice, when there is sufficient memory for the storage of the relation, the query does not read this relation continuously or repeatedly, but reads it immediately after the most recent update of the relation.