The ever-increasing volume of digital information available from on-line sources drives the need for information monitoring applications that help users track relevant changes and access information of interest in a timely manner. This growth is partly due to the emergence of a wide and growing range of network-connected hard and soft sensors. These on-line sources increasingly take the form of data streams, i.e., time-ordered series of events or readings. Examples include network statistics, financial tickers, and environmental sensor readings. Continuous Query (CQ) systems (see, e.g., L. Liu et al., “Continual Queries for Internet Scale Event-driven Information Delivery,” IEEE Transactions on Knowledge and Data Engineering, TKDE, July/August 1999), which evaluate standing queries over data streams, are used to integrate these sources into today's information processing tasks, and have led to the development of more comprehensive Data Stream Management Systems or DSMSs (see, e.g., N. Jain et al., “Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core,” Proceedings of ACM SIGMOD, 2006).
In the present application, we address a fundamental problem in stream-based CQ systems, namely adaptive source filtering and load shedding. With a large number of high-volume source streams and soaring data rates at peak times, stream processing systems demand adaptive load shedding. Otherwise, when data arrival rates exceed its processing capacity, the CQ server may drop data items at random, which could significantly degrade the quality of the query results.
To mitigate this problem, filtering can be applied to reduce the amount of data fed into a CQ server. In general, the high-level CQs installed in a stream processing system do not need to receive the entire source streams. Hence, a set of predicates can be defined in the data space to specify which portions of the source streams are needed by each CQ. Such predicates can be maintained by a query index to filter out the stream elements that are irrelevant (see, e.g., K.-L. Wu et al., “Query Indexing with Containment-encoded Intervals for Efficient Stream Processing,” Springer Knowledge and Information Systems, KAIS, January 2006). However, maintaining such query indexes at the CQ server does not effectively reduce the load, because the data items are still received from the source nodes and processed against the query index by the CQ server.
A common approach to addressing this problem in distributed data processing is to move filtering close to the data sources, i.e., source filtering. With source filtering, only the relevant data items matching the predicates are received and processed by the CQ server. However, source filtering poses a challenge. Source nodes are usually sensor nodes with limited computational power. Hence, shipping the complete query index to the source nodes is usually either not possible or undesirable for performance reasons. As a result, a compact summary representation of the query index is often needed to perform effective source filtering. Note that the term “query” typically refers to the predicates in the query index that define the portions of the source streams that are of interest, and “filter” typically refers to the summary structure shipped to the source nodes for filtering out the irrelevant or less useful data items.
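For illustration only, the relationship between the queries in a query index and a compact filter can be sketched in Python as follows. The sketch assumes that each query is a closed interval over a single numeric attribute and that a source node can hold only a fixed number of intervals. The names `summarize`, `passes`, and `max_intervals` are hypothetical and are not taken from the cited works. Overlapping queries are merged, and neighboring intervals are then joined across the smallest gaps until the filter is small enough. The resulting filter covers every query, so no relevant data item is dropped at the source, at the cost of forwarding some irrelevant ones.

```python
from typing import List, Tuple

# Assumption: a query is a closed range [low, high] over one stream attribute.
Interval = Tuple[float, float]


def summarize(queries: List[Interval], max_intervals: int) -> List[Interval]:
    """Build a filter of at most max_intervals ranges that covers all queries."""
    if max_intervals < 1:
        raise ValueError("max_intervals must be at least 1")

    # Step 1: merge overlapping or touching query intervals.
    merged: List[Interval] = []
    for low, high in sorted(queries):
        if merged and low <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))

    # Step 2: while the filter is too large, join the two neighbors
    # separated by the smallest gap (this admits some irrelevant items).
    while len(merged) > max_intervals:
        i = min(range(len(merged) - 1),
                key=lambda j: merged[j + 1][0] - merged[j][1])
        merged[i:i + 2] = [(merged[i][0], merged[i + 1][1])]
    return merged


def passes(source_filter: List[Interval], value: float) -> bool:
    """Source-side check: forward the item only if some range contains it."""
    return any(low <= value <= high for low, high in source_filter)


# Hypothetical usage: five queries summarized into a two-interval filter.
queries = [(0, 10), (5, 15), (40, 50), (52, 60), (90, 100)]
source_filter = summarize(queries, max_intervals=2)
forwarded = [v for v in (7, 51, 75, 95) if passes(source_filter, v)]
```

In this example the value 51 matches no query but is still forwarded, because the gap between two query intervals was absorbed into the filter. That loss of precision is the price of compactness.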
Yet source filtering by itself is not sufficient to prevent the CQ server from randomly dropping data items, even though it can reduce the server's load. This is because the amount of data matching the predicates defined by the source filters can still exceed the capacity of the CQ server. For instance, a CQ whose predicate is not very selective, and therefore matches a large portion of a source stream, can easily overwhelm the CQ server even with source filtering.
Hence, there is a need for intelligent source filters that perform adaptive load shedding on the data items that match the predicates.
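By way of a minimal sketch, and not as the method of the present application, a source filter that sheds load among matching data items might look like the following Python code. The sketch assumes a single interval predicate and assumes that the CQ server periodically reports its capacity and the observed rate of matching items, both in items per second. The class `SheddingFilter` and its members `keep_fraction`, `adapt`, and `admit` are hypothetical names introduced for illustration. A deterministic credit counter is used in place of random sampling so that the behavior is reproducible.

```python
class SheddingFilter:
    """Interval filter that forwards only a fraction of matching items."""

    def __init__(self, low: float, high: float) -> None:
        self.low = low
        self.high = high
        self.keep_fraction = 1.0  # forward every matching item by default
        self._credit = 0.0        # accumulated forwarding credit

    def adapt(self, matched_rate: float, server_capacity: float) -> None:
        """Rescale the kept fraction so forwarded traffic fits the capacity."""
        if matched_rate <= 0:
            self.keep_fraction = 1.0
        else:
            self.keep_fraction = min(1.0, server_capacity / matched_rate)

    def admit(self, value: float) -> bool:
        """Return True if the item should be sent to the CQ server."""
        if not (self.low <= value <= self.high):
            return False  # irrelevant item: ordinary source filtering
        # Matching item: forward one item per unit of accumulated credit.
        self._credit += self.keep_fraction
        if self._credit >= 1.0:
            self._credit -= 1.0
            return True
        return False      # matching item shed at the source


# Hypothetical usage: 400 matching items/s against a capacity of 100 items/s.
source_filter = SheddingFilter(0, 100)
source_filter.adapt(matched_rate=400, server_capacity=100)
sent = sum(source_filter.admit(50) for _ in range(8))
```

With these assumed rates, the filter forwards one matching item in four, so the shedding decision is made deliberately at the source instead of being left to random drops at an overloaded server.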