Traditional data management applications (e.g., managing sales records, transactions, inventory, and the like) typically require database support for a variety of one-shot queries, in which data is processed once in response to a posed query. As such, traditional database systems are typically optimized for one-shot query performance. In recent years, however, the emergence of a new class of large-scale event monitoring applications has introduced numerous data management challenges. As opposed to one-shot queries, these large-scale monitoring applications require continuous tracking of complex aggregates and data distribution summaries over collections of physically distributed data streams.
For example, a network operations center (NOC) for an Internet Protocol backbone network of a large Internet Service Provider typically monitors hundreds of routers, thousands of links and interfaces, and various events at different layers of the network infrastructure (e.g., fiber-cable utilization, packet forwarding at routers, and the like). As such, the NOC must continuously track patterns of usage levels in order to detect and react to traffic floods, link failures, network intrusions, network attacks, and the like. Unfortunately, existing solutions for continuous monitoring in distributed environments tend to be space-inefficient (at each remote monitoring site), communication-inefficient (across the underlying communication network), and unable to provide continuous, guaranteed-quality estimates.
Recent work on data stream processing has focused on developing space-efficient, one-pass algorithms for performing various centralized, one-shot computations on massive data streams. Unfortunately, because such methods operate in a centralized, one-shot setting, they do not optimize communication efficiency. Furthermore, although some recent solutions optimize site communication costs for approximating different queries in a distributed setting, they assume that the computation is triggered either periodically or in response to a one-shot request. As such, existing techniques are not applicable to a continuous monitoring environment, in which the goal is to continuously provide guaranteed-quality estimates over a distributed collection of data streams.