In distributed data streaming applications, rapid update streams originating at tens or hundreds of remote sites are continuously transmitted to a central processing system for online querying and analysis. Examples include monitoring of service provider network traffic statistics, telecommunication call detail records, Web usage logs, financial stock tickers, retail chain transactions, weather data, and sensor data.
An important consideration in the above-mentioned monitoring applications is the communication overhead imposed by the distributed query processing architecture on the underlying network. Specifically, transmitting every update stream to a central site for processing can lead to inordinate amounts of message traffic, and thus have a crippling effect on the communication infrastructure as well as the central site processor.
For many distributed stream-oriented applications, exact answers are not required and approximations with guarantees on the amount of error suffice. The tradeoff between answer accuracy and communication overhead for specific classes of continuous queries over distributed update streams has been studied recently.
One approach considers aggregation queries that compute sums and averages of dynamically changing numeric values spread over multiple sources. Each site is assigned an interval of a certain width such that the sum of the site interval widths is less than the application's total error tolerance. As long as the numeric value at each site stays within that site's interval, the sites need not send any messages to satisfy the application's accuracy requirements. If, however, the value at a site drifts outside the site's interval, the site must transmit the value to the central site and adjust its interval accordingly.
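The interval scheme described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular system. The class names `Site` and `Coordinator`, the equal split of the tolerance across sites, the `slack` factor that keeps the total width strictly below the tolerance, and the policy of re-centering an interval on the reported value are all assumptions made for illustration; the text does not specify how widths are allocated or how intervals are adjusted.

```python
from dataclasses import dataclass


@dataclass
class Site:
    value: float  # current local value (known exactly only at the site)
    low: float    # lower end of the interval known to the central site
    high: float   # upper end of the interval known to the central site


class Coordinator:
    """Tracks a bounded-error SUM over values held at remote sites."""

    def __init__(self, initial_values, total_tolerance, slack=0.9):
        n = len(initial_values)
        # Assumption: equal widths, scaled by `slack` so that the sum of
        # all widths stays strictly below the total error tolerance.
        self.width = slack * total_tolerance / n
        self.sites = [self._centered(v) for v in initial_values]
        self.messages = 0  # number of site-to-center transmissions

    def _centered(self, v):
        half = self.width / 2
        return Site(v, v - half, v + half)

    def local_update(self, i, new_value):
        """Apply an update at site i; return True if a message was sent."""
        site = self.sites[i]
        site.value = new_value
        if site.low <= new_value <= site.high:
            return False  # value still inside the interval: stay silent
        # Value drifted outside: report it and re-center the interval
        # (assumed adjustment policy).
        self.sites[i] = self._centered(new_value)
        self.messages += 1
        return True

    def sum_bounds(self):
        """Interval guaranteed to contain the true sum."""
        return (sum(s.low for s in self.sites),
                sum(s.high for s in self.sites))

    def true_sum(self):
        return sum(s.value for s in self.sites)
```

In this sketch the central site answers a sum query with the pair returned by `sum_bounds()`. The width of that answer equals the sum of the site interval widths, so it stays below the tolerance, and only updates that leave a site's interval cost a message.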
Another approach focuses on the problem of continually tracking top-k values in distributed data streams; the developed techniques ensure the continuing validity of the current top-k set (at the central site) by installing arithmetic constraints at each site.
Unfortunately, most existing approaches for processing data streams are primarily concerned with exploring space-accuracy tradeoffs (mostly for single streams) rather than communication-accuracy tradeoffs in a distributed streams setting.