Query-processing algorithms for conventional Database Management Systems (DBMS) typically rely on several passes over a collection of static data sets in order to produce an accurate answer to a user query. However, there is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. These streams in general comprise update operations (insertions, deletions and the like).
Providing even approximate answers to queries over continuous data streams is a requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. A large network processes data traffic and provides measurements of network performance, network routing decisions and the like. Other application domains giving rise to continuous and massive update streams include retail-chain transaction processing (e.g., purchase and sale records), ATM and credit-card operations, Web-server usage logging, and the like.
For example, assume that each of two routers within the network provides a respective update stream indicative of packet-related data, router behavior data and the like. It may be desirable to correlate the data streams from the two routers. Traditionally, such streams are correlated using a JOIN operation, which determines, for example, how many of the tuples associated with routers R1 and R2 have the same destination IP address. In the case of this JOIN query, the two data sets (i.e., those associated with R1 and R2) are joined on the attribute of interest and the size of the resulting joined set is determined (e.g., how many of the tuples have the same destination address).
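By way of illustration only, the JOIN-size computation described above may be sketched as follows. The function name `join_size` and the sample destination addresses are hypothetical and are introduced solely for this sketch. The sketch computes the exact join size and retains a full frequency table for each router, so it does not reflect the limited-memory setting of a data stream.

```python
from collections import Counter

def join_size(r1_dests, r2_dests):
    """Exact size of the join of two tuple collections on destination address.

    Each argument is an iterable of destination addresses, one per tuple.
    """
    c1 = Counter(r1_dests)  # frequency of each destination seen at R1
    c2 = Counter(r2_dests)  # frequency of each destination seen at R2
    # Each R1 tuple pairs with every R2 tuple having the same destination.
    return sum(n * c2[d] for d, n in c1.items())

# Hypothetical sample data for routers R1 and R2.
r1 = ["10.0.0.1", "10.0.0.1", "10.0.0.2"]
r2 = ["10.0.0.1", "10.0.0.3", "10.0.0.2", "10.0.0.2"]
print(join_size(r1, r2))  # prints 4 (2*1 for 10.0.0.1 plus 1*2 for 10.0.0.2)
```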
The ability to estimate the number of distinct (sub)tuples in the result of a join operation correlating two data streams (i.e., the cardinality of a projection with duplicate elimination over a join) is an important goal. Unfortunately, existing query-processing solutions do not adequately address such complex “Join-Distinct” estimation problems over data streams.
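For concreteness, the quantity targeted by a Join-Distinct query may be illustrated by the following brute-force sketch. The function name `join_distinct`, the two-attribute tuple layout, and the sample values are assumptions made only for this illustration. The sketch materializes all of the data in memory and returns the exact count; it is therefore a definition of the target quantity rather than a streaming estimator.

```python
def join_distinct(r1, r2):
    """Exact count of distinct (a, c) pairs in the join of R1(a, b) and R2(b, c) on b.

    r1 is an iterable of (a, b) tuples; r2 is an iterable of (b, c) tuples.
    """
    # Index R2 by the join attribute b, keeping distinct c values.
    by_key = {}
    for b, c in r2:
        by_key.setdefault(b, set()).add(c)
    # Project each joined tuple onto (a, c); the set eliminates duplicates.
    result = set()
    for a, b in r1:
        for c in by_key.get(b, ()):
            result.add((a, c))
    return len(result)

# Hypothetical sample data: R1 holds (source, destination) tuples,
# R2 holds (destination, tag) tuples.
r1 = [("s1", "d1"), ("s1", "d1"), ("s1", "d2"), ("s2", "d3")]
r2 = [("d1", "t1"), ("d2", "t1"), ("d1", "t2"), ("d4", "t3")]
print(join_distinct(r1, r2))  # prints 2: the pairs (s1, t1) and (s1, t2)
```

In this sample the join itself contains five tuples, whereas only two distinct projected pairs remain after duplicate elimination, which shows that the join size and the Join-Distinct count are different quantities.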