Traditional database management systems (DBMS) deal with persistent data sets that are reliably stored and may be accessed multiple times during any query. In several important application domains, however, data arrives continuously and needs to be processed in a single pass. Such continuous data-streams arise naturally in a number of applications including telecommunication networks, retail chain transactions and banking automated teller machine (ATM) transactions.
In order to monitor these data-streams and detect patterns that may, for instance, indicate fraudulent use, equipment malfunction or non-optimal configuration, it is necessary to query these data-streams in real time using algorithms that access each data element in the stream only once, in the arbitrary order in which the data element appears in the data-stream. Because of the limitations of the computers doing the monitoring, it is also necessary that these algorithms use only a relatively small amount of memory. Moreover, the need for real-time answers means that the time for processing each element must also be small.
Estimating the cardinality of set expressions is one of the most fundamental classes of queries. Such set expressions are an integral part of standard structured query language (SQL), which supports UNION, INTERSECT and EXCEPT queries. (The SQL EXCEPT query is a set-difference query.)
In order to calculate set-expression cardinality, standard SQL programs make multiple passes over complete sets of stored data. Such algorithms cannot answer these queries when the data arrives as a data-stream unless all of the data is first stored.
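As an illustration of the stored-data approach described above, the following minimal sketch computes exact cardinalities for the three SQL set operators by first materializing each input as a set. The function name `set_expression_cardinalities` and the integer element type are assumptions made for this example and do not come from any particular system. The memory used grows with the number of distinct elements, which is the cost a single-pass, small-memory algorithm must avoid.

```python
from typing import Dict, Iterable


def set_expression_cardinalities(
    stream_a: Iterable[int], stream_b: Iterable[int]
) -> Dict[str, int]:
    """Return exact cardinalities of A UNION B, A INTERSECT B and A EXCEPT B.

    Hypothetical helper for illustration only: every distinct element of
    both inputs is held in memory before any answer can be produced.
    """
    a = set(stream_a)  # stores every distinct element of A
    b = set(stream_b)  # stores every distinct element of B
    return {
        "UNION": len(a | b),      # elements in A or B
        "INTERSECT": len(a & b),  # elements in both A and B
        "EXCEPT": len(a - b),     # elements in A but not in B (set difference)
    }


if __name__ == "__main__":
    # Duplicates in the input are counted once, as in SQL set semantics.
    print(set_expression_cardinalities([1, 2, 2, 3, 5], [3, 4, 5, 5]))
```

With the sample inputs shown, A has the distinct elements 1, 2, 3 and 5, and B has 3, 4 and 5, so the union has five elements, the intersection two, and the difference two.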