Large volumes of data in the form of continuous data-streams are generated by a number of applications including telecommunication networks, retail chain transactions and banking automated teller machine (ATM) transactions.
In order to monitor these data-streams and detect patterns that may, for instance, indicate fraudulent use, equipment malfunction or non-optimal configuration, it is desirable to query these data-streams in real time using algorithms that see each data element in the stream only once, in whatever arbitrary order the elements appear in the data-stream. Because of the limitations of the computers doing the monitoring, it is also desirable that these algorithms use only a relatively small amount of memory. Moreover, the need for real-time answers means that the time spent processing each element should also be small.
A particularly desirable form of monitoring is the ability to perform queries on these data-streams that are similar to the structured query language (SQL) queries performed on more traditional fixed databases.
For instance, a telecommunications network operator might want to know how many subscribers in a particular area are experiencing incomplete calls. In a traditional relational database, this question would be answered by examining two tables, the first table relating subscribers to their location, and the second table relating subscribers to incomplete calls. In particular, a SQL join of the two tables would be performed to create a new table relating the subscribers in a particular location to incomplete calls, i.e., a table of subscribers in that location who are experiencing incomplete calls. The required result is the number of subscribers in the new table, i.e., the required result is the size of the join.
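As a point of reference, the exact join-size computation described above can be sketched in Python for the case where both tables fit in memory. This is only an illustrative sketch: the table contents and the names `join_size`, `subscriber_location` and `incomplete_calls` are hypothetical and do not come from any particular system. The sketch keeps every row available for repeated inspection, which is exactly what the data-stream setting does not allow.

```python
from collections import Counter


def join_size(left_keys, right_keys):
    """Exact size of an equi-join, given the join-key column of each table."""
    left_freq = Counter(left_keys)
    right_freq = Counter(right_keys)
    # A key appearing a times on the left and b times on the right
    # contributes a * b rows to the joined table.
    return sum(count * right_freq[key] for key, count in left_freq.items())


# Hypothetical first table: (subscriber, location).
subscriber_location = [("alice", "north"), ("bob", "north"), ("carol", "south")]
# Hypothetical second table: one subscriber id per row with incomplete calls.
incomplete_calls = ["alice", "carol", "dave"]

# Restrict the first table to the location of interest, then join on subscriber.
north_subscribers = [sub for sub, loc in subscriber_location if loc == "north"]
result = join_size(north_subscribers, incomplete_calls)
print(result)  # 1: only "alice" is in "north" and has incomplete calls
```

The frequency tables built here grow with the number of distinct join-key values, so memory use is not bounded in the way the monitoring scenario requires.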
The problem is how to provide a reasonably accurate approximate answer to such SQL-like queries over join operations, such as calculating the size of a join, when the data is arriving in a data-stream and each data element can only be examined once. Moreover, the estimated answer needs to be provided in real time using limited computer memory.