The present invention is generally directed to performing queries on data streams. More specifically, the present invention is directed to a method and system for eliminating join operations from queries on data streams by approximating the join operations.
Typically, databases are used to store large amounts of data. A database query is used to analyze the data stored in a database. More particularly, a database query specifies a result to be calculated using the data stored in the database. A database query is often specified using structured query language (SQL). Join operations are common in database queries. Join operations join multiple tables of a database by requesting data from one table that matches data from another table.
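By way of illustration only, the matching performed by a join operation on stored tables can be sketched as follows. The table contents, the column name `cust_id`, and the function name `equi_join` are hypothetical and are chosen solely for this example; the sketch uses a simple hash-based equality join and is not intended to describe any particular database system.

```python
from collections import defaultdict


def equi_join(left, right, key):
    """Join two tables (lists of dict rows) on equality of a shared key column."""
    # Index the right-hand table by key so each left row is matched in O(1) on average.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for row in left:
        # Emit one combined row for every right-hand row with the same key value.
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical example tables.
orders = [
    {"cust_id": 1, "item": "disk"},
    {"cust_id": 2, "item": "cable"},
    {"cust_id": 3, "item": "fan"},
]
customers = [
    {"cust_id": 1, "name": "Ada"},
    {"cust_id": 2, "name": "Lin"},
]

result = equi_join(orders, customers, "cust_id")
```

In this sketch, the order with `cust_id` 3 has no matching customer row and therefore does not appear in the result, which is the behavior of an inner join.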
Data streams are sequences of data used to transmit or receive information. Data streams are used to transmit large amounts of data in a small amount of time. Examples of data streams include network traffic, financial data such as stock market data, sensor readings, and military data. Typically, it is impossible or inconvenient to store all of the data received in a data stream due to the large amounts of data being transmitted. However, it is often necessary to analyze this data. A data stream management system (DSMS) is typically a computer program which monitors data streams and performs operations on the data in the data streams. In a conventional DSMS, queries are performed on data streams arriving at the DSMS. In order to perform the queries, data from the data streams is temporarily stored and then deleted after the query is completed.
Because an entire data stream is likely to be too large to process at once, continuous queries are used to perform queries on data streams. In performing a continuous query, a DSMS continuously evaluates data in a data stream as it arrives and reports results of the query over a specified time window or grouping granularity. Queries requiring join operations on data streams often have temporal join conditions, which request the matching of data from one stream with data from another stream arriving within a specified time. Therefore, the DSMS must store data from each arriving stream for the duration of the temporal join condition or until a match is found, and must compare each piece of data arriving on each stream with the data stored for the other streams to determine whether a match exists. As the speed of the data streams increases, the amount of data that must be stored in order to perform a join operation increases. Accordingly, the storage and computational costs of join operations make join operations inefficient or impossible in high speed data stream applications. For example, a network router can forward packets at high speed, typically spending a few hundred nanoseconds per packet. In order for a DSMS to perform a query requiring a join operation joining multiple streams of packets, even if the temporal join condition is only a few seconds, the storage and computational requirements needed to perform the join operation would make this operation infeasible.
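The buffering and comparison described above can be sketched, for illustration only, as a two-stream join with a temporal join condition. The class name `WindowJoin`, the stream labels "A" and "B", and the helper `tuples_buffered` are hypothetical, and the sketch assumes that timestamps arrive in non-decreasing order across both streams. The figures of 300 nanoseconds per packet and a 3 second window are assumed values standing in for "a few hundred nanoseconds" and "a few seconds"; they are not taken from any measured system.

```python
from collections import deque


class WindowJoin:
    """Naive two-stream join: buffer every tuple for the duration of the window."""

    def __init__(self, window):
        self.window = window  # temporal join condition, in seconds
        self.buffers = {"A": deque(), "B": deque()}
        self.peak_buffered = 0  # largest number of tuples held at once

    def _expire(self, now):
        # Drop stored tuples that are too old to satisfy the temporal condition.
        for buf in self.buffers.values():
            while buf and now - buf[0][0] > self.window:
                buf.popleft()

    def on_tuple(self, stream, ts, key):
        other = "B" if stream == "A" else "A"
        self._expire(ts)
        # Compare the arriving tuple with every tuple stored for the other stream.
        matches = [(ts, other_ts, key)
                   for other_ts, other_key in self.buffers[other]
                   if other_key == key]
        # Store the arriving tuple so later tuples on the other stream can match it.
        self.buffers[stream].append((ts, key))
        size = sum(len(b) for b in self.buffers.values())
        self.peak_buffered = max(self.peak_buffered, size)
        return matches


def tuples_buffered(packet_interval_ns, window_s):
    """Tuples one stream accumulates during the window at a fixed arrival rate."""
    return window_s * 1_000_000_000 // packet_interval_ns


join = WindowJoin(window=2.0)
first = join.on_tuple("A", 0.0, "x")   # nothing stored yet, so no match
second = join.on_tuple("B", 1.0, "x")  # matches the tuple from stream A
third = join.on_tuple("B", 5.0, "x")   # stream A tuple has expired, so no match

# Assumed figures: one packet every 300 ns, 3 second temporal join condition.
per_stream = tuples_buffered(300, 3)
```

Under these assumed figures, each stream would contribute on the order of ten million buffered tuples, and each arriving tuple would be compared against the buffer of the other stream, which illustrates why the storage and computational costs grow with stream speed.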
Accordingly, it is desirable to perform queries on data streams without performing join operations. Furthermore, it is desirable to accurately approximate the join operations without using the storage and computational resources required to perform the join operations.