1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular, to a computer implemented method and apparatus for processing data streams. Still more particularly, the present invention relates to a computer implemented method, apparatus, and computer usable program code for scalable processing of multi-way data stream correlations.
2. Description of the Related Art
Stream processing computing applications are applications in which data enters the system as an information flow that satisfies some restriction on the data. Because the volume of data being processed may be too large to be stored, the information flow calls for sophisticated real-time processing over dynamic data streams, such as sensor data analysis and network traffic monitoring. Examples of stream processing computing applications include video processing, audio processing, streaming databases, and sensor networks. In these applications, data streams from external sources flow into a data stream management system, where they are processed by different continuous query operators.
To support unbounded streams, the stream processing system associates a sliding window with each stream. The sliding window contains the most recently arrived data items on the stream. The window may be either time-based, such as the video frames that arrived in the last 60 seconds, or number-based, such as the last 1000 video frames. One of the most important continuous query operators is the sliding-window join over multiple different data streams. The output of the sliding-window join contains all sets of correlated tuples that satisfy a pre-defined join predicate and are simultaneously present in their respective windows.
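The sliding-window join described above can be illustrated with a minimal two-stream sketch using time-based windows. This is an illustration only, not the method of the invention: the names `SlidingWindow` and `window_join`, the `(timestamp, stream_id, item)` arrival format, and the rule that a tuple expires once it is `span` or more time units old are all assumptions made for the example.

```python
from collections import deque


class SlidingWindow:
    """Time-based window of (timestamp, item) pairs (hypothetical helper)."""

    def __init__(self, span):
        self.span = span      # window length in time units
        self.items = deque()  # oldest tuple on the left

    def expire(self, now):
        # Assumption: a tuple leaves the window once it is `span` or more
        # time units older than the current arrival.
        while self.items and self.items[0][0] <= now - self.span:
            self.items.popleft()

    def insert(self, ts, item):
        self.items.append((ts, item))


def window_join(arrivals, span, predicate):
    """Join two streams over time-based sliding windows.

    `arrivals` is an iterable of (timestamp, stream_id, item) tuples in
    timestamp order, with stream_id 0 or 1. Returns every pair
    (item_from_stream_0, item_from_stream_1) that satisfies `predicate`
    while both items are present in their respective windows.
    """
    windows = [SlidingWindow(span), SlidingWindow(span)]
    results = []
    for ts, sid, item in arrivals:
        # Drop tuples that have slid out of either window.
        for window in windows:
            window.expire(ts)
        # Probe the other stream's window with the newly arrived tuple.
        for _, other_item in windows[1 - sid].items:
            pair = (item, other_item) if sid == 0 else (other_item, item)
            if predicate(*pair):
                results.append(pair)
        # Only then add the new tuple to its own window.
        windows[sid].insert(ts, item)
    return results


arrivals = [(0, 0, 5), (1, 1, 5), (3, 1, 7), (12, 0, 7), (14, 1, 5)]
matches = window_join(arrivals, span=10, predicate=lambda a, b: a == b)
print(matches)  # [(5, 5), (7, 7)]
```

In this run, the item 5 that arrives on stream 1 at time 14 produces no output, because the matching item on stream 0 (time 0) has already expired from its window. A number-based window could be sketched the same way by replacing the time-based expiry with `deque(maxlen=1000)`, which discards the oldest item automatically.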
Example applications of the sliding-window join include searching for similar images among different news video streams for hot topic detection, and correlating source/destination addresses among different network traffic flows for intrusion detection. Key-based equijoins may be less effective because many stream correlation applications demand more complex join predicates than key comparisons. For example, in a news video correlation application, the join condition is whether the distance between two images' 40-dimensional classification values is below a threshold value. Thus, correlating data of different streams means finding those data items on different streams that satisfy one or more pre-defined correlation predicates.
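A distance-based join predicate of the kind described for the news video example might be sketched as follows. The text does not specify the distance metric, so Euclidean distance is an assumption here, as are the function name `within_distance` and the sample vectors and threshold values.

```python
import math


def within_distance(a, b, threshold):
    """Return True if two classification vectors are closer than `threshold`.

    Assumption: Euclidean distance; the actual metric is not specified.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return distance < threshold


# Two hypothetical 40-dimensional classification vectors.
image_a = [0.0] * 40
image_b = [0.1] * 40  # distance to image_a is sqrt(40 * 0.01), about 0.632

print(within_distance(image_a, image_b, threshold=1.0))  # True
print(within_distance(image_a, image_b, threshold=0.5))  # False
```

Unlike a key-based equijoin, a predicate of this form cannot be answered by a hash lookup on a key, so each arriving tuple generally has to be compared against the tuples held in the other windows.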
A major challenge in processing multi-way stream joins is performing a large number of join comparisons over multiple high-volume, time-varying data streams in real time. Given high stream rates and large window sizes, windowed stream joins may have large memory requirements. Moreover, some query processing, such as image comparison, may also be central processing unit (CPU) intensive. A single host may easily be overloaded by the multi-way stream join workload.