1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular, to a method and apparatus for processing data streams. Still more particularly, the present invention relates to a method, apparatus, and computer usable program code for providing load diffusion to perform data stream correlations using a distributed stream processing system.
2. Description of the Related Art
Many emerging applications call for sophisticated real-time processing of data streams. These applications are referred to as stream applications. Examples of stream applications include stock trading surveillance for fraud detection, network traffic monitoring for intrusion detection, sensor data analysis, and video surveillance. In these stream applications, data streams from external sources flow into a stream processing system, where the data streams are processed by different continuous query processing elements called “operators”. These processing elements or operators may take the form of software, hardware, or some combination thereof.
To support unbounded streams, the stream processing system associates a sliding window with each stream. The window contains the most recently arrived data items on the stream, which are called tuples. A tuple is a set of values. The window can be time-based or tuple-based. A time-based window may contain, for example, the tuples that arrived in the last 10 minutes, while a tuple-based window may contain, for example, the last 1000 tuples. One important continuous query operator is the sliding window join between two streams, S1 and S2. The output of this window join contains every pair of tuples from streams S1 and S2 that satisfies the join predicate and whose members are simultaneously present in their respective windows.
The join predicate is a comparison function over one or more common attributes between two tuples. The basic join predicate is an equality comparison between two tuples s1 and s2 over a common attribute A, denoted by s1.A = s2.A. The sliding window join has many applications. For example, consider two streams in which one stream contains phone call records and the other stream contains stock trading records. A sliding window join that correlates suspicious phone calls with anomalous trading records over the common attribute “trade identifier” can be used to generate trading fraud alerts.
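The time-based sliding window join with an equality predicate described above can be sketched as follows. This is a minimal illustration only, assuming that tuples arrive in timestamp order, that each tuple is a Python dictionary, and that a tuple stays in its window for `window` time units after arrival. The function name `sliding_window_join`, the event layout, and the attribute name `trade_id` are hypothetical and do not come from the source.

```python
from collections import deque

def sliding_window_join(events, window, key):
    """Join streams 1 and 2 on tuple[key] over a time-based sliding window.

    events: iterable of (timestamp, stream_id, tuple_dict) in timestamp
            order, where stream_id is 1 or 2 (assumed layout).
    window: window length in the same unit as the timestamps.
    Returns a list of (s1_tuple, s2_tuple) pairs.
    """
    windows = {1: deque(), 2: deque()}
    results = []
    for ts, sid, tup in events:
        # Expire tuples that have slid out of either window.
        for w in windows.values():
            while w and w[0][0] <= ts - window:
                w.popleft()
        # Probe the opposite stream's window with the new tuple.
        other = 2 if sid == 1 else 1
        for _, other_tup in windows[other]:
            if other_tup[key] == tup[key]:  # join predicate s1.A = s2.A
                pair = (tup, other_tup) if sid == 1 else (other_tup, tup)
                results.append(pair)
        # Insert the new tuple into its own window.
        windows[sid].append((ts, tup))
    return results

# Hypothetical phone call (stream 1) and trading (stream 2) records.
events = [
    (0, 1, {"trade_id": 7, "call": "a"}),
    (5, 2, {"trade_id": 7, "trade": "x"}),   # joins with call "a"
    (20, 2, {"trade_id": 7, "trade": "y"}),  # call "a" has expired by now
]
alerts = sliding_window_join(events, window=10, key="trade_id")
```

Each arriving tuple is compared only against tuples currently held in the opposite window, which is why the memory needed grows with both the stream rate and the window size.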
In many cases, stream applications require immediate on-line results, which implies that query processing should use in-memory processing as much as possible. However, given high stream rates and large window sizes, even a single sliding window join operator can have a large memory requirement. Moreover, some query processing, such as video analysis, can also be computation-intensive. Thus, a single server may not have sufficient resources to produce accurate join results while keeping up with high input rates. Currently, two basic solutions address this challenge: shedding part of the workload by providing approximate query results, or offloading part of the workload to other servers.
Much work on stream processing has been performed to provide efficient resource management for a single server site. To further scale up stream processing, recent work has proposed processing high-volume data streams using distributed stream processing systems. One such proposal uses a dynamic load distribution algorithm that provides coarse-grained load balancing at the inter-operator level. However, inter-operator load distribution alone may not be sufficient, since this type of load distribution does not allow a single operator to collectively use resources on multiple servers. For example, if an operator requires 40 KB of memory while each single server has only 39 KB of available memory, the coarse-grained scheme cannot execute the operator with full precision, even though the server cluster as a whole has sufficient available memory. Other work has studied intra-operator load distribution for processing a single windowed aggregate operator on multiple servers.
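The 40 KB versus 39 KB example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names and the three-server cluster are assumptions, and the split check looks only at aggregate memory, ignoring any overhead or constraint that spreading one operator across servers would add.

```python
def fits_on_one_server(operator_kb, free_kb_per_server):
    """Inter-operator (coarse-grained) placement: the whole operator
    must fit on a single server."""
    return any(free >= operator_kb for free in free_kb_per_server)

def fits_across_servers(operator_kb, free_kb_per_server):
    """Intra-operator placement, idealized: the operator may use memory
    on several servers, so only the aggregate matters here."""
    return sum(free_kb_per_server) >= operator_kb

cluster = [39, 39, 39]  # hypothetical free memory per server, in KB
whole = fits_on_one_server(40, cluster)   # no single server has 40 KB free
split = fits_across_servers(40, cluster)  # the cluster has 117 KB in total
```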
Although load balancing and load sharing have been extensively studied in conventional distributed and parallel computing environments, the resulting techniques are not directly applicable to dynamic stream environments. First, a stream processing system executes long-running query computations over unbounded data streams. Thus, static load distribution algorithms cannot be used, since load conditions can vary in unpredictable ways. Second, the load fluctuation in the stream processing system is caused not only by the different queries presented to the system, but also by changing stream rates that can present transient spikes. Resource management must adapt to dynamic stream environments, since the system cannot control the rates of input streams from external sources. Third, windowed stream joins require the load balancing algorithm to satisfy a new correlation constraint: correlated tuples must be sent to the same server in order to produce accurate join results. The correlated tuples are those tuples that need to be joined based on the sliding window definition.
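The correlation constraint can be made concrete with a small experiment. The sketch below is an assumption-laden illustration and not a method taken from the source: it compares a round-robin distribution, which ignores the constraint, with routing by join attribute value, which is one simple way to keep correlated tuples on the same server. The sliding window is omitted for brevity, and all function names and the `trade_id` attribute are hypothetical.

```python
def local_join(events, key):
    """Equijoin of the stream 1 and stream 2 tuples held by one server."""
    left = [t for sid, t in events if sid == 1]
    right = [t for sid, t in events if sid == 2]
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def distribute(events, n_servers, route):
    """Send each (stream_id, tuple) to the server chosen by route()."""
    servers = [[] for _ in range(n_servers)]
    for i, (sid, tup) in enumerate(events):
        servers[route(i, tup) % n_servers].append((sid, tup))
    return servers

def total_results(events, n_servers, route, key):
    """Number of join results produced across all servers."""
    return sum(len(local_join(s, key))
               for s in distribute(events, n_servers, route))

events = [
    (1, {"trade_id": 7}), (2, {"trade_id": 7}),
    (1, {"trade_id": 8}), (2, {"trade_id": 8}),
]

def round_robin(i, tup):
    return i  # ignores tuple content

def by_key(i, tup):
    return tup["trade_id"]  # equal join keys land on the same server

expected = len(local_join(events, "trade_id"))           # 2 on one server
lost = total_results(events, 2, round_robin, "trade_id")  # 0: pairs split up
kept = total_results(events, 2, by_key, "trade_id")       # 2: pairs co-located
```

Routing by key preserves the join results in this example, but it ties the load on each server to the distribution of key values, so it does not by itself address varying load.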