1. Field of the Invention
The present invention relates generally to data stream query processing, and more particularly to stream processing that is independent of order.
2. Description of Related Art
Research and early commercial work in stream query processing viewed data streams as continuous sequences of data that arrived more or less in order. This assumption heavily influenced the architecture of first-generation stream-processing systems. Such systems typically tolerate only small degrees of delay and out-of-order arrival in streaming data, and do so at the expense of introducing mandatory latencies and potentially expensive buffering into what should be high-speed, lightweight stream processing. When the discontinuity of the input exceeds the limits that such systems can handle, the incoming data is often simply dropped or, at best, spooled to backup storage for eventual (and often manual) integration into the system.
In practice, however, streams are rarely continuous, for several reasons. In distributed environments, data from multiple sources can arrive in arbitrary order even under normal circumstances. Failure or disconnection followed by recovery of connectivity between remote sites causes even larger discontinuities, as sites that have been down or disconnected for long periods of time finally wake up and start transmitting old but important data. A similar pattern unfolds during temporary disruptions such as network partitions between datacenters connected over WAN environments. Parallel execution techniques, which are critical to scaling up and out in multi-core and cluster systems, break the sequential, in-order nature of stream processing. Finally, high-availability mechanisms for streaming systems, in which a recovering server must obtain missing data from other sites, create situations where data arrives piecemeal and not necessarily in order.
The problems described above are particularly acute in emerging “Big Data” applications, where stream processing systems are increasingly used. Consider, for example, the web-based digital media ecosystem of organizations delivering various services (e.g., social networking, advertising, video, mobile, etc.) to internet-scale audiences. Such services operate in highly dynamic environments where monitoring and event data inherently arrive at multiple time-scales, and where query results, analyses, and predictions are needed across multiple time-scales. In such environments, the source data typically comes from log files spooled by large banks of distributed web/application servers. Failures in the source systems are commonplace, and the log files are often delivered to the analytics system hours or sometimes days late. Finally, these services are increasingly deployed in cloud environments and have stringent availability requirements that must also be satisfied.