1. Field of the Invention
This invention relates to systems and methods for processing a stream of events.
2. Background of the Invention
MapReduce has emerged as a popular method for processing large data sets or “big data.” Using MapReduce, a developer simply writes a map function and a reduce function. The system automatically distributes the workload over a cluster of commodity machines, monitors the execution, and handles failures. In the past few years, however, not just big data, but fast data, i.e., high-speed real-time and near-real-time data streams, has also exploded in volume and availability. Prime examples include sensor data streams, real-time stock market data, and social-media feeds such as Twitter, Facebook, YouTube, Foursquare, and Flickr. The emergence of social media in particular has greatly fueled the growth of fast data, with well over 4,000 tweets per second (400 million tweets per day), 3 billion Facebook likes and comments per day, and 5 million Foursquare check-ins per day.
Numerous applications that deal with these and similar data streams must process fast data, often with minimal latency and high scalability. For example, an application that monitors the Twitter Firehose for an ongoing earthquake may want to report relevant information within a few seconds of when a tweet appears, and must handle drastic spikes in tweet volume.
MapReduce is not particularly well suited to fast data. First, MapReduce runs on a static snapshot of a data set, while stream computations proceed over an evolving data stream. In MapReduce, the input data set does not (and cannot) change between the start of the computation and its finish, and no reducer can begin until all mappers have finished. In stream computations, the data is changing all the time; there is no such thing as working with a “snapshot” of a stream. Second, every MapReduce computation has a “start” and a “finish.” In contrast, stream computations never end: the data stream goes on forever. Third, in the MapReduce model, the reduce step needs to see a key and all the values associated with the key; this is impossible in a streaming model, where further values for a key may arrive at any time.
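The contrast can be illustrated with a minimal sketch, offered only as an aid to understanding and not as the method of the invention. It assumes a word-count task; the function names map_words, batch_word_count, and streaming_word_count are hypothetical. The batch version cannot reduce until every record in the snapshot has been mapped, whereas the streaming version keeps running per-key state and emits an updated result as each event arrives.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def batch_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Batch MapReduce style: reduce runs only after every record is mapped."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:                      # map phase over the full snapshot
        for key, value in map_words(line):
            grouped[key].append(value)      # shuffle: collect all values per key
    # Reduce phase: each key sees its complete list of values.
    return {key: sum(values) for key, values in grouped.items()}


def streaming_word_count(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Streaming style: keep per-key state and emit an updated count per event."""
    counts: Dict[str, int] = defaultdict(int)
    for line in lines:                      # the stream may never end
        for key, value in map_words(line):
            counts[key] += value            # fold the new value into running state
            yield key, counts[key]          # an updated result is available at once


if __name__ == "__main__":
    snapshot = ["quake in tokyo", "big quake"]
    print(batch_word_count(snapshot))
    for update in streaming_word_count(snapshot):
        print(update)
```

In the batch function the final dictionary exists only after the whole input has been consumed, so it would never be produced for an unbounded stream. The streaming function trades the "all values for a key" guarantee for a sequence of partial results, which is the kind of behavior a low-latency stream processor needs.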
Accordingly, what is needed is an improved method for performing a MapReduce-type operation on streaming data with very low latency.