The present invention relates to data processing, and more specifically, to processing data streams. Data stream processing typically refers to the in-memory, record-by-record analysis of machine data in motion. A common objective of data stream processing is to extract actionable intelligence in the form of streaming analytics, and to react to operational exceptions through real-time alerts and automated actions that correct or avert the underlying problem. The data streams that are processed are typically unstructured log records and sensor events, with each record including a timestamp indicating the exact time of data creation or arrival.
Over the past few years, there has been a significant increase in machine-generated data from logs, sensors, networks and devices, which has led to an exponential increase in data volume. This increase has been happening in parallel with a developing need for real-time so-called “Big Data” applications, as enterprises typically want to extract greater value from their real-time Big Data asset.
However, applications based on traditional “store-first, process-second” data management architectures are unable to scale for real-time Big Data applications, primarily due to the latency and throughput requirements for real-time applications in industries such as telecom, Internet of Things (IoT) and cyber-security.
Data stream processing, on the other hand, is a programming paradigm that naturally exposes task and pipeline parallelism. Streaming applications are directed graphs where vertices are operators and edges are data streams. Because the operators are independent of each other, and are fed continuous streams of tuples, they can naturally execute in parallel. The only communication between operators is through the streams that connect them. When operators are connected in chains, they expose inherent pipeline parallelism. When the same streams are fed to multiple operators that perform distinct tasks, they expose inherent task parallelism. This makes streaming applications popular in environments that require high-throughput, low-latency applications that can scale with both the number of cores in a machine and the number of machines in a cluster.
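By way of illustration only, the stream graph described above can be sketched in a few lines of Python. The sketch assumes that each operator runs in its own thread and that each stream is a FIFO queue; the names `operator`, `run_graph`, `drain` and `SENTINEL` are hypothetical and are introduced solely for this example. The threads show the structure of the graph rather than any particular performance characteristic.

```python
import queue
import threading

SENTINEL = object()  # end-of-stream marker (an assumption of this sketch)


def operator(fn, inbox, outboxes):
    """Apply fn to every tuple read from inbox and forward the result to each outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            # Propagate end-of-stream to every downstream operator.
            for out in outboxes:
                out.put(SENTINEL)
            return
        result = fn(item)
        for out in outboxes:
            out.put(result)


def drain(stream):
    """Collect every tuple from a finished stream into a list."""
    items = []
    while True:
        item = stream.get()
        if item is SENTINEL:
            return items
        items.append(item)


def run_graph(source_items):
    # Edges of the graph: one queue per stream.
    src, left_in, right_in, left_out, right_out = (queue.Queue() for _ in range(5))
    ops = [
        # Pipeline parallelism: this operator feeds the two operators chained after it.
        threading.Thread(target=operator, args=(lambda x: x * 2, src, [left_in, right_in])),
        # Task parallelism: two distinct operators consume copies of the same stream.
        threading.Thread(target=operator, args=(lambda x: x + 1, left_in, [left_out])),
        threading.Thread(target=operator, args=(lambda x: x * x, right_in, [right_out])),
    ]
    for op in ops:
        op.start()
    for item in source_items:
        src.put(item)
    src.put(SENTINEL)
    for op in ops:
        op.join()
    return drain(left_out), drain(right_out)
```

In this sketch the operators share no state and communicate only through the queues, which is the property that allows them to execute concurrently.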
While pipeline and task parallelism occur naturally in stream graphs, data parallelism requires intervention. In the streaming context, data parallelism involves splitting data streams and replicating operators. The parallelism obtained through replication can be better balanced than the inherent parallelism in a particular stream graph, and is easier to scale to the resources at hand. Such data parallelism allows operators to take advantage of additional cores and hosts that the task and pipeline parallelism are unable to exploit.
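As a minimal sketch of splitting a stream and replicating an operator, the following Python example distributes tuples round-robin across several replicas of one operator and then merges the results. It assumes that the replicated operator is stateless, and it tags each tuple with a sequence number so that the merger can restore the input order; that ordering scheme is one possible choice made for this example, not a technique stated above. The names `replica`, `run_data_parallel` and `width` are hypothetical.

```python
import heapq
import queue
import threading

SENTINEL = object()  # end-of-stream marker (an assumption of this sketch)


def replica(fn, inbox, outbox):
    """One copy of the replicated operator; keeps the sequence number with each result."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        seq, value = item
        outbox.put((seq, fn(value)))


def run_data_parallel(fn, items, width=3):
    inboxes = [queue.Queue() for _ in range(width)]
    merged = queue.Queue()
    workers = [
        threading.Thread(target=replica, args=(fn, inbox, merged))
        for inbox in inboxes
    ]
    for worker in workers:
        worker.start()

    # Splitter: tag each tuple with a sequence number and distribute round-robin.
    for seq, value in enumerate(items):
        inboxes[seq % width].put((seq, value))
    for inbox in inboxes:
        inbox.put(SENTINEL)

    # Merger: buffer out-of-order results and release them in sequence order.
    pending = []      # min-heap keyed on sequence number
    next_seq = 0
    finished = 0
    results = []
    while finished < width:
        item = merged.get()
        if item is SENTINEL:
            finished += 1
            continue
        heapq.heappush(pending, item)
        while pending and pending[0][0] == next_seq:
            results.append(heapq.heappop(pending)[1])
            next_seq += 1

    for worker in workers:
        worker.join()
    return results
```

For example, `run_data_parallel(lambda x: x * x, range(10), width=3)` returns the squares in input order regardless of which replica finishes first. An operator that keeps state across tuples would need a partitioning splitter instead of round-robin distribution, which this sketch does not attempt.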
Extracting data parallelism by hand is possible, but is usually cumbersome. Developers must identify where potential data parallelism exists, while at the same time considering whether applying data parallelism is safe. The difficulty of performing this optimization by hand grows quickly with the size of the application and the interaction of the subgraphs that comprise it. After identifying where parallelism is both possible and legal, developers may have to enforce ordering on their own. All of these tasks are tedious and error-prone. Further, unless an operator was explicitly written as a parallel or threaded operator, it may not be clear how to add processing resources. Yet further, explicitly creating parallel operator regions when the volume or velocity of data is low wastes resources, especially in a cloud setting. Thus, there is a need for improved data stream processing techniques.