1. Technical Field
The present disclosure relates to data streaming and, more particularly, to methods and systems for reconfiguration and repartitioning of a parallel distributed stream process.
2. Discussion of Related Art
Distributed computer systems are widely employed to run jobs on multiple machines or processing nodes, which may be co-located within a single cluster or geographically distributed over wide areas. Some of these jobs run over long periods of time. In order to scale a job, multiple computers can be connected over a common network to form a cluster. An example of a system and method for large-scale data processing is MapReduce, disclosed in U.S. Pat. No. 7,650,331, entitled “System and method for efficient large-scale data processing”. MapReduce is a programming model and library designed to simplify distributed processing of very large datasets on large clusters of computers. It largely relieves the programmer of the burden of handling distributed computing tasks such as data distribution, process coordination, fault tolerance, and scaling.
System scalability can be achieved in several ways. In one approach, the process may be broken into a number of operators, which can be chained together in a directed acyclic graph (DAG), such that the input, output and/or execution of one or more operators is dependent on one or more other operators. In another approach, individual operators are partitioned into multiple independently operating units, and each unit works on a subset of the data. In this approach, the results of all units of an operator are redistributed to the next set of operators in the computational graph.
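By way of non-limiting illustration, the second approach can be sketched in Python as follows. The sketch assumes a two-stage graph (a tokenizing operator followed by a counting operator split into several units) and key-based hash partitioning for the redistribution step; the function names `partition_index`, `redistribute`, `count_unit`, and `run` are hypothetical and chosen only for this example.

```python
import zlib
from collections import defaultdict


def partition_index(key, num_units):
    # CRC32 is used as a stable hash so that a given key is always
    # routed to the same downstream unit (an assumption of this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_units


def redistribute(records, num_units):
    # Route (key, value) records from the upstream operator to the
    # units of the next operator in the computational graph.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_index(key, num_units)].append((key, value))
    return buckets


def count_unit(records):
    # One independently operating unit: it sees only its own subset.
    counts = defaultdict(int)
    for key, value in records:
        counts[key] += value
    return dict(counts)


def run(lines, num_units=3):
    # Stage 1: tokenizing operator emits (word, 1) pairs.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Redistribution between stage 1 and stage 2.
    buckets = redistribute(pairs, num_units)
    # Stage 2: each unit of the counting operator works on its subset.
    partials = [count_unit(buckets[i]) for i in range(num_units)]
    # Keys never span two units, so the partial results can simply be merged.
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged
```

Because every occurrence of a key is routed to the same unit, the partial results of the units do not overlap and can be combined without further aggregation. Other redistribution schemes, such as range partitioning, would fit the same structure.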
During the lifetime of a job, the workload may fluctuate substantially and data may be unevenly distributed between operators. A common approach to the resource allocation problem is to break the job into a substantially larger number of sub-tasks than there are available nodes and to schedule the sub-tasks in priority order. Long-running jobs may be interspersed with shorter jobs, thereby reducing overall execution time. A number of algorithms to achieve efficient job scheduling with different performance and efficiency objectives have been researched and published.
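By way of non-limiting illustration, one simple form of such priority-ordered scheduling can be sketched in Python as follows. The sketch assumes that each sub-task has a known duration and a numeric priority (a lower value runs first) and that each sub-task is assigned to whichever node becomes free earliest; the function name `schedule` and the tuple layout are hypothetical. It represents only one of the many published scheduling strategies.

```python
import heapq


def schedule(tasks, num_nodes):
    """Greedy list scheduling.

    tasks: list of (priority, name, duration); lower priority runs first.
    Returns (assignments, makespan), where assignments maps each task
    name to (node_id, start_time).
    """
    # Min-heap of (time at which the node becomes free, node_id).
    nodes = [(0.0, node_id) for node_id in range(num_nodes)]
    heapq.heapify(nodes)

    assignments = {}
    makespan = 0.0
    for _priority, name, duration in sorted(tasks):
        # Take the node that becomes free earliest.
        free_at, node_id = heapq.heappop(nodes)
        assignments[name] = (node_id, free_at)
        finish = free_at + duration
        makespan = max(makespan, finish)
        heapq.heappush(nodes, (finish, node_id))
    return assignments, makespan


# One long sub-task and four short ones on two nodes: the short
# sub-tasks run back to back on the second node while the long one
# occupies the first.
example_tasks = [
    (0, "long", 4.0),
    (1, "s1", 1.0),
    (1, "s2", 1.0),
    (1, "s3", 1.0),
    (1, "s4", 1.0),
]
assignments, makespan = schedule(example_tasks, num_nodes=2)
```

In this example both nodes finish at time 4.0, whereas running the five sub-tasks on a single node would take 8.0 time units.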
In a stream processing system with a distributed computational model, data constantly flows from one or more input sources through a set of operators (e.g., filters, aggregates, and correlations), which are usually identified by a network name in the cluster, to one or more output sinks. The individual operators generally produce result sets that are sent either to applications or to other nodes for additional processing. In general, the output of an operator can branch to multiple downstream operators and can be combined by operators with multiple inputs. This form of computation and data transport is called a stream. Commonly, due to low-latency communication requirements in a stream process, the individual operators in the stream are long-lived. Once the operators are instantiated on a node, and after the network connections have been established, the job configuration typically stays fixed except in cases of hardware and/or software failures.
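By way of non-limiting illustration, the flow from a source through a chain of operators to a sink can be sketched in a single process using Python generators. The sketch assumes a linear chain of one filter operator and one aggregate operator and omits network transport, branching, and multi-input operators; the names `source`, `filter_op`, `windowed_sum`, and `sink` are hypothetical.

```python
def source(readings):
    # Input source: emits one record at a time.
    yield from readings


def filter_op(stream, threshold):
    # Filter operator: passes only records at or above the threshold.
    for record in stream:
        if record >= threshold:
            yield record


def windowed_sum(stream, window_size):
    # Aggregate operator: emits the sum of every full window of
    # window_size records. A trailing incomplete window is dropped.
    window = []
    for record in stream:
        window.append(record)
        if len(window) == window_size:
            yield sum(window)
            window = []


def sink(stream):
    # Output sink: collects the results for inspection.
    return list(stream)


# Records flow source -> filter -> aggregate -> sink.
results = sink(windowed_sum(filter_op(source([1, 5, 7, 2, 9, 4, 8]), 4), 2))
# results == [12, 13]
```

Each operator consumes records as they arrive and holds only its own local state, which mirrors the long-lived operators described above. In a distributed deployment each generator would instead run on a node and exchange records over a network connection.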
In a stream process, fixed job configurations generally lead to sub-optimal resource utilization. Commonly, the system designer provisions the system for the peak load, so that during non-peak times the system is underutilized. An alternative approach that may avoid overprovisioning is to use a virtual machine environment and consolidate the virtual worker machines onto fewer physical nodes, thereby freeing up physical resources. In conventional stream processing methods, it is generally assumed that once a network connection between operators has been established, the flow configuration remains static.