In the context of this document, distributed computing refers to hardware and software systems containing multiple processing elements and concurrent processes running under loose control. In particular, in distributed computing, a program is split into parts that run simultaneously on multiple computers communicating over a network. In contrast, parallel computing involves simultaneously running program segments on multiple processors of a single machine. Distributed computing must address heterogeneous environments, network links of varying latencies and unpredictable failures within the network of computers.
A query processing task to be performed in a distributed environment is split into operators. An operator is a unit of work to complete a sub-task associated with the task. The unit of work may be an operational code (opcode) or set of opcodes. An opcode is the portion of a machine language instruction that specifies an operation to be performed. The specification and format of an operator are defined by the instruction set architecture of the underlying processor. A collection of operators forms a data processing operation that executes in a pipelined fashion. An operator works on objects. As used herein, an object refers to operands or data that are processed by an operator. In a distributed computing environment, objects are commonly processed as batches, partitions, keys and rows. A batch is a large collection of data. Partitions define the division of data within a batch. Keys correlate a set of data within a partition. Each key has an associated set of data, typically in one or more rows or tuples.
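By way of illustration only, the relationship between batches, partitions, keys and rows, and the pipelined execution of operators over them, may be sketched as follows. This is a minimal sketch, not an implementation described in this document. The row layout, the partitioning rule and the names `partition_batch`, `filter_positive`, `double_values`, `group_by_key` and `run_pipeline` are all assumptions introduced for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical row layout: (key, value). Not taken from the source document.
Row = Tuple[str, int]


def partition_batch(batch: List[Row], num_partitions: int) -> List[List[Row]]:
    """Divide a batch into partitions using a deterministic function of the key."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for key, value in batch:
        # Summing the key's bytes keeps the assignment stable across runs
        # (Python's built-in hash() of str is salted per process).
        index = sum(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def filter_positive(rows: Iterable[Row]) -> Iterator[Row]:
    """Operator: pass through only rows whose value is positive."""
    for key, value in rows:
        if value > 0:
            yield key, value


def double_values(rows: Iterable[Row]) -> Iterator[Row]:
    """Operator: double the value of each row."""
    for key, value in rows:
        yield key, value * 2


def group_by_key(rows: Iterable[Row]) -> Dict[str, List[int]]:
    """Operator: correlate the rows of a partition by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


def run_pipeline(partition: List[Row]) -> Dict[str, List[int]]:
    """Chain the operators; generators let each row flow through the
    filter and doubling operators without materializing intermediates."""
    return group_by_key(double_values(filter_positive(partition)))


batch: List[Row] = [("a", 1), ("b", -2), ("a", 3), ("c", 4)]
partitions = partition_batch(batch, num_partitions=2)
results = [run_pipeline(p) for p in partitions]
```

In this sketch each partition could be handed to a different computer, since the pipeline for one partition does not depend on any other partition.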
Existing distributed computing systems execute query processing tasks in accordance with a static set of resources and a static sequence of operator execution. FIG. 1 illustrates a distributed computing workflow utilized in accordance with the prior art. A daily statistics collector 1 produces statistics regarding source data (e.g., tables) in the distributed computing environment. This results in data distribution statistics 2. A parser 3 parses a query (e.g., a task) to be computed in the distributed computing environment. The parsed or divided query is then processed by a compiler 4. The compiler divides the task into operators. This operation relies upon the data distribution statistics 2 and execution statistics. In particular, the compiler uses sophisticated compilation strategies to generate the best distributed processing resource utilization plan for the operators. The operators are then executed 5. Execution statistics are then generated and stored.
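The static workflow described above may be sketched as follows. This is a simplified illustration under stated assumptions, not the actual prior art implementation: the query syntax, the sizing rule `ROWS_PER_WORKER` and the names `collect_statistics`, `parse`, `compile_plan`, `execute` and `PlanStep` are hypothetical, and the toy compiler consults only data distribution statistics, not prior execution statistics.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

ROWS_PER_WORKER = 100  # hypothetical sizing rule, assumed for illustration


@dataclass(frozen=True)
class PlanStep:
    operator: str
    table: str
    workers: int  # resource allocation fixed at compile time


def collect_statistics(tables: Dict[str, List[int]]) -> Dict[str, int]:
    """Statistics collector: snapshot of row counts at collection time."""
    return {name: len(rows) for name, rows in tables.items()}


def parse(query: str) -> Tuple[List[str], str]:
    """Toy parser for a made-up syntax: 'op1,op2,...:table'."""
    operators, table = query.split(":")
    return operators.split(","), table


def compile_plan(operators: List[str], table: str,
                 stats: Dict[str, int]) -> List[PlanStep]:
    """Compiler: fix the operator order and worker count from past statistics."""
    workers = max(1, math.ceil(stats[table] / ROWS_PER_WORKER))
    return [PlanStep(op, table, workers) for op in operators]


def execute(plan: List[PlanStep],
            tables: Dict[str, List[int]]) -> List[Dict[str, object]]:
    """Executor: run the steps in their static order and record
    execution statistics; the plan is never revised along the way."""
    execution_stats = []
    for step in plan:
        actual_rows = len(tables[step.table])
        execution_stats.append({
            "operator": step.operator,
            "workers": step.workers,
            "rows_per_worker": actual_rows / step.workers,
        })
    return execution_stats


tables = {"orders": list(range(200))}
stats = collect_statistics(tables)    # snapshot taken at 200 rows
tables["orders"].extend(range(800))   # data grows to 1000 rows afterwards

operators, table = parse("scan,filter,aggregate:orders")
plan = compile_plan(operators, table, stats)
execution_stats = execute(plan, tables)
```

Because the plan is sized from the earlier snapshot, every step in this sketch is allotted two workers even though the table has since grown fivefold, so each worker handles 500 rows rather than the intended 100.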
The technique illustrated in FIG. 1 relies upon data distribution statistics characterizing past operation of the distributed computing environment. In other words, the execution plan does not rely upon the current state of the distributed computing environment. The execution plan also relies upon a static resource allocation based upon past network performance. In addition, a static order of operator execution is utilized. Because of its static nature, this approach cannot adapt to existing conditions in the distributed computing environment.
The preceding paragraphs discussed query processing in particular because query processing has the most formal model of execution. However, the problem of static resource allocation applies to distributed programs in general.
It would be desirable to execute tasks in a distributed computing environment in a manner that addresses the existing state of the environment. More particularly, it would be desirable to dynamically allocate resources in a distributed computing environment in response to discontinuous operator execution that surveys existing conditions in a distributed computing environment.