Nowadays computer systems are increasingly employing parallel computing techniques. This refers to the case where multiple separate processing units are provided in parallel. For instance, parallelism can be implemented by means of a processor comprising an arrangement of multiple tiles on the same chip (i.e. same die), each tile comprising its own separate respective processing unit and memory (including program memory and data memory). Thus separate portions of program code can be run in parallel on different ones of the tiles. The tiles are connected together via an on-chip interconnect which enables the pieces of code run on the different tiles to communicate with one another between tiles. In some cases, multiple different processors on different chips may be connected together via an external interconnect with each processor comprising multiple tiles. Hence it is possible to connect together multiple independent processing resources with a high degree of parallelism.
An example application of parallel processing is found in machine intelligence. As will be familiar to those skilled in the art of machine intelligence, a machine intelligence algorithm is based around performing iterative updates to a “knowledge model”, which can be represented by a graph of multiple interconnected nodes. Each node represents a function of its inputs. Some nodes receive the inputs to the graph and some receive inputs from one or more other nodes, whilst the output of some nodes form the inputs of other nodes, and the output of some nodes provide the output of the graph (and in some cases a given node may even have all of these: inputs to the graph, outputs from the graph and connections to other nodes). Further, the function at each node is parameterized by one or more respective parameters, e.g. weights. During a learning stage the aim is, based on a set of experiential input data, to find values for the various parameters such that the graph as a whole will generate a desired output for a range of possible inputs. Various algorithms for doing this are known in the art, such as a back propagation algorithm based on stochastic gradient descent. Over multiple iterations based on the input data, the parameters are gradually tuned to decrease their errors, and thus the graph converges toward a solution. In a subsequent stage, the learned model can then be used to make predictions of outputs given a specified set of inputs or to make inferences as to inputs (causes) given a specified set of outputs.
The implementation of each node will involve the processing of data, and the interconnections of the graph correspond to data to be exchanged between the nodes. Typically, at least some of the processing of each node can be carried out independently of some or all others of the nodes in the graph, and therefore large graphs expose great opportunities for parallelism.
However, as computer systems grow beyond simple single-processor, single-core devices, it becomes necessary to determine how the program is to be split between the different parallel resources. Conventionally this is either specified manually by the programmer, or alternatively certain tools exist which attempt to divide the compute processing burden evenly across the parallel processing resources.