Parallelism in computing takes different forms. Program fragments may be organised to execute concurrently (where they overlap in time but may share execution resources) or in parallel where they execute on different resources possibly at the same time.
Parallelism in computing can be achieved in a number of ways, such as by means of an array of multiple interconnected processor tiles, or a multi-threaded processing unit, or indeed a multi-tile array in which each tile comprises a multi-threaded processing unit.
When parallelism is achieved by means of a processor comprising an array of multiple tiles on the same chip (or chips in the. same integrated circuit package), each tile comprises its own separate respective processing unit with local memory (including program memory and data memory). Thus separate portions of program code can be run concurrently on different tiles. The tiles are connected together via an on-chip interconnect which enables the code run on the different tiles to communicate between tiles. In some cases the processing unit on each tile may take the form of a barrel-threaded processing unit (or other multi-threaded processing unit). Each tile may have a set of contexts and an execution pipeline such that each tile can run multiple interleaved threads concurrently.
In general, there may exist dependencies between the portions of a program running on different tiles in the array. A technique is therefore required to prevent a piece of code on one tile running ahead of data upon which it is dependent being made available by another piece of code on another tile. There are a number of possible schemes for achieving this, but the scheme of interest herein is known as “bulk synchronous parallel” (BSP). According to BSP, each tile performs a compute phase and an exchange phase in an alternating manner. During the compute phase each tile performs one or more computation tasks locally on tile, but does not communicate any results of its computations with any others of the tiles. In the exchange phase each tile is allowed to exchange one or more results of the computations from the preceding compute phase to and/or from one or more others of the tiles in the group, but does not yet begin a new compute phase until that tile has finished its exchange phase. Further, according to this form of BSP principle, a barrier synchronization is placed at the juncture transitioning from the compute phase into the exchange phase, or transitioning from the exchange phases into the compute phase, or both. That is it say, either: (a) all tiles are required to complete their respective compute phases before any in the group is allowed to proceed to the next exchange phase, or (b) all tiles in the group are required to complete their respective exchange phases before any tile in the group is allowed to proceed to the next compute phase, or (c) both. When used herein the phrase “between a compute phase and an exchange phase” encompasses all these options.
An example use of multi-threaded and/or multi-tiled parallel processing is found in machine intelligence. As will be familiar to those skilled in the art of machine intelligence, machine intelligence algorithms “are capable of producing knowledge models” and using the knowledge model to run learning and inference algorithms. A machine intelligence model incorporating the knowledge model and algorithms can be represented as a graph of multiple interconnected nodes. Each node represents a function of its inputs. Some nodes receive the inputs to the graph and some receive inputs from one or more other nodes. The output activation of some nodes form the inputs of other nodes, and the output of some nodes provide the output of the graph, and the inputs to the graph provide the inputs to some nodes. Further, the function at each node is parameterized by one or more respective parameters, e.g. weights. During a learning stage the aim is, based on a set of experiential input data, to find values for the various parameters such that the graph as a whole will generate a desired output for a range of possible inputs. Various algorithms for doing this are known in the art, such as a back propagation algorithm based on stochastic gradient descent. Over multiple iterations the parameters are gradually tuned to decrease their errors, and thus the graph converges toward a solution. In a subsequent stage, the learned model can then be used to make predictions of outputs given a specified set of inputs or to make inferences as to inputs (causes) given a specified set of outputs, or other introspective forms of analysis can be performed on it.
The implementation of each node will involve the processing of data, and the interconnections of the graph correspond to data to be exchanged between the nodes. Typically, at least some of the processing of each node can be carried out independently of some or all others of the nodes in the graph, and therefore large graphs expose opportunities for huge parallelism.