A multi-threaded processor is a processor which is capable of executing multiple program threads alongside one another. The processor may comprise some hardware that is common to the multiple different threads (e.g. a common instruction memory, data memory and/or execution unit); but to support the multi-threading, the processor also comprises some dedicated hardware specific to each thread.
The dedicated hardware comprises at least a respective context register file for each of the number of threads that can be executed at once. A “context”, when talking about multi-threaded processors, refers to the program state of a respective one of the threads being executed alongside one another (e.g. program counter value, status and current operand values). The context register file refers to the respective collection of registers for representing this program state of the respective thread. Registers in a register file are distinct from general purpose memory in that register addresses are fixed as bits in instruction words, whereas memory addresses can be computed by executing instructions. The registers of a given context typically comprise a respective program counter for the respective thread, and a respective set of operand registers for temporarily holding the data acted upon and output by the respective thread during the computations performed by that thread. Each context may also have a respective status register for storing a status of the respective thread (e.g. whether it is paused or running). Thus each of the currently running threads has its own separate program counter, and optionally operand registers and status register(s).
One possible form of multi-threading is parallelism. That is, as well as multiple contexts, multiple execution pipelines are provided: i.e. a separate execution pipeline for each stream of instructions to be executed in parallel. However, this requires a great deal of duplication in terms of hardware.
Instead therefore, another form of multi-threaded processor employs concurrency rather than parallelism, whereby the threads share a common execution pipeline (or at least a common part of a pipeline) and different threads are interleaved through this same, shared execution pipeline. Performance of a multi-threaded processor may still be improved compared to no concurrency or parallelism, thanks to increased opportunities for hiding pipeline latency. Also, this approach does not require as much extra hardware dedicated to each thread as a fully parallel processor with multiple execution pipelines, and so does not incur so much extra silicon.
One form of parallelism can be achieved by means of a processor comprising an arrangement of multiple tiles on the same chip (i.e. same die), each tile comprising its own separate respective processing unit and memory (including program memory and data memory). Thus separate portions of program code can be run in parallel on different ones of the tiles. The tiles are connected together via an on-chip interconnect which enables the code run on the different tiles to communicate between tiles. In some cases the processing unit on each tile may itself run multiple concurrent threads on tile, each tile having its own respective set of contexts and corresponding pipeline as described above in order to support interleaving of multiple threads on the same tile through the same pipeline.
In general, there may exist dependencies between the portions of a program running on different tiles. A technique is therefore required to prevent a piece of code on one tile running ahead of data upon which it is dependent being made available by another piece of code on another tile. There are a number of possible schemes for achieving this, but the scheme of interest herein is known as “bulk synchronous parallel” (BSP). According to BSP, each tile performs a compute phase and an exchange phase in an alternating cycle. During the compute phase each tile performs one or more computation tasks locally on tile, but does not communicate any results of its computations with any others of the tiles. In the exchange phase each tile is allowed to exchange one or more results of the computations from the preceding compute phase to and/or from one or more others of the tiles in the group, but does not yet proceed to the next compute phase. Further, according to the BSP principle, a barrier synchronization is placed at the juncture transitioning from the compute phase into the exchange phase, or transitioning from the exchange phase into the compute phase, or both. That is it say, either: (a) all tiles are required to complete their respective compute phases before any in the group is allowed to proceed to the next exchange phase, or (b) all tiles in the group are required to complete their respective exchange phases before any tile in the group is allowed to proceed to the next compute phase, or (c) both. In some scenarios a tile in the compute phase may be allowed to communicate with other system resources such as a network card or storage disk, as long as no communication with other tiles in the group is involved.
An example use of multi-threaded and/or multi-tiled processing is found in machine intelligence. As will be familiar to those skilled in the art of machine intelligence, a machine intelligence algorithm is based around performing iterative updates to a “knowledge model”, which can be represented by a graph of multiple interconnected nodes. Each node represents a function of its inputs. Some nodes receive the inputs to the graph and some receive inputs from one or more other nodes, whilst the output of some nodes form the inputs of other nodes, and the output of some nodes provide the output of the graph (and in some cases a given node may even have all of these: inputs to the graph, outputs from the graph and connections to other nodes). Further, the function at each node is parameterized by one or more respective parameters, e.g. weights. During a learning stage the aim is, based on a set of experiential input data, to find values for the various parameters such that the graph as a whole will generate a desired output for a range of possible inputs. Various algorithms for doing this are known in the art, such as a back propagation algorithm based on stochastic gradient descent. Over multiple iterations based on the input data, the parameters are gradually tuned to decrease their errors, and thus the graph converges toward a solution. In a subsequent stage, the learned model can then be used to make predictions of outputs given a specified set of inputs or to make inferences as to inputs (causes) given a specified set of outputs.
The implementation of each node will involve the processing of data, and the interconnections of the graph correspond to data to be exchanged between the nodes. Typically, at least some of the processing of each node can be carried out independently of some or all others of the nodes in the graph, and therefore large graphs expose great opportunities for concurrency and/or parallelism.