The invention relates to multi-thread processing, more especially to the handling of threads of control within a computer architecture that supports multi-thread processing in hardware.
Multiprogramming, or running multiple processes on a computer having fewer actual processors via time-slicing under operating-system control has been used on computer systems for a long time. Traditionally processes have distinct address spaces and are only permitted to interact via tightly constrained means.
More recently it has become common to provide for separating individual processes into threads of control which share more resources, e.g. address spaces, and so have faster communication between them to enable cooperation on a job, yet still be runnable in parallel on multiple processors. Several existing operating systems, such as Unix and SunOS, can support multi-threading by explicit or compiler-generated programming, whereby multiple, apparently concurrent threads are supported by time-slicing upon fewer actual processors.
An additional use of threads is to follow two or more possible branches of a process. For example, if a process or thread makes a control-path decision based on a binary result which is not yet established, the thread may be copied into two threads so that each can follow one possible value of the binary result. This is referred to as thread duplication. Duplicated threads may run in parallel if two processors are available. The duplicated threads may subsequently replicate further, depending on the subsequent instructions of the thread concerned. Once the binary result becomes established, the thread or threads which followed from that one of the binary results now known to be incorrect can be resolved, i.e. terminated.
To support multi-threading, a computer system thus needs a mechanism by which threads can be managed, including means for creating and terminating threads. The general goal of the designer is, of course, to provide an efficient processing capability.
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Features of the dependent claims may be combined with those of the independent claims as appropriate and in combinations other than those explicitly set out in the claims.
According to one aspect of the invention a plurality of processing units is provided which are interconnected by communication paths via which the contents of active threads can be transferred between processing units under the control of logic and driver circuits which form part of the processing units and which are arranged to act as intelligent transceiver elements controlling transfer of active threads over the inter-unit communication paths. In operation, thread duplication, for example, will proceed by running one of the threads on the same processing unit and by running the other of the threads on another processing unit. To do this the entire thread state of the originating processing unit is copied to the other processing unit.
Transferring entire thread state of threads from one processing unit to another while the threads are current provides a mechanism by which processing activity in a multi-thread process can be distributed among plural processing units so as to improve processing performance.
A thread transfer may be effected in the course of thread replication or migration. Thread duplication is a special case of thread replication which may involve the splitting of a thread into any plural number of threads. Thread migration involves the movement of a thread of control from one processing unit to another. A thread migration operation is effected by copying the entire state of the originating processing unit to the recipient processing unit and then terminating the thread in the originating processing unit.
In a preferred embodiment of the invention, the concept of activity values is used to provide each processing unit with information that can be used to decide on thread transfer to neighbor processing units. Activity values are passed between neighboring processing units via activity signals to provide each processing unit with an indication of the level of processing activity of the processing units in its vicinity.
For example, if a processing unit is running a thread and encounters an operation on an unknown binary result, then the processing unit may wish to consider whether (a) to generate a thread duplication by transferring one thread to a neighbor processing unit, (b) to stall the thread, or (c) to continue with only one of the possible two branches. To consider option (a), the processing unit can look at the activity values received from each of its neighbor processing units. One or more of the activity values may indicate that the neighbor processing unit concerned is already busy, i.e. running a thread of its own, in which case thread transfer to that processing unit would be excluded. On the other hand, so long as at least one of the received activity values indicates availability to run a thread, there is at least one neighbor processor to which a thread could be transferred. Only if all the received activity values indicate xe2x80x9cbusyxe2x80x9d-ness, will it be necessary to consider options (b) and (c).
An important advantage of the processing unit designs herein described is their scaleability. Scaleability is possible, because each processor unit is autonomous as regards decisions on thread transfer. In particular, the activity value concept as herein described is scaleable provided that the time scales associated with the distribution of activity value signals within the array of processing units are kept shorter than the time scales associated with thread propagation within the array.
Not only may the activity value convey information on nearest, i.e. direct, neighbor processing units but, preferably, also on processing units less directly connected to the processing unit concerned, for example second neighbors or more distant neighbors. This is possible with digital activity values having more than one bit.
The composition of the activity value will depend on the protocol applied by each processing unit for generating activity values for Output. Generally, if a processing unit is active, i.e. currently running a thread, then it will output a BUSY activity value to each of its neighbor processing units. On the other hand, if the processing unit is inactive, i.e. not currently running a thread, then it will output a NOT BUSY activity value to each of its neighbor processing units. Preferably, there will be a plurality of NOT BUSY activity value levels. In this case, the different NOT BUSY activity values may indicate activity levels of processing units in the immediate or more remote proximity of the processing unit concerned and may carry a plurality of contributions from different processing units weighted according not only to proximity, i.e. distance, but also relative position, i.e. direction. The activity value may thus be a composite entity, which, although it will most usually be a scalar quantity, may contain vectorial information. A number of different activity value protocols are discussed further below.
The activity value may be discrete or continuous. In a system designed to operate with discrete activity values, bi-directional digital communication paths are provided between neighboring processing units. In one embodiment, the activity values are discrete, adopting values between 0 and 15, and are conveyed via 4-bit bi-directional communication paths. Other discrete value sets of activity values may be adopted using a multi-bit communication paths of appropriate width. In one embodiment, a two-bit communication path is used to convey a four-state activity value. A primitive implementation, adopted in another embodiment, has only a single bit of width for conveying a two-state activity value. In this case, the activity value conveys only neighbor activity information. Embodiments with a single bit activity value can be implemented with only a single wire.
A multi-thread computer system as herein described may be embodied in a wide variety of topographies. The topography of a system is defined by the neighbor structure of the processing units, i.e. by the communication paths interconnecting the processing units. To give a few examples, the processing units may be connected:
(a) in a line having two ends;
(b) in a ring;
(c) in a square or rectangular mesh;
(d) in an open-ended cylinder;
(e) in a cubic or diamond mesh; or
(f) in a torus.
The processing units may be embodied using a plurality of elements referred to as thread units (TU""s) and one or more sets of elements referred to as execution units (EU""s). Each set of EU""s is associated with a group of TU""s such that a TU of that group only issues instructions to that EU set. The TU groups may consist of one, two or more TU""s. For example, in one embodiment there is a linear array of TU""s and an associated linear array of EU sets, thus providing a one-to-one mapping between TU""s and EU sets. In another embodiment, also based on a linear array of TU""s, there is a three-to-one mapping between TU""s and EU sets, each EU set being shared by three neighbor TU""s. Each EU set preferably includes at least one of each type of execution unit of the system so that each EU set supports all architected instructions. Each processing unit is thus subdivided into a thread unit for handling instruction access and issue, and inter-thread unit communication and an associated execution unit set (which may be shared among several TU""s) for performing the main computational work on active threads of control.
Each TU may be implemented with: a full set of architected registers, or a set of implementation registers plus rename facility to present the appearance thereof; an instruction queue or buffer; an instruction fetch/decode unit; a load/store unit; a branch unit; and an issue or dispatch unit. Internal caches may also be provided. Each TU may be configured to maintain a binary status xe2x80x9cactivexe2x80x9d/xe2x80x9cinactivexe2x80x9d, representing thread processing activity of that TU. A TU in the xe2x80x9cinactivexe2x80x9d state will not fetch or dispatch instructions.
In a processor system capable of tracking in hardware multiple threads of control and having multiple sets of hardware registers to support the threads, it is thus possible to implement speculative execution beyond branches by:
at the unresolved branch duplicating the thread context into an available register set;
executing both threads in parallel until said branch resolves;
canceling the unneeded thread on the basis of the branch resolution, thereby freeing its resources (including register set and, optionally, any operations in progress); and
continuing with the needed thread untouched.
It is possible to extend this procedure to many levels of branch speculation, as many as there are register sets implemented. If upon attempting to execute a branch no register set or other thread-related resource is free, e.g. because several levels of speculation are already running and/or because several processes are running concurrently, then the thread can be stalled, as in a non-speculative execution processing system.
The available processing resources can thus be used effectively, both in a single process environment (by extensive speculative execution and dynamically creating and abandoning threads) and in a multi-process environment.