The invention relates to a parallel multithread processor (PMT) with split contexts, which contains two or more parallel-connected standard processor root units (SPRE) for executing program instructions for different threads. The multithread processor has two or more context memories, which each temporarily store a current processor state of a thread which is currently being executed, and a thread monitoring unit, by means of which each standard processor root unit can be connected to each context memory.
Embedded processors and their architectures are measured by their computation power, their power consumption, their throughput, their costs and their real-time capability. The principle of pipelining is used in order to increase the throughput and the processor speed. The basic idea of pipelining is to subdivide any desired instructions or commands into processing phases with the same time duration. A pipeline with different processing elements is possible when the processing of an instruction can itself be split into a number of phases with disjoint process steps which can be carried out successively (as described in T. Bayerlein, O. Hagenbruch: “Taschenbuch Mikroprozessortechnik” [Microprocessor technology manual], 2nd Edition Fachbuchverlag Leipzig from the Karl Hanser Verlag Munich-Vienna, ISBN 3-446-21686-3). The original two instruction execution phases of the von Neumann model, specifically instruction fetching and instruction processing, are in this case further subdivided, since this subdivision into two phases has been found to be too coarse for pipelining. The pipeline variant which is essentially used for RISC processors contains four instruction processing phases, specifically instruction fetching, instruction decoding/operand fetching, instruction execution and write-back.
In the first phase, the instruction fetching phase, the instructions are loaded from the memory into a pipeline register for the processor.
The second instruction phase, which is an instruction decoding/operand fetching phase, comprises two data-independent process steps, specifically instruction decoding and fetching operands. In the instruction decoding step, the data coded in the instruction code is decoded in a first data processing step. In this case, the operation rule (opcode), the addresses of the operands to be loaded, the type of addressing and further additional signals which significantly influence the subsequent instruction execution phases are determined in a known manner. In the operand fetching processing step, all of the addressed operands are loaded from the registers for the processor.
In the third instruction phase, the instruction execution phase, the computation operations are executed in accordance with the decoded commands or instructions. The operation itself, as well as the circuit paths and processor registers used in the process, depend essentially on the nature of the instruction to be processed.
The results of the operations, including so-called additional signals, status flags or flags, are stored in the appropriate registers or external memories in a known manner in the fourth and final phase, the write-back phase. This phase completes the arithmetic processing of a machine command or machine instruction.
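The four instruction processing phases described above can be sketched as a short simulation. The instruction encoding, the register names and the two example opcodes below are illustrative assumptions for this sketch, not part of the processor described:

```python
# Minimal sketch of the four-phase RISC pipeline described above:
# instruction fetching, instruction decoding/operand fetching,
# instruction execution and write-back. The instruction format and
# register file used here are illustrative assumptions.

def fetch(memory, pc):
    """Phase 1: load the instruction at the program counter into a pipeline register."""
    return memory[pc]

def decode_and_fetch_operands(instruction, registers):
    """Phase 2: decode the opcode and load the addressed operands from the registers."""
    opcode, dst, src1, src2 = instruction
    return opcode, dst, registers[src1], registers[src2]

def execute(opcode, op1, op2):
    """Phase 3: carry out the computation operation selected by the opcode."""
    ops = {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b}
    return ops[opcode](op1, op2)

def write_back(registers, dst, result):
    """Phase 4: store the result of the operation in the destination register."""
    registers[dst] = result

# Run one instruction through all four phases in succession.
memory = {0: ("ADD", "r2", "r0", "r1")}
registers = {"r0": 3, "r1": 4, "r2": 0}
opcode, dst, op1, op2 = decode_and_fetch_operands(fetch(memory, 0), registers)
write_back(registers, dst, execute(opcode, op1, op2))
print(registers["r2"])  # → 7
```

In a real pipeline these four phases overlap, with each unit working on a different instruction in the same clock cycle; here they are simply called in sequence for one instruction.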
A conventional computer architecture for a von Neumann machine has a standard processor unit SPE which processes instructions or program commands for a thread T (monitoring thread).
A standard processor unit SPE essentially has an instruction fetch unit BHE, an instruction decoding/operand fetch unit BD/OHE, an instruction execution unit BAE and a write-back unit ZSE, as well as a pipeline register, a processor monitoring unit PKE, register files K, a program counting register PZR and a data bus DB.
A thread T denotes a monitoring thread for a code, a source code or a program, with data relationships existing within a thread T and with weak data relationships existing between different threads T (as described in Chapter 3 of T. Bayerlein, O. Hagenbruch: “Taschenbuch Mikroprozessortechnik” [Microprocessor technology manual], 2nd Edition Fachbuchverlag Leipzig from the Karl Hanser Verlag Munich-Vienna, ISBN 3-446-21686-3).
One characteristic of a process is that a process always accesses its own memory area. A process comprises two or more threads. A thread is accordingly a program part of a process. A context of a thread is the processor state of a processor which is processing this thread or instructions for this thread. The context of a thread is accordingly defined as a temporary processor state during the processing of the thread by this processor. The context is held by the hardware of the processor, specifically the program counting register PZR or program counter PC, the register file or context memory K, and the status register SR associated with it.
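The context of a thread, as defined above, can be represented by the three hardware elements named in the text: the program counting register PZR, the register bank, and the associated status register SR. The field names, the register count of sixteen and the flag names below are illustrative assumptions:

```python
from dataclasses import dataclass, field

# Sketch of a thread context as described above: program counting
# register PZR (program counter PC), register bank (context memory K)
# and the associated status register SR. Sixteen registers and the
# flag names are illustrative assumptions.

@dataclass
class Context:
    pc: int = 0  # program counting register PZR / program counter PC
    register_bank: list = field(default_factory=lambda: [0] * 16)  # register file K
    status_register: dict = field(
        default_factory=lambda: {"zero": False, "carry": False}  # status register SR
    )

# One context per thread; each represents an independent temporary processor state.
ctx_a = Context(pc=0x100)
ctx_b = Context(pc=0x200)
print(ctx_a.pc != ctx_b.pc)  # independent program counters → True
```

Switching the processor from one thread to another then amounts to switching which of these context objects the pipeline reads from and writes to.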
FIG. 1 shows an example in order to explain the problems with conventional von Neumann machines with regard to their throughput and their blocking probability. The von Neumann machine or standard processor unit SPE illustrated in FIG. 1 has, corresponding to the subdivision of an instruction I into various instruction phases, a hardware subdivision (not shown) of the processor into different components, an instruction fetch unit BHE, an instruction decoding/operand fetch unit BD/OHE, an instruction execution unit BAE and a write-back unit ZSE. The standard processor unit SPE uses the pipeline method for processing, with the instruction decoding/operand fetch unit BD/OHE for the standard processor unit SPE being illustrated in each clock cycle t1, t2, tx in FIG. 1, in order to illustrate the problems.
The instruction decoding/operand fetch unit BD/OHE for the standard processor unit SPE decodes a j-th instruction Ikj for a thread T-k in each clock cycle ti. Based on the example shown in FIG. 1, the instruction decoding/operand fetch unit BD/OHE processes the first instruction IA1 for the thread T-A in the clock cycle t1, and decodes the second instruction IA2 for the thread T-A in the clock cycle t2. Because, for example, operands relating to this instruction IA2 have to be fetched and waited for, the instruction IA2 for the thread T-A results in blocking of the pipeline for the standard processor unit SPE.
Since it has not been possible for the instruction decoding/operand fetch unit BD/OHE to process any instructions for a certain number of clock cycles owing to the latency time caused by the instruction IA2, the instruction decoding/operand fetch unit does not decode the instruction IA3 until the clock cycle tx.
The standard processor unit SPE according to the prior art processes commands and instructions for a specific thread T-A at each time (here: t1, t2, tx). This simple computer architecture results in a problem when the thread T-A to be processed is temporarily blocked. There may be various reasons for temporary blocking of a thread T, for example as a result of a latency time when accessing an external memory or an internal register.
Temporary blocking of the standard processor unit SPE occurs when the processor pipeline or the instruction decoder cannot process any further program command or any further instruction for a thread T, or cannot decode them.
If a thread T which has been processed by the standard processor unit SPE is temporarily blocked, then the standard processor unit SPE is also blocked for this time. Blocking considerably reduces the performance and throughput of a processor. The probability of a thread T being blocked by internal blocking (for example as a result of a latency time for a register access) or by external blocking (for example in the event of a cache miss), is referred to as the blocking probability pT for that thread. If it is assumed that the blocking probability for a thread T is pT, then the blocking probability pVN for the von Neumann machine illustrated schematically in FIG. 1 is likewise pT.
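The statement that pVN equals pT can be illustrated numerically: since a von Neumann machine executes a single thread, the processor is blocked in exactly those cycles in which that thread is blocked. The per-cycle blocking model and the probability value below are illustrative assumptions:

```python
import random

# Illustration of the statement above: a von Neumann machine processes
# a single thread, so whenever that thread is blocked (modelled here as
# an independent per-cycle event with probability pT, an assumption for
# this sketch), the whole processor is blocked; hence pVN = pT.

def blocked_fraction(p_thread, cycles, seed=0):
    """Fraction of cycles in which the single thread, and thus the SPE, is blocked."""
    rng = random.Random(seed)  # seeded for a reproducible run
    blocked = sum(1 for _ in range(cycles) if rng.random() < p_thread)
    return blocked / cycles

p_t = 0.3
p_vn = blocked_fraction(p_t, cycles=100_000)
print(abs(p_vn - p_t) < 0.01)  # the measured fraction approaches pT → True
```

Every blocked cycle of the thread is a blocked cycle of the processor, which is exactly the loss of performance and throughput described above.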
A multithread processor MT is a processor which uses hardware to process a number of contexts by providing a program counting register PZR, a register bank RB and a status register SR for each of these contexts.
FIG. 2a illustrates, schematically, a conventional multithread processor MT, in which a standard processor unit SPE processes a number of threads T or monitoring threads, that is to say light-weight tasks with separate program codes and common data areas. In FIG. 2a, without any restriction to generality, the threads T-A, T-B represent any given number N of threads and are hard-wired to the standard processor unit SPE within a multithread processor MT, thus ensuring more efficient switching between individual threads T. This reduces the blocking probability pMT for a multithread processor MT in comparison to the blocking probability pVN of a von Neumann machine (FIG. 1) for the same thread blocking probability pT, since waiting times resulting from memory operations no longer leave the processor idling inefficiently. In order to illustrate the described situation, FIG. 2a shows the instruction decoding/operand fetch unit BD/OHE for the standard processor unit SPE at the times t1, t2, t5 during the processing of instructions IAj (T-A) for the thread T-A and of instructions IBj (T-B) for the thread T-B. At the time t2, the thread T-A is a blocked thread T*-A (for the reasons mentioned above), and is substituted by the thread T-B, which is likewise hard-wired to the standard processor unit SPE. From then on (after the time t5), the standard processor unit SPE processes instructions IBj for the thread T-B until the thread T-B is blocked. Furthermore, if pT denotes the blocking probability for a thread T, the blocking probability pMT for a multithread processor MT depends on the number N of context memories K-1 to K-N which are hard-wired to a standard processor unit SPE, with each context memory K-k providing the hardware implementation of the context for a thread T-k. On the assumption of statistical independence of the individual threads T, which each represent independent data areas, the blocking probability pMT of the multithread processor MT is pT^N.
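The blocking probabilities just given can be checked with a few lines of arithmetic: under the stated assumption of statistically independent threads, a multithread processor with N hard-wired contexts is blocked only when all N threads are blocked at the same time. The probability value 0.5 is an illustrative assumption:

```python
# Numerical illustration of the blocking probabilities given above,
# assuming statistically independent threads: the multithread processor
# MT is blocked only if all N hard-wired threads are blocked at once.

def p_von_neumann(p_thread):
    """pVN = pT: a von Neumann machine is blocked whenever its single thread is."""
    return p_thread

def p_multithread(p_thread, n_contexts):
    """pMT = pT^N: all N independent threads must be blocked simultaneously."""
    return p_thread ** n_contexts

p_t = 0.5  # assumed thread blocking probability for the example
for n in (1, 2, 4):
    print(n, p_multithread(p_t, n))
# N=1 gives 0.5 (the von Neumann case), N=2 gives 0.25, N=4 gives 0.0625
```

Each additional hard-wired context therefore multiplies the blocking probability by pT, which is the reduction claimed in the text.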
Although the blocking probability pMT is accordingly reduced, the throughput is increased only insignificantly in comparison to a von Neumann machine as shown in FIG. 1.
If the pipeline for the standard processor unit SPE becomes blocked when carrying out an instruction or a program command for a thread T, signals which can be used to control the thread switching are generated within the SPE.
FIG. 2b shows a schematic block diagram of a conventional multithread processor MT according to the prior art with a connection to an instruction memory BS and a connection to a data bus DB. A conventional multithread processor MT according to the prior art essentially has a standard processor unit SPE, N context memories K for N threads T, a thread monitoring unit TKE and a processor monitoring unit PKE. The said components of the multithread processor MT are essentially connected to one another via data lines DL, address lines AL and multiplexers M′, M″. Each of the N context memories K has a program counting register PZR, a register bank RB and a status register SR. An N×1 multiplexer M′ is used to apply data from the N program counting registers PZR for the N context memories K to an address line AL, which is connected to the instruction memory BS. Based on these addresses, the instructions I for the respective threads T which are stored in the instruction memory BS are fetched from the instruction memory BS, and are made available to the standard processor unit SPE. The standard processor unit SPE has an instruction decoding/operand fetch unit BD/OHE, an instruction execution unit BAE and a write-back unit ZSE, corresponding to the phases of the pipeline method.
The instructions I, which result from the instruction fetches, for the N threads T are decoded in the instruction decoding/operand fetch unit BD/OHE, and the addressed operands are read from the register bank RB or from the status register SR of the appropriate context memory K for the associated thread T via an N×1 multiplexer M′. The instruction execution unit BAE then executes the decoded instruction I for the thread T with its associated operands, and passes the result of the operation, as well as any flags and additional characters, to the write-back unit ZSE. In the case of a memory instruction, the write-back unit ZSE if necessary writes the received data to the data bus DB or, via a 1×N multiplexer M″, to the respective context memory K. In this case, the results of the arithmetic/logic operations are stored in the associated register bank RB, and the flags and additional characters are stored in the associated status register SR.
The flag or additional characters in the status register SR are provided via an N×1 multiplexer M′ to the processor monitoring unit PKE, which uses internal control signals ISS to control the instruction decoding/operand fetch unit BD/OHE, the instruction execution unit BAE and the write-back unit ZSE.
The thread monitoring unit TKE controls the processes of writing to and reading from the context memories K via the respective multiplexers M′, M″ by means of multiplexer control signals <t>. In a multiplexer control signal <t>, t denotes a thread number for a thread T. The thread monitoring unit TKE uses internal event control signals ESS′ and external event control signals ESS″ as inputs, with the internal event control signals ESS′ being produced by the standard processor unit SPE, and the external event control signals ESS″ being produced by external devices, such as external memories or the program code.
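The thread switching controlled by the thread monitoring unit TKE can be sketched as follows: on a blocking event signalled via ESS′ or ESS″, the unit selects a non-blocked thread and emits the multiplexer control signal <t> carrying its thread number. The round-robin selection policy and the representation of the signals as plain Python values are illustrative assumptions:

```python
# Sketch of the thread switching described above: on an internal (ESS')
# or external (ESS'') blocking event, the thread monitoring unit TKE
# selects a non-blocked thread and outputs the multiplexer control
# signal <t>, i.e. the thread number. The round-robin policy and the
# data representation are illustrative assumptions for this sketch.

def select_thread(current, blocked, n_threads):
    """Return the multiplexer control signal <t>: the next non-blocked thread number."""
    for offset in range(1, n_threads + 1):
        candidate = (current + offset) % n_threads
        if candidate not in blocked:
            return candidate
    return current  # all threads blocked: keep the current selection

# Thread 0 (T-A) raises a blocking event; the TKE switches to thread 1 (T-B),
# as in the substitution of the blocked thread T*-A by T-B in FIG. 2a.
t = select_thread(current=0, blocked={0}, n_threads=2)
print(t)  # → 1
```

The returned number would drive the N×1 and 1×N multiplexers, so that the pipeline transparently reads from and writes to the context memory of the newly selected thread.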
FIG. 3a shows one known development of the prior art, with a number of multithread processors MT in this case being used in parallel. In this case, a dedicated set of context memories K is available to each standard processor SPE, with the context memories K being hard-wired to the respective standard processor unit SPE.
FIG. 3a shows a parallel multithread processor MT, having two standard processor units SPE-A, SPE-B and four context memories for four disjoint threads T-A, T-B, T-C and T-D. The threads T-A, T-B and their hardware implementations and context memories K-A, K-B are permanently connected or hard-wired to the standard processor unit SPE-A. In a corresponding way, the threads T-C, T-D and their context memories K-C, K-D, respectively, are permanently connected or hard-wired to the standard processor unit SPE-B.
FIG. 3b shows two instruction decoding/operand fetch units BD/OHE for two parallel standard processor units SPE-A, SPE-B, with the blocking probability of the overall system being the same as the blocking probability of a multithread processor MT as shown in FIG. 2a, accordingly being pT^N. The throughput of the arrangement shown in FIG. 3b is essentially twice as great as the throughput of the arrangement shown in FIG. 2a. In general, the throughput of M parallel multithread processors is M times as great as that of a single multithread processor MT, although the blocking probability of the overall system increases sharply.
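The throughput/blocking trade-off of the parallel arrangement can be illustrated numerically. Reading the system blocking probability as the probability that at least one of the M pipelines is blocked is an interpretation of the text, not a formula stated in it, and the probability values are illustrative assumptions:

```python
# Numeric illustration of the parallel arrangement of FIG. 3: M standard
# processor units, each hard-wired to N contexts. Per pipeline the
# blocking probability stays pT^N and the throughput scales with M, but
# the probability that at least one pipeline is blocked grows with M.
# The "at least one pipeline blocked" reading of the system blocking
# probability is an interpretation, not a formula stated in the text.

def p_pipeline(p_thread, n_contexts):
    """Blocking probability pT^N of one multithread pipeline."""
    return p_thread ** n_contexts

def p_any_pipeline_blocked(p_thread, n_contexts, m_units):
    """Probability that at least one of M independent pipelines is blocked."""
    return 1 - (1 - p_pipeline(p_thread, n_contexts)) ** m_units

p_t, n = 0.5, 2  # assumed thread blocking probability and contexts per SPE
for m in (1, 2, 4):
    print(m, round(p_any_pipeline_blocked(p_t, n, m), 4))
# M=1 gives 0.25, M=2 gives 0.4375, M=4 gives 0.6836
```

Under these assumptions the throughput grows linearly in M while the chance that some part of the system is stalled grows towards one, which matches the sharp increase described above.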