Embedded processors and their architectures are measured by their computation power, their power consumption, their throughput, their cost and their real-time capability. To increase the throughput and the processor speed, the principle of pipelining is put to use. The basic idea of pipelining is to divide the processing of each program instruction into phases of equal duration. A pipeline with several processing elements is possible when the processing of an instruction can itself be divided into a plurality of phases with disjoint process steps that can be performed in succession. The original two instruction execution phases of the von Neumann model, namely instruction fetching and instruction processing, are divided further in this context, since a division into two phases proves too coarse for pipelining. The pipeline variant essentially applied for RISC processors contains four phases of instruction processing, namely instruction fetch, instruction decoding/operand fetch, instruction execution and writeback.
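The overlap of the four phases can be illustrated with a minimal sketch. It assumes an ideal pipeline in which every phase takes exactly one cycle and no stalls occur; the stage abbreviations and the function name pipeline_schedule are chosen here for illustration only and do not appear in the text above.

```python
# Stage names are shorthand for the four phases named in the text:
# instruction fetch, instruction decoding/operand fetch, execution, writeback.
STAGES = ["IF", "ID/OF", "EX", "WB"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one dict per cycle mapping stage name -> instruction index.

    Assumption: ideal pipeline, one cycle per stage, no stalls or hazards.
    """
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        occupancy = {}
        for stage_index, name in enumerate(stages):
            # Instruction i enters stage s in cycle i + s.
            instr = cycle - stage_index
            if 0 <= instr < num_instructions:
                occupancy[name] = instr
        schedule.append(occupancy)
    return schedule


if __name__ == "__main__":
    for cycle, occupancy in enumerate(pipeline_schedule(5)):
        print(cycle, occupancy)
```

Under these assumptions, n instructions complete in n + 3 cycles rather than 4n, which is the throughput gain that pipelining aims at.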
With regard to instruction processing in program code, a thread T denotes a thread of control in the code, in the source code or in the program; there are data dependencies within a thread T and weak data dependencies between different threads T (as described in section 3 of T. Bayerlein, O. Hagenbruch: “Taschenbuch Mikroprozessortechnik” [Pocket book of microprocessor engineering], 2nd edition Fachbuchverlag Leipzig im Karl-Hanser-Verlag Munich—Vienna, ISBN 3-446-21686-3).
One property of a process is that a process always accesses a dedicated memory area. A process comprises a plurality of threads T. Accordingly a thread T is a program part of a process. A context of a thread is the processor state of a processor which processes this thread T or program instructions from this thread. Accordingly, the context of a thread is defined as a temporary processor state while this processor is processing the thread. The context is held by the hardware of the processor, namely the program count register or program counter, the register file or the context memory and the associated status register.
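The three hardware elements that hold a context can be pictured as a small data structure. The following is a sketch only: the class name Context, the field names and the register bank size of 16 are assumptions made for illustration, since the text does not specify them.

```python
from dataclasses import dataclass, field

NUM_REGISTERS = 16  # assumed register bank size; the text gives no figure


@dataclass
class Context:
    """Processor state held on behalf of one thread T."""
    program_counter: int = 0  # contents of the program count register
    register_bank: list = field(default_factory=lambda: [0] * NUM_REGISTERS)
    status_register: int = 0  # flags produced by earlier instructions


if __name__ == "__main__":
    # Two threads own two independent contexts.
    ctx_a = Context()
    ctx_b = Context(program_counter=100)
    ctx_a.register_bank[1] = 7
    print(ctx_a.register_bank[1], ctx_b.register_bank[1])
```

Because each thread has its own instance, a write to the register bank of one context leaves every other context unchanged.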
While a processor is executing a thread T, the thread may be temporarily blocked. Temporary blockage of a thread T may have various causes, for example latency during access to an external memory or to an internal register.
A processor is temporarily blocked when the processor pipeline is able to process no further program instruction from a thread T.
To solve the problem of temporary blockage, it is known to provide “multithread processors”. A multithread processor is a processor which provides hardware for executing a plurality of contexts by providing a program count register, a register bank and a status register for each of the plurality of contexts.
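How holding several contexts in hardware helps against temporary blockage can be sketched as follows. The round-robin selection policy and the function name next_ready_thread are assumptions for illustration; the text does not state which policy a multithread processor uses to choose the next thread.

```python
def next_ready_thread(blocked, current):
    """Return the index of the next non-blocked thread after `current`.

    `blocked` holds one boolean per thread context. The search wraps around
    and may return `current` itself; None means every thread is blocked.
    Assumption: round-robin order, which the text does not prescribe.
    """
    n = len(blocked)
    for offset in range(1, n + 1):
        candidate = (current + offset) % n
        if not blocked[candidate]:
            return candidate
    return None


if __name__ == "__main__":
    # Thread 1 waits on a memory access, so the processor moves on to thread 2.
    print(next_ready_thread([False, True, False], current=0))
```

Since every context is already resident in hardware, such a switch requires no saving or restoring of registers; the pipeline only stalls when all threads are blocked at once.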
In line with one development of a multithread processor, based on a document by the applicant that was unpublished on the date of application, FIG. 1 shows a block diagram of a parallel multithread processor with shared contexts. The idea underlying the parallel multithread processor with shared contexts is not to divide the N threads or their corresponding N context memories into sets, as in known multithread processors, where each of these sets is directly connected or wired to an associated standard processor root unit, but rather to provide the context memories such that any standard processor root unit can be connected to any context memory.
In this case, the program instructions from the N threads or from the N context memories are dynamically distributed to the M standard processor root units. The N context memories and M standard processor root units are coupled to one another by multiplexers. During execution of the respective program instruction in each pipeline stage of each respective processor or each respective standard processor root unit, the appropriate context is selected by means of a multiplexer. Each program instruction within the standard processor root unit requires the appropriate context. The appropriate context is selected using control signals <t,p>, where t denotes the thread number or thread index and p denotes the standard processor root number or the number of the standard processor root unit.
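The dynamic distribution and the role of the control signals <t,p> can be sketched as follows. The assignment policy (ready threads are served in list order) and the names assign_threads and select_context are assumptions for illustration only.

```python
def assign_threads(ready_threads, num_units):
    """Pair ready thread indices with unit indices, giving <t,p> tuples.

    Assumption: threads are served in the order given; at most `num_units`
    threads can be processed at the same time.
    """
    return [(t, p) for p, t in enumerate(ready_threads[:num_units])]


def select_context(contexts, control_signals, p):
    """Multiplexer behaviour: return the context selected for unit p.

    Returns None when no thread is currently assigned to unit p.
    """
    for t, unit in control_signals:
        if unit == p:
            return contexts[t]
    return None


if __name__ == "__main__":
    contexts = ["KS-0", "KS-1", "KS-2", "KS-3"]  # N = 4 context memories
    signals = assign_threads([0, 2, 3], num_units=2)  # M = 2 units
    print(signals)
    print(select_context(contexts, signals, p=1))
```

The point of the sketch is that no context is tied to a fixed unit: the pairing is recomputed whenever the set of ready threads changes.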
The parallel multithread processor PMT shown in FIG. 1 is coupled to an instruction memory BS and to a data bus DB. In addition, the multithread processor PMT contains M standard processor root units SPRE, N context memories KS, a thread control unit TK, M processor control units PKE, N instruction buffer stores BZS, N×M multiplexers N×M MUX and M×N multiplexers M×N MUX.
Each standard processor root unit SPRE has an instruction decoding/operand fetch unit BD/OHE, an instruction execution unit BAE and a writeback unit ZSE, these units being arranged to process program instructions in line with a pipeline process.
The instruction fetch unit BHE has an M×N multiplexer M×N MUX, N data lines DL, N instruction buffer stores BZS, a further N data lines DL and an N×M multiplexer N×M MUX.
Each standard processor root unit SPRE has an associated processor control unit PKE provided for it. A processor control unit PKE controls the appropriate standard processor root unit SPRE using internal control signals. A first internal control signal intSS′ for the instruction decoding/operand fetch unit controls the instruction decoding/operand fetch unit BD/OHE in this case. A second internal control signal intSS″ for the instruction execution unit controls the instruction execution unit BAE, and a third internal control signal intSS′″ for the writeback unit controls the writeback unit ZSE.
Each context memory KS has a program count register PZR, a register bank RB and a status register SR. The program count register PZR buffer-stores a program counter for a thread T. An N×M multiplexer N×M MUX places the contents of the N program count registers PZR from N threads T onto an M-channel address bus AB.
The M program instructions referenced by the data contents of the program count registers PZR are read from the instruction memory BS by the instruction fetch unit BHE using an M-channel data bus DB. The data contents which have been read are transferred to N instruction buffer stores BZS by means of an M×N multiplexer M×N MUX. Each of the N threads T has an associated instruction buffer store BZS provided for it. An N×M multiplexer N×M MUX is used to place M program instructions from the N instruction buffer stores BZS onto M data lines DL. The M program instructions on the data lines DL are distributed over the M standard processor root units SPRE.
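The instruction fetch path just described can be summarised in a short sketch. Several details are assumptions made for illustration and are not taken from the text: the instruction memory is modelled as a dictionary, program counters advance by one per fetch, and a fetched instruction is dispatched in the same step. The name fetch_cycle is hypothetical.

```python
def fetch_cycle(program_counters, instruction_memory, buffers, control_signals):
    """Perform one fetch step for the threads named in `control_signals`.

    `control_signals` is a list of (t, p) pairs: thread t feeds unit p.
    Returns a dict mapping unit index p to the instruction dispatched to it.
    """
    # N×M multiplexer: place the selected program counters on the address bus.
    address_bus = {p: program_counters[t] for t, p in control_signals}
    # Read the referenced instructions over the M-channel data bus.
    data_bus = {p: instruction_memory[addr] for p, addr in address_bus.items()}
    # M×N multiplexer: route each instruction to the buffer of its thread.
    for t, p in control_signals:
        buffers[t].append(data_bus[p])
        program_counters[t] += 1  # assumed sequential, word-addressed fetch
    # N×M multiplexer: dispatch one buffered instruction to each unit.
    return {p: buffers[t].pop(0) for t, p in control_signals}


if __name__ == "__main__":
    pcs = [0, 10, 20]  # N = 3 threads
    memory = {0: "a0", 1: "a1", 10: "b0", 20: "c0"}
    buffers = [[], [], []]
    print(fetch_cycle(pcs, memory, buffers, [(0, 0), (2, 1)]))
    print(pcs)
```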
The instruction decoding/operand fetch unit BD/OHE-i in the standard processor root unit SPRE-i decodes a program instruction from the thread T-j, for example. The decoded program instruction from the thread T-j contains, inter alia, addresses for operands which are required for the subsequent instruction execution. The addressed data contents of the operands are stored in a context memory KS-j which is associated with the thread T-j, more precisely in the register bank RB-j of the associated context memory KS-j. An N×M multiplexer N×M MUX is used to transfer the data contents of the operands from the register bank RB-j of the context memory KS-j to the instruction decoding/operand fetch unit BD/OHE-i in the standard processor root unit SPRE-i, with the N×M multiplexer N×M MUX being controlled by the thread control unit TK using the multiplexer control signal <t,p>[e]. The multiplexers are controlled by means of the multiplexer control signals <t,p> such that the corresponding context memory KS-j is connected to the appropriate pipeline stage of the appropriate standard processor root unit SPRE-i.
The instruction execution unit BAE-i in the standard processor root unit SPRE-i executes the arithmetic and logic operation contained in the program instruction from the thread T-j using the operands which have been fetched from the register bank RB-j.
When the arithmetic and logic operation with the operands which have been fetched has been performed, the result of the operation or additional characters or flags are placed onto a data line DL by the writeback unit ZSE-i. In the case of a store instruction, the same data contents may additionally be placed onto a further data line DL. The M further data lines DL are provided for coupling the multithread processor PMT to the data bus DB. The data contents of the M results of the M standard processor root units SPRE are transferred to external memories via the data bus DB.
An M×N multiplexer M×N MUX is used to take the result of the operation or additional characters or flags from the first data line DL and to transfer the result of the arithmetic and logic operation to the register bank RB-j of the context memory KS-j and additional characters to the status register SR-j of the context memory KS-j. The data contents of the N status registers SR are transferred to the M processor control units PKE by means of an N×M multiplexer N×M MUX. The processor control unit PKE-i takes the data contents of the status registers SR and calculates internal control signals, namely the internal control signal for the instruction decoding/operand fetch unit intSS′, the internal control signal for the instruction execution unit intSS″ and the internal control signal for the writeback unit intSS′″.
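The path from operand fetch through execution to writeback into the context memory of the owning thread can be sketched as follows. The instruction format (op, dst, src1, src2), the two supported operations and the single zero flag are assumptions for illustration; the text does not define an instruction set or the layout of the status register.

```python
ZERO_FLAG = 0x1  # hypothetical flag bit in the status register


def execute_and_write_back(register_banks, status_registers, t, instruction):
    """Execute one instruction of thread t and write the result back.

    Operands come from register bank RB-t; the result returns to RB-t and
    the flag goes to status register SR-t. Other threads are not touched.
    """
    op, dst, src1, src2 = instruction
    a = register_banks[t][src1]
    b = register_banks[t][src2]
    if op == "add":
        result = a + b
    elif op == "and":
        result = a & b
    else:
        raise ValueError(f"unsupported operation: {op}")
    register_banks[t][dst] = result
    status_registers[t] = ZERO_FLAG if result == 0 else 0
    return result


if __name__ == "__main__":
    banks = [[0, 0, 0, 0], [0, 3, 4, 0]]  # two threads, four registers each
    status = [0, 0]
    print(execute_and_write_back(banks, status, 1, ("add", 3, 1, 2)))
    print(banks, status)
```

Which standard processor root unit performs the operation does not matter in this model, since the thread index alone decides where operands are read and where results are written.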
The thread control unit TK uses the multiplexer control signals <t,p>[a] to <t,p>[f] to control the N×M multiplexers N×M MUX and the M×N multiplexers M×N MUX. The multiplexer control signal <t,p> indicates which thread T-j is processed by which standard processor root unit SPRE-i.
An N×M multiplexer N×M MUX has the function of placing the data from an N-channel data bus onto an M-channel data bus.
An M×N multiplexer M×N MUX has the function of placing the data from an M-channel data bus onto an N-channel data bus.
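The two multiplexer functions can be sketched as a pair of routing functions. The representation of the selection as a list, in which entry p names the thread-side channel t paired with unit-side channel p, is an assumption for illustration, as are the function names.

```python
def mux_n_to_m(n_inputs, select):
    """N×M multiplexer: output channel p carries input channel select[p].

    An entry of None in `select` leaves that output channel empty.
    """
    return [None if t is None else n_inputs[t] for t in select]


def mux_m_to_n(m_inputs, select, n):
    """M×N multiplexer: input channel p is routed to output channel select[p].

    Output channels that no input is routed to carry None.
    """
    outputs = [None] * n
    for p, t in enumerate(select):
        if t is not None:
            outputs[t] = m_inputs[p]
    return outputs


if __name__ == "__main__":
    select = [2, 0]  # unit 0 serves thread 2, unit 1 serves thread 0
    print(mux_n_to_m(["a", "b", "c", "d"], select))
    print(mux_m_to_n(["x", "y"], select, n=4))
```

With the same selection, the two functions are inverse routings: the first carries data from the N thread-side channels to the M units, the second returns results from the M units to the N thread-side channels.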
The M internal event control signals ESS′ contain, inter alia, data about blocked threads T, internal interrupts, waiting times and exception event signals and make these data available to the thread control unit TK.
External event control signals ESS″ are transferred to the thread control unit TK by external devices. Examples of these are external interrupts, which are generated by external devices.
A parallel multithread processor architecture has drawbacks when a task is to be processed which requires the use of various processors or various processor types. Such a task is called a heterogeneous task, since it comprises different threads, which should preferably be processed by processors of different types (e.g. general purpose processor, protocol processor etc.).
A parallel multithread processor is accordingly unsuitable for use as a multilayer network processor.
Even connecting a plurality of parallel multithread processors in parallel would not solve the aforementioned problem, since interprocessor communication between the individual parallel multithread processors would have a disadvantageous effect on the utilization level of the individual parallel multithread processors and of the overall system. In the case of parallel-connected parallel multithread processors, the overall context, that is all of the context memories, would again not be accessible to every standard processor root unit, which would increase the blocking probability for the parallel-connected parallel multithread processors.