FIG. 1 of the accompanying drawings illustrates a previously considered microprocessor architecture which includes a Single Instruction Multiple Data (SIMD) processor array. The processor includes a serial processor 10 operable to handle non-parallel operations such as fetching instructions, branching and performing calculations. The architecture also includes a SIMD array 19 of processing elements 20, which are controlled by an array controller 12. The array controller 12 is in turn controlled by the serial processor 10.
The serial processor 10 receives an instruction stream from an instruction cache (not shown) and executes the retrieved serial instructions. The serial processor 10 issues SIMD instructions to the array controller 12, which decodes and prioritises the received SIMD instructions and sends appropriate control signals to the SIMD array 19 of processing elements (PEs) 20. The SIMD array 19 operates in a known manner, such that each PE 20 carries out the same instructions on data specific to that PE 20.
As illustrated in FIG. 1, the SIMD array 19 comprises an array of processing elements (PEs) 20 which are arranged to operate in parallel. Each PE 20 in the SIMD array 19 contains an Arithmetic Logic Unit (ALU) 22, a register file 24, PE memory 26 and an Input/Output (I/O) unit 28. The SIMD array 19 operates in a synchronous manner, where each PE 20 executes the same instruction at the same time as the other PEs, but using data specific to the individual PE. Each of these execution units in the PE performs a specific task: the ALU 22 performs arithmetic functions, the register file 24 stores data for use by the ALU 22 and for transfer to and from the internal PE memory 26, and the I/O unit 28 handles data transfer between the PE memory 26 and external memory (not shown). PE data is stored in PE memory 26 and is transferred to the ALU 22 using the register file 24. The array controller 12 issues instructions to the array 19 that cause data to be transferred between the I/O unit 28, the PE memory 26, the register file 24 and the ALU 22, as well as instructions to operate on the data in the ALU 22.
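By way of illustration only, the synchronous operation described above can be sketched in software. The following Python model is a simplified assumption and not part of the architecture of FIG. 1: the class name `ProcessingElement`, the function `broadcast`, the three-operation instruction set (`load`, `mul`, `store`) and the register names are all hypothetical, and the I/O unit and external memory are omitted. It shows only that every PE receives the same instruction while operating on its own data.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingElement:
    """Hypothetical software stand-in for one PE (not the actual hardware)."""
    memory: list                                    # stands in for the PE memory
    registers: dict = field(default_factory=dict)   # stands in for the register file

    def execute(self, op, *args):
        if op == "load":        # PE memory -> register file
            reg, addr = args
            self.registers[reg] = self.memory[addr]
        elif op == "mul":       # ALU operation on two registers
            dst, a, b = args
            self.registers[dst] = self.registers[a] * self.registers[b]
        elif op == "store":     # register file -> PE memory
            reg, addr = args
            self.memory[addr] = self.registers[reg]
        else:
            raise ValueError(f"unknown operation: {op}")


def broadcast(array, program):
    """Stand-in for the array controller: issue each instruction to every PE."""
    for instruction in program:
        for pe in array:        # in hardware this step is simultaneous
            pe.execute(*instruction)


# Four PEs, each holding different data in its own memory.
array = [ProcessingElement(memory=[i, i + 1, 0]) for i in range(4)]
program = [
    ("load", "r0", 0),
    ("load", "r1", 1),
    ("mul", "r2", "r0", "r1"),
    ("store", "r2", 2),
]
broadcast(array, program)
results = [pe.memory[2] for pe in array]
```

In this sketch each PE ends with the product of its own two data items in its own memory, although all PEs were given an identical instruction sequence.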
There are some disadvantages with the previously considered architecture. The instructions are held in a single queue to be executed by the PEs, which can cause substantial delays in processing. Also, during the execution of any one instruction, only one execution unit of each PE 20 is occupied. For example, if the instruction is to multiply two numbers together, only the ALU 22 of each PE 20 is working. Alternatively, if the instruction is to fetch a data item from an external memory, then only the I/O unit 28 of each PE 20 is working.
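The effect of a single queue on the occupancy of the execution units can be illustrated with a small calculation. The sketch below is hypothetical: the function `unit_utilisation`, the unit labels and the cycle counts are assumed purely for illustration and are not figures from the architecture described. It assumes that instructions are executed strictly one after another and that each instruction occupies exactly one execution unit.

```python
def unit_utilisation(queue):
    """Return total cycles and the busy fraction of each execution unit.

    queue is a list of (unit, cycles) pairs executed strictly in order,
    so exactly one unit is busy at any time (an assumed simplification).
    """
    total = sum(cycles for _, cycles in queue)
    busy = {}
    for unit, cycles in queue:
        busy[unit] = busy.get(unit, 0) + cycles
    return total, {unit: b / total for unit, b in busy.items()}


# Assumed cycle counts: an external fetch is taken to be slower than a multiply.
queue = [("io", 8), ("alu", 2), ("alu", 2), ("io", 8)]
total_cycles, utilisation = unit_utilisation(queue)
```

Because only one unit is busy at a time in this model, the busy fractions always sum to one. With the assumed figures the ALU is idle for most of the run while the fetches complete.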
It is therefore desirable to provide techniques which can overcome these disadvantages.