Traditional microprocessors issue and execute instructions one at a time, in sequence. Each instruction typically performs a single operation on two scalar values and produces one result. Single-issue variants issue one instruction per clock cycle, which is then processed by one of the execution units. The execution units typically include at least an adder, a multiplier, a load/store unit, and a branch unit. Such processors run a single program thread at a time and are therefore classed as single-threaded processors, although an operating system may create the illusion of multiple simultaneous threads by configuring the processor to switch between threads at regular intervals. Although these processors have low performance, they also occupy a small silicon area and therefore offer reasonable performance per silicon area.
Other processor variants issue and execute multiple instructions at the same time. These multiple-issue variants look ahead in the instruction stream to find instructions that can be processed in parallel by the different execution units. To increase performance, such a processor may also have multiple instances of selected execution units. This results in fast execution of each program thread. However, dependencies between instructions in a thread limit the number of instructions that can be executed in parallel, leaving execution units unused, and the logic required to extract the parallel instructions consumes a significant amount of silicon area and power. The logic that routes values to and from the execution units is also significant. The result is poor efficiency, measured in performance per silicon area and performance per watt.
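The effect of instruction dependencies on a multiple-issue processor can be illustrated with a minimal sketch. It is not a model of any particular processor: it assumes in-order issue, a one-cycle latency for every instruction, and a hypothetical helper named cycles_to_issue, none of which come from the text above.

```python
def cycles_to_issue(deps, issue_width):
    """Count the cycles needed to issue a straight-line instruction sequence.

    deps[i] is the set of earlier instruction indices whose results
    instruction i needs. Assumptions (illustrative only): in-order issue,
    and a result is usable in the cycle after its producer issues.
    """
    issue_cycle = {}
    cycle = 0
    i = 0
    while i < len(deps):
        issued = 0
        # Issue consecutive instructions until the issue width is reached
        # or the next instruction still waits on a result from this cycle.
        while (i < len(deps) and issued < issue_width
               and all(issue_cycle[d] < cycle for d in deps[i])):
            issue_cycle[i] = cycle
            i += 1
            issued += 1
        cycle += 1
    return cycle


# Four independent instructions versus a four-instruction dependency chain.
independent = [set(), set(), set(), set()]
chain = [set(), {0}, {1}, {2}]

print(cycles_to_issue(independent, issue_width=2))  # 2
print(cycles_to_issue(chain, issue_width=2))        # 4
```

Under these assumptions the dependency chain takes as many cycles on the dual-issue configuration as it would on a single-issue one, so the second issue slot and its supporting logic go unused.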
One type of processor that can achieve higher performance per silicon area is the Single Instruction Multiple Data processor (SIMD processor). This type of processor operates on fixed-width vectors rather than scalar values. Each instruction performs its operation on multiple scalars at a time, using vectorized execution units that are constructed from an array of scalar units arranged in separate lanes. SIMD processors can be single-issue or multiple-issue. However, the programmer or source-language compiler often cannot express the operation to be performed in terms of vectors, and in many cases the resulting code uses only one lane of the vectorized execution units.
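The difference between vectorized code and code that falls back to a single lane can be sketched as follows. The lane count of 4 and the function names are assumptions chosen for illustration; the sketch only counts instructions and lane usage and does not model real hardware.

```python
LANES = 4  # assumed vector width, for illustration only


def simd_add(a, b):
    """One vector instruction: add corresponding elements, one per lane."""
    return [x + y for x, y in zip(a, b)]


def vector_add(a, b):
    """Add two arrays using one vector instruction per LANES elements."""
    out, instructions = [], 0
    for i in range(0, len(a), LANES):
        out.extend(simd_add(a[i:i + LANES], b[i:i + LANES]))
        instructions += 1
    return out, instructions


def scalar_fallback_add(a, b):
    """Same result, but each instruction carries data in only one lane."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.extend(simd_add([x], [y]))  # the other LANES - 1 lanes sit idle
        instructions += 1
    return out, instructions


def lane_utilization(elements, instructions):
    """Fraction of available lane slots that did useful work."""
    return elements / (instructions * LANES)


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]

vec_result, vec_instr = vector_add(a, b)
sca_result, sca_instr = scalar_fallback_add(a, b)

print(vec_instr, lane_utilization(len(a), vec_instr))  # 2 1.0
print(sca_instr, lane_utilization(len(a), sca_instr))  # 8 0.25
```

Both paths produce the same sums, but the single-lane path issues four times as many instructions and leaves three quarters of the lane slots idle in this example.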
Another type of processor that can achieve higher performance per silicon area is the Very Long Instruction Word processor (VLIW processor), in which each instruction describes the operation of all the execution units in the processor. In this way, all the execution units can operate every cycle without the need for multiple-issue hardware.
The simplest multiple-issue, SIMD and VLIW processors run a single thread at a time, and may therefore be referred to as single-threaded processors. Coherent Vector Threaded Processors are similar to SIMD processors in that multiple parallel registers are hardwired to multiple parallel execution units arranged in lanes, but differ in that each lane executes a separate program thread. The bandwidth required for instruction fetch is lower than for the other types of processors, since multiple threads execute the same instruction from the instruction fetch unit. Parallelism is achieved by executing multiple threads in lock-step, so the simplest form of single-issue instruction sequencer is sufficient for good performance and efficiency, though Coherent Vector Threaded Processor architectures with multiple-issue and VLIW are also possible. The threads start out at the same program location, but may branch to multiple different locations, an event known as divergence. During divergence, the threads split into groups of one or more threads, with the threads in each group branching to the same program location. The processor can issue instructions for only a limited number of divergent threads simultaneously, but prioritizes which threads to issue instructions for so that the threads end up at the same program location again, an event known as reconvergence. These processors typically enforce such reconvergent execution over a fixed number of threads, known as a warp or wavefront.
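Divergence and reconvergence can be illustrated with a small lock-step simulation. The sketch rests on several assumptions that are not taken from the text above: a made-up four-operation instruction set, a warp of four threads, issue to exactly one group of threads per cycle, and a scheduling policy that always issues for the group at the lowest program location. Other prioritization policies are possible.

```python
def run_warp(program, num_threads):
    """Run threads in lock-step, issuing one instruction per cycle.

    Assumed policy (illustrative only): each cycle, issue for the group of
    live threads that share the lowest program counter. Threads that fell
    behind therefore catch up, and the groups merge again.
    """
    pc = [0] * num_threads
    acc = [0] * num_threads
    halted = [False] * num_threads
    trace = []  # (program location, thread ids that executed it)
    while not all(halted):
        live = [t for t in range(num_threads) if not halted[t]]
        issue_pc = min(pc[t] for t in live)
        group = [t for t in live if pc[t] == issue_pc]
        op = program[issue_pc]
        trace.append((issue_pc, group))
        for t in group:
            if op[0] == "add":
                acc[t] += op[1]
                pc[t] += 1
            elif op[0] == "jump":
                pc[t] = op[1]
            elif op[0] == "branch_if_odd":
                # Odd-numbered threads take the branch: divergence.
                pc[t] = op[1] if t % 2 else pc[t] + 1
            elif op[0] == "halt":
                halted[t] = True
    return acc, trace


program = [
    ("branch_if_odd", 3),  # 0: threads diverge here
    ("add", 10),           # 1: even-thread path
    ("jump", 4),           # 2
    ("add", 20),           # 3: odd-thread path
    ("add", 1),            # 4: reconvergence point
    ("halt",),             # 5
]

acc, trace = run_warp(program, num_threads=4)
for location, group in trace:
    print(location, group)
```

In the trace, locations 1 to 3 are each executed by only half of the threads, while locations 0, 4 and 5 are executed by all four, which shows the warp diverging at the branch and reconverging at location 4.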
Although these processors provide useful data processing functionality, they each have their own shortcomings. Accordingly, it is desired to provide improved data processing techniques.