Graphics processing involves performing rapid mathematical calculations for image rendering. Such graphics workloads may be performed at a graphics processing unit (GPU), a specialized electronic circuit that rapidly manipulates and alters memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs may also be used for general-purpose computing on GPUs (GPGPU), performing computations traditionally handled by a central processing unit (CPU). Accordingly, GPGPUs may be implemented to execute single instruction, multiple data (SIMD) instructions.
GPGPU architectures with physically narrow SIMD widths often execute instructions by folding a logical SIMD width into multiple back-to-back passes through the narrow physical channels. For example, a SIMD16 instruction may be executed in floating point units (FPUs) by issuing four back-to-back SIMD4 operations. The advantage of operating in narrower physical channels is that, when consecutive passes carry similar data (e.g., when processing pixel data such as RGBA), data toggles are suppressed, which saves dynamic power in high-power logic circuitry such as an FPU. While this provides a power advantage, this type of architecture is detrimental to area efficiency because the same controllers are repeated for each four-wide SIMD unit.
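The folding described above can be illustrated with a small software model. The following Python sketch is illustrative only and does not describe any particular hardware: the names `PHYSICAL_WIDTH`, `fold`, `execute_folded_add`, and `count_toggles` are hypothetical, lane values are assumed to be 8-bit integers rather than floating point operands, and a "toggle" is modeled simply as a bit that differs on the same physical lane between two consecutive passes.

```python
PHYSICAL_WIDTH = 4  # assumed physical SIMD width in lanes (hypothetical)


def fold(operands, physical_width=PHYSICAL_WIDTH):
    """Split a logical SIMD operand into back-to-back physical-width passes."""
    if len(operands) % physical_width != 0:
        raise ValueError("logical width must be a multiple of the physical width")
    return [operands[i:i + physical_width]
            for i in range(0, len(operands), physical_width)]


def execute_folded_add(a, b, physical_width=PHYSICAL_WIDTH):
    """Model a logical SIMD add issued as consecutive narrow passes."""
    result = []
    for pass_a, pass_b in zip(fold(a, physical_width), fold(b, physical_width)):
        # Each loop iteration stands in for one pass through the narrow unit.
        result.extend(x + y for x, y in zip(pass_a, pass_b))
    return result


def count_toggles(passes):
    """Count bits that flip on each physical lane between consecutive passes."""
    toggles = 0
    for previous, current in zip(passes, passes[1:]):
        for p, c in zip(previous, current):
            toggles += bin(p ^ c).count("1")
    return toggles


# Four identical RGBA pixels, interleaved: every SIMD4 pass sees the same values.
interleaved = [200, 100, 50, 255] * 4
# The same values laid out channel by channel: each pass sees a different value.
planar = [200] * 4 + [100] * 4 + [50] * 4 + [255] * 4

interleaved_toggles = count_toggles(fold(interleaved))
planar_toggles = count_toggles(fold(planar))
logical_sum = execute_folded_add(list(range(16)), [1] * 16)
```

In this model the interleaved RGBA operand produces no lane toggles across the four SIMD4 passes, while the planar layout of the same values does, which mirrors the power argument above. The model says nothing about controller area, which is the cost side of the trade-off.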