Instruction-level latency can vary widely across instruction types in SIMD machines such as graphics processing units (GPUs). For example, on a modern GPU, 3×3 tensor operations (i.e., matrix multiplies) may take ten (10) clock cycles to execute and 4×4 matrix multiply operations might take fifty (50) cycles, while other instructions, such as multiply operations, might require only one (1) cycle. As more high-latency instructions are added to instruction sets and used in programs, their latency becomes a system-level performance bottleneck which may cause the number of instructions per cycle (IPC) to decline. Accordingly, techniques to manage instruction latency in SIMD machines may find utility, e.g., in graphics processing.
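
The effect of high-latency instructions on IPC can be illustrated with a minimal sketch. It assumes a deliberately simplified, fully serialized model in which each instruction must complete before the next one issues; an actual GPU may overlap instructions through pipelining and multithreading, so the numbers are illustrative only. The cycle counts are the example figures given above, and the names `LATENCY` and `serialized_ipc` are hypothetical.

```python
# Example latencies in clock cycles, taken from the figures quoted above.
LATENCY = {"mul": 1, "mat3x3": 10, "mat4x4": 50}


def serialized_ipc(instructions):
    """Return IPC under an assumed model with no overlap between instructions."""
    total_cycles = sum(LATENCY[op] for op in instructions)
    return len(instructions) / total_cycles


# 100 single-cycle multiplies versus the same count with 10 replaced by 4x4 multiplies.
baseline = ["mul"] * 100
mixed = ["mul"] * 90 + ["mat4x4"] * 10

print(serialized_ipc(baseline))  # 1.0
print(serialized_ipc(mixed))     # about 0.17 (100 instructions / 590 cycles)
```

Under this assumed model, replacing one in ten single-cycle instructions with a fifty-cycle instruction reduces IPC from 1.0 to roughly 0.17.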