Shaders written in a shader language expose parallelism through the programming model of that language. A scalar shader program is written for a single element (e.g., a pixel, a vertex, or a thread), but several independent elements can be processed by the same program simultaneously. GPU (graphics processing unit) hardware is designed to accommodate this programming or execution model. When a software (CPU-based) rasterizer is used in place of a GPU, however, the software rasterizer must pack independent computations efficiently to deliver reasonable performance on a CPU. That is, a shader program transformed for CPU execution should exploit the vector instructions available in most modern CPUs to attain up to a W-fold increase in performance, where W is the vector width of the CPU. Such packing will be referred to herein as vectorization, and may involve both transforming the original program to a suitable form (described herein) and properly laying out resources in memory.
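As an illustration only, the following C++ sketch shows what packing might look like for a trivial, branch-free shader. The names `shade_scalar`, `shade_packet`, `shade_all`, and `Packet`, the shader body itself, and the choice of W = 4 are all assumptions made for this example and do not come from any particular rasterizer. The inputs are assumed to be stored as separate contiguous arrays (a structure-of-arrays layout) so that W consecutive elements can be loaded together.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed vector width for this sketch (e.g., 4 floats in a 128-bit register).
constexpr std::size_t W = 4;

// Hypothetical scalar shader: written for a single element.
float shade_scalar(float x, float y) { return x * 2.0f + y; }

// A packet holds the same shader variable for W independent elements.
using Packet = std::array<float, W>;

// Vectorized form of the same program: each operation is applied to all
// W lanes. The fixed-trip-count loop is a candidate for a single CPU
// vector instruction once compiled.
Packet shade_packet(const Packet& x, const Packet& y) {
    Packet r{};
    for (std::size_t lane = 0; lane < W; ++lane)
        r[lane] = x[lane] * 2.0f + y[lane];
    return r;
}

// Inputs are laid out as separate contiguous arrays, so each group of W
// consecutive elements forms one packet.
std::vector<float> shade_all(const std::vector<float>& xs,
                             const std::vector<float>& ys) {
    // Simplifying assumption: the element count is a multiple of W.
    assert(xs.size() == ys.size() && xs.size() % W == 0);
    std::vector<float> out(xs.size());
    for (std::size_t i = 0; i < xs.size(); i += W) {
        Packet x{}, y{};
        for (std::size_t lane = 0; lane < W; ++lane) {
            x[lane] = xs[i + lane];
            y[lane] = ys[i + lane];
        }
        const Packet r = shade_packet(x, y);
        for (std::size_t lane = 0; lane < W; ++lane)
            out[i + lane] = r[lane];
    }
    return out;
}
```

In this sketch the packed version issues one packet operation for every W elements, which is where the up-to-W-fold speedup would come from if each packet operation maps to a single vector instruction.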
Vectorization of shader code compiled for a GPU (i.e., intermediate representation (IR) code, bytecode, etc.) is non-trivial in the presence of control flow logic, especially for compute shaders, because execution may diverge among elements that are processed together. The task is further complicated by two practical requirements: the vectorization algorithm should run quickly, and it should not overly increase the size of the IR code, so that the vectorized IR code can be just-in-time (JIT) compiled to native executable machine code. In addition, the vectorized IR code should be suitable for traditional compiler optimizations.
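To make the divergence problem concrete, the following C++ sketch shows a scalar shader with a branch and one common way of handling it in packed form: a per-lane mask records which elements take the branch, and each side of the branch writes only to its active lanes. This is a generic masking (predication) scheme given under assumed names (`shade_scalar`, `shade_packet`, `Mask`) and an assumed W = 4; it is not presented as the specific technique described herein.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumed vector width for this sketch.
constexpr std::size_t W = 4;
using Packet = std::array<float, W>;
using Mask = std::array<bool, W>;

// Hypothetical scalar shader with control flow: elements may take
// different paths depending on their own input.
float shade_scalar(float x) {
    if (x > 0.0f) return x * 2.0f;
    return -x;
}

// Packed form: lanes that disagree on the condition cannot share one
// branch, so the condition becomes a mask and each side is executed
// only for the lanes that selected it.
Packet shade_packet(const Packet& x) {
    Mask take_then{};
    bool any_then = false;
    bool any_else = false;
    for (std::size_t lane = 0; lane < W; ++lane) {
        take_then[lane] = x[lane] > 0.0f;
        any_then = any_then || take_then[lane];
        any_else = any_else || !take_then[lane];
    }

    Packet r{};
    if (any_then) {  // skip the whole side when no lane needs it
        for (std::size_t lane = 0; lane < W; ++lane)
            if (take_then[lane]) r[lane] = x[lane] * 2.0f;
    }
    if (any_else) {
        for (std::size_t lane = 0; lane < W; ++lane)
            if (!take_then[lane]) r[lane] = -x[lane];
    }
    return r;
}
```

When the lanes of a packet diverge, both sides of the branch are executed, which illustrates why a naive transformation can cost performance and enlarge the code, and why nested branches and loops make the task harder still.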
Techniques related to efficient vectorization of IR code compiled from shader language code while assuring correctness are discussed below.