The technology described herein relates to data processing systems, and in particular to the operation of graphics processing systems that include one or more programmable processing stages (“shaders”).
As is known in the art, graphics processing is typically carried out in a pipelined fashion, with one or more pipeline stages operating on the data to generate the final render output, e.g. a frame that is displayed. Many graphics processing pipelines now include one or more programmable processing stages, commonly referred to as “shaders”. For example, a graphics processing pipeline may include one or more of, and typically all of, a geometry shader, a vertex shader and a fragment (pixel) shader. These shaders are programmable processing stages that execute shader programs on input data values to generate a desired set of output data (e.g. appropriately transformed and lit vertex data in the case of a vertex shader) for processing by the rest of the graphics pipeline and/or for output. The shaders of the graphics processing pipeline may share programmable processing circuitry, or they may each be distinct programmable processing units.
A graphics processing unit (GPU) shader core is thus a processing unit that performs graphics processing by running small programs for each graphics item in a graphics output to be generated, such as a render target, e.g. a frame (an “item” in this regard is usually a vertex or a fragment (pixel)). This generally enables a high degree of parallelism, in that a typical render output, e.g. a frame, contains a large number of vertices and fragments, each of which can be processed independently.
As is known in the art, a shader program to be executed by a given “shader” of a graphics processing pipeline will be provided by the application that requires the graphics processing using a high-level shader programming language, such as GLSL, HLSL, OpenCL, etc. This shader program will consist of “expressions” indicating desired programming steps defined in the relevant language standards (specifications). The high-level shader program is then translated by a shader language compiler to binary code for the target graphics processing pipeline. This binary code will consist of “instructions” which are specified in the instruction set specification for the given target graphics processing pipeline. The compilation process for converting the shader language expressions to binary code instructions may take place via a number of intermediate representations of the program within the compiler, as is known in the art. Thus the program written in the high-level shader language may be translated into a compiler specific intermediate representation (and there may be several successive intermediate representations within the compiler), with the final intermediate representation being translated into the binary code instructions for the target graphics processing pipeline.
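By way of illustration only, the staged translation described above can be sketched as follows. This is a toy example under stated assumptions, not the operation of any particular shader compiler: a Python arithmetic expression stands in for a high-level shader language expression, the three-address tuples stand in for a compiler intermediate representation, and the mnemonics `FADD`, `FSUB` and `FMUL` are hypothetical rather than taken from any real instruction set specification.

```python
import ast

# Hypothetical names for the intermediate representation and for the
# target "instructions"; a real compiler and target define their own.
_IR_OPS = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul"}
_MNEMONICS = {"add": "FADD", "sub": "FSUB", "mul": "FMUL"}


def lower_to_ir(expression):
    """Lower an arithmetic expression to three-address tuples (op, dst, a, b)."""
    ir = []

    def visit(node):
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _IR_OPS:
            a = visit(node.left)
            b = visit(node.right)
            dst = f"t{len(ir)}"  # fresh temporary for this result
            ir.append((_IR_OPS[type(node.op)], dst, a, b))
            return dst
        raise ValueError("unsupported expression")

    visit(ast.parse(expression, mode="eval").body)
    return ir


def select_instructions(ir):
    """Translate the final intermediate representation to target instructions."""
    return [f"{_MNEMONICS[op]} {dst}, {a}, {b}" for op, dst, a, b in ir]


ir = lower_to_ir("a * b + c")
instructions = select_instructions(ir)
```

A real compiler would typically run optimisation passes over each intermediate representation and emit encoded binary words rather than text; the sketch shows only the expression-to-instruction staging.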
A known way to improve shader execution efficiency is to group execution threads (where each thread corresponds to one vertex or one fragment (pixel)) into “groups” or “bundles” of threads, where the threads of one group are run in lockstep, one instruction at a time. This way, it is possible to share instruction fetch and scheduling resources between all the threads in the group. (Other terms used for such thread groups include “warps” and “wavefronts”. For convenience the term thread group will be used herein, but this is intended to encompass all equivalent terms and arrangements, unless otherwise indicated.)
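The saving in instruction fetches can be illustrated with a minimal sketch. The function name, the workload figures and the assumption of straight-line code (every thread in a group follows the same instruction sequence) are illustrative only and are not drawn from any particular architecture.

```python
def instruction_fetches(num_threads, program_length, group_size):
    """Count instruction fetches when threads run in lockstep groups.

    Assumption: straight-line code with no divergence, so each group
    fetches every instruction once and issues it to all of its threads.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    num_groups = -(-num_threads // group_size)  # ceiling division
    return num_groups * program_length


# Hypothetical workload: 1024 fragments running a 20-instruction shader.
ungrouped = instruction_fetches(1024, 20, group_size=1)
grouped = instruction_fetches(1024, 20, group_size=4)
```

Under these assumptions a group size of four reduces the number of fetches by a factor of four, since one fetch serves every thread in the group.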
Execution units in a thread group architecture are typically deeply pipelined. They are normally kept at high utilization by alternating execution between multiple thread groups on a per-clock-cycle basis.
In existing thread group architectures, execution of instructions in a group is done in a strictly in-order manner: once a thread group has begun executing an instruction, it cannot proceed to the next instruction until the current instruction has completed.
This is not normally a problem for arithmetic instructions, as these typically have fairly short latencies. The GPU normally has more than enough thread groups in flight to maintain good utilization for arithmetic-only code.
However, it can be much more of a problem for instructions with nondeterministic latency. Examples of such instructions include memory accesses (loads, stores and atomics) and texture lookups. Memory accesses may have low latency in the case of a cache hit, but tend to exhibit quite high latencies in the case of cache misses. This is particularly the case for loads and atomics (for stores, it is often possible to mask latency by performing write gathering). Texture lookups tend to feature memory accesses with relatively high cache miss rates, as well as a large number of fixed-function processing stages both before and after the memory accesses.
In an in-order thread group-based architecture, a few cache misses can easily cause most or even all of the thread groups to become blocked while waiting for the cache misses to be resolved. This is sometimes also aggravated by multiple thread groups producing cache misses for the same memory region.
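The effect of long-latency instructions on an in-order arrangement can be illustrated with a toy model. The sketch assumes a single execution unit that issues at most one instruction per cycle, round-robin selection among ready thread groups, and per-instruction latencies supplied by the caller; the function name and the latency figures are hypothetical and chosen only for illustration.

```python
def simulate_in_order(programs):
    """Simulate strictly in-order issue from several thread groups.

    programs: one list of instruction latencies (in cycles) per thread
    group. A group may not issue its next instruction until its current
    one has completed. Returns (cycles until the last issue, utilization).
    """
    n = len(programs)
    total = sum(len(p) for p in programs)
    if total == 0:
        return 0, 0.0
    pc = [0] * n          # next instruction index for each group
    ready_at = [0] * n    # cycle at which each group may issue again
    issued = 0
    cycle = 0
    next_group = 0
    while issued < total:
        for offset in range(n):
            g = (next_group + offset) % n
            if pc[g] < len(programs[g]) and ready_at[g] <= cycle:
                ready_at[g] = cycle + programs[g][pc[g]]
                pc[g] += 1
                issued += 1
                next_group = (g + 1) % n
                break
        cycle += 1  # the cycle passes whether or not anything issued
    return cycle, issued / cycle


# Four groups of arithmetic-only code (latency 4) versus the same code
# with one hypothetical 100-cycle cache miss per group.
arithmetic_only = simulate_in_order([[4, 4, 4, 4]] * 4)
with_misses = simulate_in_order([[4, 100, 4, 4]] * 4)
```

In this toy model the arithmetic-only case keeps the execution unit fully utilized, whereas a single 100-cycle instruction in each group leaves every group blocked at once and reduces utilization to 16/112, roughly 14%.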
Also, due to the lockstep execution of threads, it is possible to get multiple cache misses for a single thread group instruction. This happens when the individual threads in a group get cache misses in different cache lines. In this case, the group cannot proceed until all the cache misses have been resolved, which may take a long time.
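A minimal sketch of this situation follows, assuming a 64-byte cache line and byte addresses supplied per thread; the function name, the line size and the example addresses are illustrative assumptions only.

```python
def missing_lines(addresses, cached_lines, line_size=64):
    """Return the cache lines that a single group-wide load misses on.

    addresses: one byte address per thread in the group.
    cached_lines: set of line indices currently present in the cache.
    Under lockstep execution the group can proceed only once this list
    is empty, i.e. once every miss has been resolved.
    """
    touched = {addr // line_size for addr in addresses}
    return sorted(touched - set(cached_lines))


# Four threads reading adjacent words share one line: at most one miss.
adjacent = missing_lines([0, 4, 8, 12], cached_lines=set())

# Four threads reading widely separated addresses touch four lines,
# so one group instruction produces four separate misses.
scattered = missing_lines([0, 4096, 8192, 12288], cached_lines=set())
```

The second case is the one described above: a single instruction for the group waits on several independent cache misses rather than one.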
There sometimes also arises a “bunching” phenomenon, where large numbers of thread groups form a “bunch”. This “bunch” then tends to move as a whole between different types of execution units, with the result that the execution unit currently occupied by the “bunch” is executing at full rate while other execution units stand idle. Which execution units stand idle then shifts as the “bunch” executes. This gives fairly poor execution unit utilization.
The Applicants believe therefore that there remains scope for improvements to the handling of thread groups, for example in graphics processing pipelines that include one or more shader stages.
Like reference numerals are used for like components where appropriate in the drawings.