The technology described herein relates to data processing systems, and in particular to the operation of graphics processing systems that include one or more programmable processing stages (“shaders”).
As is known in the art, graphics processing is typically carried out in a pipelined fashion, with one or more pipeline stages operating on the data to generate the final render output, e.g. frame that is displayed. Many graphics processing pipelines now include one or more programmable processing stages, commonly referred to as “shaders”. For example, a graphics processing pipeline may include one or more of, and typically all of, a geometry shader, a vertex shader and a fragment (pixel) shader. These shaders are programmable processing stages that execute shader programs on input data values to generate a desired set of output data (e.g. appropriately transformed and lit vertex data in the case of a vertex shader) for processing by the rest of the graphics pipeline and/or for output. The shaders of the graphics processing pipeline may share programmable processing circuitry, or they may each be distinct programmable processing units.
A graphics processing unit (GPU) shader core is thus a processing unit that performs graphics processing by running small programs for each graphics item in a graphics output to be generated, such as a render target, e.g. a frame. An “item” in this regard is usually a vertex or a fragment (pixel). This generally enables a high degree of parallelism, in that a typical render output, e.g. a frame, features a large number of vertices and fragments, each of which can be processed independently.
As is known in the art, a shader program to be executed by a given “shader” of a graphics processing pipeline will be provided by the application that requires the graphics processing using a high-level shader programming language, such as GLSL, HLSL, OpenCL, etc. This shader program will consist of “expressions” indicating desired programming steps defined in the relevant language standards (specifications). The high-level shader program is then translated by a shader language compiler to binary code for the target graphics processing pipeline. This binary code will consist of “instructions” which are specified in the instruction set specification for the given target graphics processing pipeline. The compilation process for converting the shader language expressions to binary code instructions may take place via a number of intermediate representations of the program within the compiler, as is known in the art. Thus the program written in the high-level shader language may be translated into a compiler specific intermediate representation (and there may be several successive intermediate representations within the compiler), with the final intermediate representation being translated into the binary code instructions for the target graphics processing pipeline.
A known way to improve shader execution efficiency is to group execution threads (where each thread corresponds to one vertex or one fragment (pixel)) into “groups” or “bundles” of threads, where the threads of one group are run in lockstep, one instruction at a time. This way, it is possible to share instruction fetch and scheduling resources between all the threads in the group. (Other terms used for such thread groups include “warps” and “wavefronts”. For convenience the term thread group will be used herein, but this is intended to encompass all equivalent terms and arrangements, unless otherwise indicated.)
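By way of illustration only, the lockstep execution of a thread group can be sketched as follows. This is a simplified software model and not a description of any particular hardware: the function name `run_lockstep`, the representation of each instruction as a Python callable, and the single register per thread are all assumptions made for the example. The sketch shows that one instruction fetch serves every thread in the group.

```python
def run_lockstep(program, thread_inputs):
    """Run every thread of one group through `program`, one instruction at a time.

    `program` is a list of callables (hypothetical stand-ins for instructions);
    `thread_inputs` holds one starting register value per thread.
    """
    registers = list(thread_inputs)
    fetches = 0
    for instruction in program:
        fetches += 1  # a single fetch is shared by all threads in the group
        registers = [instruction(r) for r in registers]
    return registers, fetches


# A three-instruction program run by a group of four threads.
program = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]
outputs, fetches = run_lockstep(program, [0, 1, 2, 3])
```

In this model the group of four threads needs three fetches in total, whereas four independently scheduled threads would need twelve.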
It is common in modern graphics processing units to have instructions to perform “atomic” memory operations. An “atomic” memory operation is an operation sequence that reads a memory location, performs an arithmetic operation between the memory value and a register value, and then writes the result of the arithmetic operation back to the same memory location. This sequence of operations is carried out so that to every observer, it looks as if the sequence has either not been performed at all, or been performed in its entirety. It is executed as one indivisible unit, hence the name “atomic”.
The arithmetic operation that is performed as part of the atomic is usually a simple operation that is both commutative and associative. Common examples of such operations are: Integer Add; Bitwise AND, OR, XOR; Minimum Value; Maximum Value.
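The read-modify-write sequence described above can be modelled in software as in the following minimal sketch. The class `AtomicCell` and the helper `final_value` are hypothetical names introduced for illustration, and a lock stands in for whatever mechanism a given GPU uses to make the sequence indivisible. The sketch also shows why commutative and associative operations are convenient: the final memory value does not depend on the order in which the atomics are applied.

```python
import operator
import threading
from itertools import permutations


class AtomicCell:
    """Hypothetical model of one memory location with atomic read-modify-write."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # assumption: stands in for hardware atomicity

    def rmw(self, op, register_value):
        """Read memory, compute op(memory, register), write back; return the old value."""
        with self._lock:
            old = self._value
            self._value = op(old, register_value)
            return old

    def load(self):
        with self._lock:
            return self._value


def final_value(initial, op, operands):
    """Apply one atomic per operand, in the given order, and return the final memory value."""
    cell = AtomicCell(initial)
    for register_value in operands:
        cell.rmw(op, register_value)
    return cell.load()


operands = [3, 5, 9]
# Every ordering of the same atomics leaves the same value in memory.
add_results = {final_value(10, operator.add, p) for p in permutations(operands)}
max_results = {final_value(4, max, p) for p in permutations(operands)}
```

Each set contains a single element, because Integer Add and Maximum Value give the same result for every ordering of the three atomics.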
Other atomic operations that are common but not commutative/associative are:
Swap (also known as Exchange): the register value is written to the memory location and the old value of the memory location is then returned to the shader;
Compare-And-Swap (also known as Compare-and-Exchange): two register values are provided, the memory value is compared to the first of them, and if they compare equal, then the second register value is written to the memory location; otherwise the memory location is left unchanged.
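The two operations listed above can be sketched in the same style. The class and method names are again hypothetical, and the lock is a stand-in for hardware atomicity. Returning the old memory value from `compare_and_swap` is an assumption of this sketch (it corresponds to a “Return” style variant); the text above does not specify what that operation returns.

```python
import threading


class AtomicCell:
    """Hypothetical model of one memory location supporting Swap and Compare-And-Swap."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # assumption: stands in for hardware atomicity

    def swap(self, new_value):
        """Write new_value and return the value that was there before."""
        with self._lock:
            old = self._value
            self._value = new_value
            return old

    def compare_and_swap(self, expected, new_value):
        """Write new_value only if memory equals expected; return the old value (assumed)."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new_value
            return old

    def load(self):
        with self._lock:
            return self._value


cell = AtomicCell(7)
old_after_swap = cell.swap(11)              # memory: 7 -> 11
failed_cas = cell.compare_and_swap(7, 99)   # 11 != 7, so nothing is written
value_after_failed_cas = cell.load()
passed_cas = cell.compare_and_swap(11, 42)  # 11 == 11, so 42 is written

# Swap is order dependent: the last swap to be applied determines the final value.
a = AtomicCell(0)
a.swap(1)
a.swap(2)
b = AtomicCell(0)
b.swap(2)
b.swap(1)
```

The final values held by `a` and `b` differ although the same two swaps were issued, which is why Swap is not grouped with the commutative operations.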
Atomic memory operations typically occur in two variants: “Return” and “NoReturn”. The “Return” variants return to the shader the datum that was present at the memory location before the “atomic” operation was performed. The “NoReturn” variants do not return any value to the shader. Typically, in a GPU, all of the operations listed above are available in both variants (except Swap, which is not available in the NoReturn variant).
Typical examples of use-cases for GPU atomics include incremental memory allocators (Integer Add of the “Return” type), histograms (Integer Add of the “NoReturn” type) and bounding-box computation (Minimum/Maximum Value of the “NoReturn” type).
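The three use-cases mentioned above can be illustrated with the following minimal sketch, in which memory is modelled as a Python list and a single global lock stands in for atomicity. All function names are hypothetical, and the atomics are issued one after another here rather than from concurrent threads, so that the values shown are deterministic.

```python
import threading

_lock = threading.Lock()  # assumption: stands in for hardware atomicity


def atomic_add_return(mem, addr, value):
    """Integer Add, "Return" variant: yields the value held before the add."""
    with _lock:
        old = mem[addr]
        mem[addr] = old + value
        return old


def atomic_add_noreturn(mem, addr, value):
    """Integer Add, "NoReturn" variant: nothing is returned to the caller."""
    with _lock:
        mem[addr] += value


def atomic_min_noreturn(mem, addr, value):
    with _lock:
        mem[addr] = min(mem[addr], value)


def atomic_max_noreturn(mem, addr, value):
    with _lock:
        mem[addr] = max(mem[addr], value)


# Incremental allocator: each request receives the old top of the heap as its offset.
heap_top = [0]
offsets = [atomic_add_return(heap_top, 0, size) for size in (16, 32, 8)]

# Histogram: each sample increments its bin; the old count is not needed.
histogram = [0, 0, 0, 0]
for sample in (0, 2, 2, 3, 2):
    atomic_add_noreturn(histogram, sample, 1)

# Bounding box (one dimension): slot 0 tracks the minimum, slot 1 the maximum.
bounds = [float("inf"), float("-inf")]
for x in (4.0, -1.5, 7.25):
    atomic_min_noreturn(bounds, 0, x)
    atomic_max_noreturn(bounds, 1, x)
```

Only the allocator needs the “Return” variant, because each caller must learn where its allocation starts; the histogram and bounding box only need the final contents of memory.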
In a thread group-based (warp-based) GPU architecture, when an atomic instruction is executed, it is typically started for all the threads in the group at the same time. This can lead to any one of several common cases: all of the atomics go to the same memory address; some but not all of the atomics go to a single memory address; or the atomics go to unrelated memory locations. If multiple atomics are performed on the same memory location, they need to be serialized. This is necessary to maintain the observable semantic that each of them has been performed either in full or not at all.
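The cases described above, and the need for serialization, can be sketched as follows. The function names are hypothetical, “unrelated” is interpreted here as every thread targeting a distinct address, and serializing in thread order is an assumption of the sketch (any fixed order would preserve the atomic semantics).

```python
from collections import Counter


def classify_group(addresses):
    """Classify the target addresses of one thread group's atomic instruction."""
    largest = max(Counter(addresses).values())
    if largest == len(addresses):
        return "all same address"
    if largest > 1:
        return "some share an address"
    return "unrelated addresses"


def execute_group_add(memory, addresses, values):
    """Apply one atomic Integer Add per thread, one at a time (serialized).

    `memory` maps address -> value; returns the per-thread "Return" values.
    """
    returned = []
    for addr, value in zip(addresses, values):
        old = memory.get(addr, 0)
        memory[addr] = old + value  # each add completes before the next starts
        returned.append(old)
    return returned


memory = {0x10: 100}
same = [0x10, 0x10, 0x10, 0x10]
case_same = classify_group(same)
case_some = classify_group([0x10, 0x10, 0x20, 0x30])
case_none = classify_group([0x10, 0x20, 0x30, 0x40])
returned = execute_group_add(memory, same, [1, 2, 3, 4])
```

Because the four adds to address `0x10` are serialized, each thread observes a different old value, and the memory location ends up holding the initial value plus the sum of all four operands.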
The Applicants believe that there remains scope for improvements to the handling of atomic operations for thread groups, for example in graphics processing pipelines that include one or more shader stages.
Like reference numerals are used for like components where appropriate in the drawings.