The technology described herein relates to data processing systems, and in particular to the operation of graphics processing systems that include one or more programmable processing stages (“shaders”).
Graphics processing is typically carried out in a pipelined fashion, with one or more pipeline stages operating on the data to generate the final render output, e.g. frame that is displayed. Many graphics processing pipelines now include one or more programmable processing stages, commonly referred to as “shaders”. For example, a graphics processing pipeline may include one or more of, and typically all of, a geometry shader, a vertex shader and a fragment (pixel) shader. These shaders are programmable processing stages that execute shader programs on input data values to generate a desired set of output data (e.g. appropriately transformed and lit vertex data in the case of a vertex shader) for processing by the rest of the graphics pipeline and/or for output. The shaders of the graphics processing pipeline may share programmable processing circuitry, or they may each be distinct programmable processing units.
A graphics processing unit (GPU) shader core is thus a processing unit that performs graphics processing by running small programs for each graphics item in a graphics output to be generated such as a render target, e.g. frame (an “item” in this regard is usually a vertex or a fragment (pixel)). This generally enables a high degree of parallelism, in that a typical render output, e.g. frame, features a rather large number of vertices and fragments, each of which can be processed independently.
A shader program to be executed by a given “shader” of a graphics processing pipeline will be provided by the application that requires the graphics processing using a high level shader programming language, such as GLSL, HLSL, OpenCL C, etc. This shader program will consist of “expressions” indicating desired programming steps defined in the relevant language standards (specifications). The high level shader program is then translated by a shader language compiler to binary code for the target graphics processing pipeline. This binary code will consist of “instructions” which are specified in the instruction set specification for the given target graphics processing pipeline. The compilation process for converting the shader language expressions to binary code instructions may take place via a number of intermediate representations of the program within the compiler. Thus the program written in the high level shader language may be translated into a compiler specific intermediate representation (and there may be several successive intermediate representations within the compiler), with the final intermediate representation being translated into the binary code instructions for the target graphics processing pipeline.
A known way to improve shader execution efficiency is to group execution threads (where each thread corresponds, e.g., to one vertex or one fragment (pixel)) into “groups” or “bundles” of threads, where the threads of one group are run in lockstep, one instruction at a time, i.e. each thread in the group executes the same single instruction before moving onto the next instruction. This way, it is possible to share instruction fetch and scheduling resources between all the threads in the group. (Other terms used for such thread groups include “sub-groups”, “warps” and “wavefronts”. For convenience the term thread group will be used herein, but this is intended to encompass all equivalent terms and arrangements, unless otherwise indicated.)
In such thread groups, the shared instructions are executed for each thread and when memory access (e.g. for loads, stores and atomics) is required by a particular thread, a “lock” is obtained by the thread to ensure exclusive access to the data in memory being accessed by the thread, i.e. excluding the other threads from accessing the memory location, with each thread obtaining independent locks. However, because the threads cannot guarantee independent forward progress in all circumstances (because some steps executed by one thread may depend on other steps executed by other threads), a lock obtained by one thread may cause a deadlock for the whole thread group. This then causes one or more of the threads to spin indefinitely waiting for the lock to become available.
For example, the simple code sequence
 while( !mtx_lock = mtx.trylock( ) ){ // critical section}can deadlock non-intuitively.
This is because the sequence requires that the flow of code is divergent and so threads from the divergent path (i.e. with a different program counter) must be masked out from execution. In a system having a “lowest program counter first” scheduling policy, threads which did not obtain the lock would therefore be favoured, but these would then spin indefinitely waiting for the lock.
One solution to this problem is to re-order the code to suit the scheduling behaviour of the specific system being used, e.g. using the following code sequence:
 if( __my_warp_lane == 0 ){ while( !mtx.try_lock( ) );}// execute locked workif( __my_warp_lane == 0 ){ mtx.unlock( );}
However, this requires that the compiler does not optimise the instruction sequences, or has support for ordering semantics on the instructions.
The Applicants believe therefore that there remains scope for improvements to the handling of thread groups, for example in graphics processing pipelines that include one or more shader stages.