The technology described herein relates generally to the operation of data processing systems that include programmable processing stages, such as a graphics processing system that includes one or more programmable processing stages (“shaders”).
Graphics processing is typically carried out in a pipelined fashion, with one or more pipeline stages operating on the data to generate the final render output, e.g. a frame that is displayed. Many graphics processing pipelines now include one or more programmable processing stages, commonly referred to as “shaders”. For example, a graphics processing pipeline may include one or more of, and typically all of: a geometry shader, a vertex shader, and a fragment (pixel) shader. These shaders are programmable processing stages that execute shader programs on input data values to generate a desired set of output data, for example appropriately shaded and rendered fragment data in the case of a fragment shader, for processing by the rest of the graphics pipeline and/or for output. The shaders of the graphics processing pipeline may share programmable processing circuitry, or they may each be executed by distinct programmable processing units.
A graphics processing unit (GPU) shader core is thus a processing unit that performs graphics processing by running small programs for each graphics “work” item in a graphics output, such as a render target, e.g. frame, to be generated (an “item” in this regard is usually a vertex or a sampling position (e.g. in the case of a fragment shader)). This generally enables a high degree of parallelism, in that a typical render output, e.g. frame, features a rather large number of vertices and fragments, each of which can be processed independently.
In graphics shader operation, each work “item” will be processed by means of an execution thread which will execute the instructions of the shader program in question for the graphics work “item” in question.
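By way of illustration only, the relationship between work items, execution threads and a shader program might be sketched as follows. This is a simplified software model and not a description of any particular graphics processor: the `Thread` class, the three-instruction `SHADER_PROGRAM`, and the `load`, `mul` and `add` opcodes are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field

# Hypothetical toy "shader program": a sequence of (opcode, operand) pairs.
# The opcodes are invented for this sketch and model no real instruction set.
SHADER_PROGRAM = [("load", None), ("mul", 2.0), ("add", 1.0)]


@dataclass
class Thread:
    """One execution thread, created for exactly one work item."""
    item: float                      # input value of the work item (assumed scalar)
    regs: dict = field(default_factory=dict)


def execute(thread, program):
    """Execute every instruction of the program for this thread's work item."""
    for opcode, operand in program:
        if opcode == "load":
            thread.regs["r0"] = thread.item
        elif opcode == "mul":
            thread.regs["r0"] *= operand
        elif opcode == "add":
            thread.regs["r0"] += operand
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return thread.regs["r0"]


def run_shader(work_items, program=SHADER_PROGRAM):
    """Spawn one thread per work item; the items do not depend on each other."""
    return [execute(Thread(item), program) for item in work_items]


if __name__ == "__main__":
    print(run_shader([0.0, 1.0, 2.5]))
```

Because no thread reads another thread's registers in this sketch, the work items could be processed in any order or concurrently, which is the independence that the parallelism discussed above relies upon.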
The actual data processing operations that are performed by the shader program will be performed by respective functional units, such as arithmetic units, of the graphics processor, in response to, and under the control of, the instructions in the shader program being executed. Typically, there will be a plurality of functional units provided in a graphics processor (GPU), each of which can be respectively and appropriately activated and used for an execution thread when executing a shader program.
The functional units provided in a graphics processor for the purpose of performing operations required by shader programs (and otherwise) may be (and are typically) shared between different shaders (shader stages) of the graphics processing pipeline being executed by the graphics processor. Thus, there will typically be a set of functional units arranged for, and usable for, the execution of shader programs. It would also be possible to provide separate functional units for separate shaders (or for some functional units to be separate and for some to be shared), as desired.
Shader program execution efficiency may be improved by grouping execution threads (where each thread corresponds, e.g., to one vertex or one sampling position) into “groups” or “bundles” of threads, where the threads of one group are run in lockstep, one instruction at a time. In this way, it is possible to share instruction fetch and scheduling resources between all the threads in the group. Other terms used for such thread groups include “warps” and “wave fronts”. For convenience, the term “thread group” will be used herein, but this is intended to encompass all equivalent terms and arrangements, unless otherwise indicated.
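The sharing of instruction fetch between the threads of a thread group can be illustrated with a simple counting model. The sketch below is an assumption-laden software analogy only: the function names, the toy opcodes and the group size of four are hypothetical, and the fetch counter merely stands in for the instruction fetch and scheduling resources referred to above.

```python
def apply_op(opcode, acc, operand):
    """Apply one toy instruction to a thread's accumulator (hypothetical opcodes)."""
    if opcode == "mul":
        return acc * operand
    if opcode == "add":
        return acc + operand
    raise ValueError(f"unknown opcode: {opcode}")


def run_per_thread(program, inputs):
    """Each thread runs alone, so every thread fetches every instruction itself."""
    fetches = 0
    results = []
    for value in inputs:
        acc = value
        for opcode, operand in program:
            fetches += 1                       # one fetch per thread per instruction
            acc = apply_op(opcode, acc, operand)
        results.append(acc)
    return results, fetches


def run_thread_groups(program, inputs, group_size=4):
    """Threads run in lockstep groups; one fetch serves the whole group."""
    fetches = 0
    results = []
    for start in range(0, len(inputs), group_size):
        group = list(inputs[start:start + group_size])
        for opcode, operand in program:
            fetches += 1                       # one fetch per group per instruction
            # Every thread in the group executes the same instruction
            # before any thread moves on to the next one.
            group = [apply_op(opcode, acc, operand) for acc in group]
        results.extend(group)
    return results, fetches


if __name__ == "__main__":
    program = [("mul", 2), ("add", 1), ("mul", 3)]
    inputs = list(range(8))
    print(run_per_thread(program, inputs))
    print(run_thread_groups(program, inputs))
```

In this model both functions produce the same results, but for eight threads and a three-instruction program the per-thread version counts 24 fetches while the grouped version, with groups of four, counts 6.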
In a system where execution threads can be grouped into thread groups, the functional units for performing the processing operations in response to the instructions in a shader program are normally correspondingly operable so as to facilitate such thread group arrangements. For example, a functional unit may be arranged as respective execution lanes, one for each thread that a thread group may contain, so that the functional unit can execute the same instruction in parallel for each thread of a thread group. Thus, for example, in a system in which execution threads are grouped into groups (warps) of four threads, the functional units may each be operable as four respective (and identical) execution lanes.
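A functional unit arranged as execution lanes might be modelled, purely for illustration, as shown below. The `FunctionalUnit` class and its `issue` method are hypothetical names, the lane count of four simply follows the example given above, and the lanes are evaluated one after another in software whereas hardware lanes would operate in parallel.

```python
import operator


class FunctionalUnit:
    """Toy model of a functional unit with one execution lane per thread."""

    def __init__(self, num_lanes=4):
        self.num_lanes = num_lanes

    def issue(self, op, a_operands, b_operands):
        """Execute the same operation once in each lane.

        Lane i receives the operands of thread i of the thread group and
        produces the result for that thread.
        """
        if len(a_operands) != self.num_lanes or len(b_operands) != self.num_lanes:
            raise ValueError("expected one operand pair per execution lane")
        return [op(a, b) for a, b in zip(a_operands, b_operands)]


if __name__ == "__main__":
    unit = FunctionalUnit(num_lanes=4)
    # One add instruction, executed for all four threads of a thread group.
    print(unit.issue(operator.add, [1, 2, 3, 4], [10, 20, 30, 40]))
```

The check on operand count reflects the assumption, made only for this sketch, that every lane is supplied with operands on every issue; the sketch does not model thread groups containing fewer threads than there are lanes.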
The Applicants believe that there remains scope for improvements to the handling of thread groups, particularly for shaders of graphics processing pipelines.
Like reference numerals are used for like components where appropriate in the drawings.