The technology described herein relates to graphics processors, and in particular to the operation of graphics processors that include one or more programmable processing stages (“shaders”).
Graphics processing is typically carried out in a pipelined fashion, with one or more pipeline stages operating on the data to generate the final render output, e.g. frame that is displayed. Many graphics processing pipelines now include one or more programmable processing stages, commonly referred to as “shaders”. For example, a graphics processing pipeline may include one or more of, and typically all of, a geometry shader, a vertex shader and a fragment (pixel) shader. These shaders are programmable processing stages that execute shader programs on input data values to generate a desired set of output data (e.g. appropriately shaded and rendered fragment data in the case of a fragment shader) for processing by the rest of the graphics pipeline and/or for output. The shaders of the graphics processing pipeline may share programmable processing circuitry, or they may each be distinct programmable processing units.
A graphics processing unit (GPU) shader core is thus a processing unit that performs graphics processing by running small programs for each graphics item in a graphics output to be generated, such as a render target, e.g. frame (an “item” in this regard is usually a vertex or a sampling position (e.g. in the case of a fragment shader)). This generally enables a high degree of parallelism, in that a typical render output, e.g. frame, features a rather large number of vertices and fragments, each of which can be processed independently.
In graphics shader operation, each “item” will be processed by means of an execution thread which will execute the shader program in question for the graphics “item” in question.
A known way to improve shader execution efficiency is to group execution threads (where each thread corresponds, e.g., to one vertex or one sampling position) into “groups” or “bundles” of threads, where the threads of one group are run in lockstep, one instruction at a time. In this way, it is possible to share instruction fetch and scheduling resources between all the threads in the group. (Other terms used for such thread groups include “warps” and “wavefronts”. For convenience the term “thread group” will be used herein, but this is intended to encompass all equivalent terms and arrangements, unless otherwise indicated.)
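By way of illustration only, the lockstep arrangement described above can be sketched as follows. This is a simplified software model, not a description of any particular hardware: the `run_thread_group` function, the representation of a thread as a dictionary of registers, and the three toy instructions are all assumptions made for the example. The point shown is that each instruction is fetched once and then applied to every thread in the group.

```python
from typing import Callable, Dict, List

# Hypothetical model: an "instruction" is a function that updates one
# thread's register file (here simply a dict).
Instruction = Callable[[Dict[str, float]], None]


def run_thread_group(program: List[Instruction],
                     threads: List[Dict[str, float]]) -> int:
    """Run all threads of one group in lockstep.

    Returns the number of instruction fetches performed, which depends
    only on the program length and not on the number of threads.
    """
    fetches = 0
    for instruction in program:   # one fetch/schedule step per instruction
        fetches += 1
        for regs in threads:      # the same instruction is applied to every thread
            instruction(regs)
    return fetches


def load(regs):
    regs["acc"] = regs["x"]


def double(regs):
    regs["acc"] *= 2


def add_one(regs):
    regs["acc"] += 1


program = [load, double, add_one]
threads = [{"x": x} for x in range(4)]   # a group of four threads
fetches = run_thread_group(program, threads)
results = [t["acc"] for t in threads]
```

In this model, four threads execute three instructions each, but only three fetches are counted, which is the sharing of fetch and scheduling resources referred to above.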
In the case of fragment shaders (shaders that are run once for each sampling position in a primitive of a render output), a thread group typically contains one or more “quads”, where a “quad” is a 2×2 group of sampling positions. Organising sampling positions into quads in this manner is beneficial for some of the operations done in a fragment shader, such as sampling point-delta calculations and operations that depend on such sampling point-deltas, such as texture lookups with mipmapping.
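As a hedged illustration of why the 2×2 organisation is useful, the sketch below computes deltas between neighbouring sampling positions of a quad and derives a mipmap level from them. The function names and the quad layout are assumptions made for the example, and the level-of-detail formula (log2 of the larger delta length, clamped at zero) is a common simplification rather than the method of any particular implementation.

```python
import math


def quad_deltas(quad):
    """Return coarse horizontal and vertical deltas for a 2x2 quad.

    quad is [[top_left, top_right], [bottom_left, bottom_right]], where each
    entry is a (u, v) texture coordinate in texel units (an assumed layout).
    """
    (tl, tr), (bl, _br) = quad
    ddx = (tr[0] - tl[0], tr[1] - tl[1])   # difference to the horizontal neighbour
    ddy = (bl[0] - tl[0], bl[1] - tl[1])   # difference to the vertical neighbour
    return ddx, ddy


def mip_level(quad):
    """Simplified mipmap level: log2 of the longer delta, clamped at 0."""
    ddx, ddy = quad_deltas(quad)
    rho = max(math.hypot(*ddx), math.hypot(*ddy))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0


# Neighbouring sampling positions are 4 texels apart in each direction.
quad = [[(0.0, 0.0), (4.0, 0.0)],
        [(0.0, 4.0), (4.0, 4.0)]]
level = mip_level(quad)
```

Because the deltas are formed from neighbouring sampling positions, every position in the quad needs a value, which is what motivates processing the four positions together.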
The Applicants have recognised that in such arrangements, at the edge of a primitive, the 2×2 sampling position quad may straddle the edge of the primitive, with some of the sampling positions being inside the primitive and other sampling positions being outside. In this case, threads will still be started for all four sampling positions in the quad, but in effect only those sampling positions that are covered by the primitive will be “active” threads.
However, some fragment shading processing is still required for the threads that correspond to sampling positions that are outside the primitive (which will be referred to herein as “inactive” and/or “helper” threads for convenience). This is because, for example, some of the fragment shading results for such inactive (“helper”) threads are needed as inputs to some of the fragment shading operations, such as sampling position-delta type calculations. (The results computed by such helper threads are then discarded at the end of the fragment shader, as they are not otherwise required.)
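The role of a helper thread can be sketched as follows, purely as an illustration. The `shade_quad` function, the lane ordering and the linear `interpolate_u` function are hypothetical names introduced for this example. All four threads of the quad compute a value, the per-row deltas use the values of both threads in the row, and only the covered threads produce an output.

```python
# Assumed lane order within a quad: top-left, top-right, bottom-left, bottom-right.
LANE_OFFSETS = [(0, 0), (1, 0), (0, 1), (1, 1)]


def shade_quad(origin, coverage, interpolate_u):
    """Model one quad of a fragment shader with helper threads.

    origin        -- (x, y) of the quad's top-left sampling position
    coverage      -- four booleans, True for covered ("active") lanes
    interpolate_u -- hypothetical function (x, y) -> interpolated value
    """
    ox, oy = origin
    # Every lane runs, covered or not, so that each has a value available.
    u = [interpolate_u(ox + dx, oy + dy) for dx, dy in LANE_OFFSETS]
    # Horizontal delta per row: requires both lanes of that row.
    top = u[1] - u[0]
    bottom = u[3] - u[2]
    ddx = [top, top, bottom, bottom]
    # Only active lanes write a result; helper lane results are discarded.
    return {lane: (u[lane], ddx[lane]) for lane in range(4) if coverage[lane]}


coverage = [True, True, True, False]   # bottom-right lane is a helper thread
outputs = shade_quad((2, 2), coverage, lambda x, y: 0.5 * x + 0.25 * y)
```

Here the delta reported for the bottom-left lane is computed from the value produced by the bottom-right helper lane, even though that helper lane contributes no output of its own.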
FIG. 1 illustrates this situation.
FIG. 1 shows schematically an array 101 of sampling positions 102 of a graphics output to be generated, in which respective 2×2 groups of sampling positions are organised into “quads” 103 that will be processed in common.
When a primitive is rasterised, the rasteriser will identify those sampling position quads (groups) that are at least partially covered by the primitive, and issue a graphics fragment for rendering for each at least partially covered quad (group (set) of sampling positions). Each such fragment will then be processed by the fragment shader as a thread group (warp), with each covered sampling position being an “active” thread in the group (warp) and any uncovered sampling positions being inactive (“helper”) threads.
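A minimal sketch of this rasterisation step is given below, under stated assumptions: the primitive is a triangle, each sampling position lies at a pixel centre, coverage is tested with edge functions, and positions exactly on an edge are treated as covered (a real rasteriser would apply a fill rule instead). The function names are hypothetical. One fragment is issued per at least partially covered quad, together with a per-lane coverage mask distinguishing active lanes from helper lanes.

```python
# Assumed lane order within a quad: top-left, top-right, bottom-left, bottom-right.
LANE_OFFSETS = [(0, 0), (1, 0), (0, 1), (1, 1)]


def edge(a, b, p):
    """Signed area test: which side of the edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covered(tri, p):
    """True if p is inside the triangle, for either winding order."""
    a, b, c = tri
    e = (edge(a, b, p), edge(b, c, p), edge(c, a, p))
    return all(v >= 0 for v in e) or all(v <= 0 for v in e)


def rasterise_quads(tri, width, height):
    """Return [(quad_origin, coverage_mask)] for each partially covered quad."""
    fragments = []
    for qy in range(0, height, 2):
        for qx in range(0, width, 2):
            mask = [covered(tri, (qx + dx + 0.5, qy + dy + 0.5))
                    for dx, dy in LANE_OFFSETS]
            if any(mask):              # wholly uncovered quads are skipped
                fragments.append(((qx, qy), mask))
    return fragments


triangle = ((0.0, 0.0), (4.5, 0.0), (0.0, 4.5))
fragments = rasterise_quads(triangle, 4, 4)
helper_counts = [mask.count(False) for _, mask in fragments]
```

For this example triangle on a 4×4 array, three of the four quads are issued, and two of them carry one uncovered lane each, which would be run as a helper thread.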
FIG. 1 shows an exemplary primitive 104 overlaying the array of sampling positions. It can be seen from FIG. 1 that when the sampling position quad 105 falls to be processed for the primitive 104, three of the sampling positions within the 2×2 quad 105 will be “active” (i.e. represent sampling positions within the primitive 104), but the fourth sampling position 106 in the quad 105 will be “inactive” as it falls outside the edge of the primitive 104. Thus, when the fragment shader program is executed for the quad 105, the thread within the thread group for the uncovered sampling position 106 will be executed and treated as a “helper” thread.
The Applicants believe that there remains scope for improvements to the handling of thread groups, particularly in shaders of graphics processing pipelines.
Like reference numerals are used for like components where appropriate in the drawings.