The technology described herein relates to data processing systems, and in particular to the operation of graphics processing systems that include one or more programmable processing stages (“shaders”).
As is known in the art, graphics processing is typically carried out in a pipelined fashion, with one or more pipeline stages operating on the data to generate the final render output, e.g. frame that is displayed. Many graphics processing pipelines now include one or more programmable processing stages, commonly referred to as “shaders”. For example, a graphics processing pipeline may include one or more of, and typically all of, a geometry shader, a vertex shader and a fragment (pixel) shader. These shaders are programmable processing stages that execute shader programs on input data values to generate a desired set of output data (e.g. appropriately transformed and lit vertex data in the case of a vertex shader) for processing by the rest of the graphics pipeline and/or for output. The shaders of the graphics processing pipeline may share programmable processing circuitry, or they may each be distinct programmable processing units.
A graphics processing unit (GPU) shader core is thus a processing unit that performs graphics processing by running small programs for each graphics item in a graphics output to be generated, such as a render target, e.g. a frame (an “item” in this regard is usually a vertex or a fragment (pixel)). This generally enables a high degree of parallelism, in that a typical render output, e.g. a frame, features a rather large number of vertices and fragments, each of which can be processed independently.
As is known in the art, a shader program to be executed by a given “shader” of a graphics processing pipeline will be provided by the application that requires the graphics processing using a high-level shader programming language, such as GLSL, HLSL, OpenCL, etc. This shader program will consist of “expressions” indicating desired programming steps defined in the relevant language standards (specifications). The high-level shader program is then translated by a shader language compiler to binary code for the target graphics processing pipeline. This binary code will consist of “instructions” which are specified in the instruction set specification for the given target graphics processing pipeline. The compilation process for converting the shader language expressions to binary code instructions may take place via a number of intermediate representations of the program within the compiler, as is known in the art. Thus the program written in the high-level shader language may be translated into a compiler specific intermediate representation (and there may be several successive intermediate representations within the compiler), with the final intermediate representation being translated into the binary code instructions for the target graphics processing pipeline.
A known way to improve shader execution efficiency is to group execution threads (where each thread corresponds to one vertex or one fragment (pixel)) into “groups” or “bundles” of threads, where the threads of one group are run in lockstep, one instruction at a time. This way, it is possible to share instruction fetch and scheduling resources between all the threads in the group. (Other terms used for such thread groups include “warps” and “wavefronts”. For convenience the term thread group will be used herein, but this is intended to encompass all equivalent terms and arrangements, unless otherwise indicated.)
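By way of illustration only, the sharing of instruction fetches within a thread group can be sketched as follows. The two-opcode instruction format and the names `execute`, `run_independent` and `run_lockstep` are assumptions introduced for this sketch and do not correspond to any particular instruction set or hardware arrangement.

```python
def execute(op, arg, value):
    """Apply one toy instruction (hypothetical opcodes) to a thread's value."""
    if op == "add":
        return value + arg
    if op == "mul":
        return value * arg
    raise ValueError(f"unknown opcode: {op}")


def run_independent(program, inputs):
    """Each thread fetches every instruction for itself."""
    fetches = 0
    outputs = []
    for value in inputs:
        for pc in range(len(program)):
            op, arg = program[pc]   # one fetch per thread per instruction
            fetches += 1
            value = execute(op, arg, value)
        outputs.append(value)
    return outputs, fetches


def run_lockstep(program, inputs):
    """All threads of the group execute the same instruction together."""
    fetches = 0
    values = list(inputs)
    for pc in range(len(program)):
        op, arg = program[pc]       # one fetch shared by the whole group
        fetches += 1
        values = [execute(op, arg, v) for v in values]
    return values, fetches
```

In this sketch both functions produce the same per-thread results, but the lockstep version performs one fetch per instruction rather than one fetch per instruction per thread.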
As is known in the art, execution threads execute sequences of program instructions. To facilitate this operation, it is necessary to keep track of which instruction in the sequence a thread is currently executing. To achieve this, each thread normally has a program counter (PC) (also known as an instruction counter or an instruction pointer) associated with it which indicates the memory address of the instruction to be executed by the thread (and thereby where the thread is in its instruction sequence). (Thus instruction execution follows the flow: read the instruction at the address indicated by the program counter, execute the instruction, increment the program counter, and so on.)
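The read, execute and increment flow described above can be sketched, purely as an example, as a simple loop. The dictionary used as memory, the fixed instruction size of 4 bytes, the `add` and `halt` opcodes and the function name `run_thread` are all assumptions made for this sketch.

```python
def run_thread(memory, start_pc, instruction_size=4):
    """Run one thread until a 'halt' instruction is reached.

    memory maps an address to an (opcode, operand) pair; this layout and
    the fixed instruction size are assumptions made for illustration.
    """
    pc = start_pc          # program counter: address of the next instruction
    accumulator = 0
    trace = []             # addresses visited, in order
    while True:
        op, arg = memory[pc]        # read the instruction the PC points to
        trace.append(pc)
        if op == "halt":
            break
        if op == "add":             # execute the instruction
            accumulator += arg
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += instruction_size      # increment the PC to the next instruction
    return accumulator, trace
```

The returned trace shows the program counter advancing by one instruction per iteration.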
Each program counter effectively points to the address in memory where the relevant instruction is stored, and thus may comprise, e.g., a 48-bit virtual address. The program counter for a thread is normally a binary counter that may be stored, e.g. in a register file and that is appropriately updated as instructions are executed, so as to point to the instruction that should be executed by the thread.
In normal operation, instructions executed by execution threads are first loaded into an instruction cache, from where they are then read by the thread and executed. Such an instruction cache will include, as is known in the art, plural cache lines, with each cache line including a number of fields, typically an address field indicating the address in main memory where the instructions in the cache line were read from, a valid bit indicating that the cache line contains valid data, and then a data field which stores the instructions themselves.
When an instruction is to be executed by a thread, the thread will determine if the instruction is present in the instruction cache. If the instruction is present in the instruction cache (there is a cache “hit”), the instruction can then be executed. If the instruction is not present in the instruction cache (there is a cache “miss”), the thread is stalled until the relevant instruction is fetched into the cache.
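As a non-limiting sketch, a cache line with an address field, a valid bit and a data field, together with the hit and miss handling described above, might be modelled as follows. The fully associative lookup, the round-robin replacement, the use of line-aligned addresses as keys and the names `CacheLine` and `InstructionCache` are assumptions of this sketch; the stall on a miss is modelled simply as filling the line before returning.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    address: int = 0      # main-memory address the line was read from
    valid: bool = False   # set once the line holds valid data
    data: tuple = ()      # the instructions stored in the line


class InstructionCache:
    def __init__(self, num_lines, main_memory):
        self.lines = [CacheLine() for _ in range(num_lines)]
        self.main_memory = main_memory   # maps line address -> instructions
        self.next_victim = 0             # round-robin replacement (assumed)
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        """Return (hit, instructions) for the line at the given address."""
        for line in self.lines:
            if line.valid and line.address == address:
                self.hits += 1
                return True, line.data           # cache hit
        # Cache miss: the thread would stall here while the line is filled.
        self.misses += 1
        victim = self.lines[self.next_victim]
        self.next_victim = (self.next_victim + 1) % len(self.lines)
        victim.address = address
        victim.data = self.main_memory[address]
        victim.valid = True
        return False, victim.data
```

In this model the first access to a given line misses and fills the cache, and a repeated access to the same line then hits.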
FIGS. 1 and 2 show exemplary program counter storage 10 and an instruction cache 11 for an arrangement that can support the execution of n threads simultaneously. As shown in FIG. 1, there will be a separate program counter 12 for each thread in the program counter storage 10. FIG. 2 correspondingly shows the instruction cache 11, having n cache lines 13, each cache line storing, as discussed above, an address field 14, a valid bit 15 and data (the instructions) 16.
Graphics processors are usually highly multi-threaded. Each execution unit (shader) could, for example, have of the order of 128 active threads at any one time, with each of these threads having its own distinct program counter. Storing this many program counters (each being, e.g., a 48-bit virtual address) in flip-flops is too expensive, and so the program counters are usually stored in a register file. Instruction execution by a graphics shader will accordingly then comprise reading the program counter from the program counter storage register file, performing an instruction cache match, doing an instruction cache read, executing the instruction, and then writing the updated program counter to the program counter storage register file. Each of these steps consumes power.
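To give a rough sense of scale, the following sketch computes the storage needed for the program counters and tallies the per-instruction operations listed above. The figures of 128 threads and 48 bits are the examples given in the text; the function names and the tally itself are assumptions of this sketch, and it counts operations only and does not model actual power consumption.

```python
PC_BITS = 48        # example program counter width from the text
NUM_THREADS = 128   # example number of active threads from the text


def pc_storage_bits(num_threads=NUM_THREADS, pc_bits=PC_BITS):
    """Total bits needed to hold one program counter per thread."""
    return num_threads * pc_bits


def count_operations(num_threads, instructions_per_thread):
    """Tally the operations performed when each thread has its own PC."""
    counts = {
        "pc_reads": 0,
        "cache_matches": 0,
        "cache_reads": 0,
        "executions": 0,
        "pc_writes": 0,
    }
    for _ in range(num_threads):
        for _ in range(instructions_per_thread):
            counts["pc_reads"] += 1        # read PC from the register file
            counts["cache_matches"] += 1   # match the PC against the cache
            counts["cache_reads"] += 1     # read the instruction
            counts["executions"] += 1      # execute it
            counts["pc_writes"] += 1       # write the updated PC back
    return counts
```

With the example figures, the program counters alone occupy 6144 bits, and every instruction executed by every thread incurs one register file read and one register file write for its program counter.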
The Applicants believe that there remains scope for improvements to the handling of instruction execution in data processing systems, and in particular for thread groups in graphics processing pipelines that include one or more shader stages.
Like reference numerals are used for like components where appropriate in the drawings.