The technology described herein relates to data processing systems, and in particular to arrangements for the execution of graphics processing operations in a graphics processing unit of a graphics processing system.
Graphics processing is typically carried out in a pipelined fashion, with one or more pipeline stages operating on the data to generate the final render output, e.g. frame that is displayed. Many graphics processing pipelines now include one or more programmable processing stages, commonly referred to as “shaders”. For example, a graphics processing pipeline may include one or more of, and typically all of, a geometry shader, a vertex shader and a fragment (pixel) shader. These shaders are processing stages that execute shader programs on input data values to generate a desired set of output data (e.g. appropriately shaded and rendered fragment data in the case of a fragment shader) for processing by the rest of the graphics pipeline and/or for output.
A shader program to be executed by a given “shader” of a graphics processing pipeline will be provided by the application that requires the processing by the graphics processing pipeline using a high level shader programming language, such as GLSL, HLSL, OpenCL, etc. This shader program will consist of “expressions” indicating desired programming steps defined in the relevant language standards (specifications). The high level shader program is then translated by a shader language compiler to binary code for the target graphics processing pipeline. This binary code will consist of “instructions” which are specified in the instruction set specification for the given target graphics processing pipeline.
Thus, references to “expressions” herein, unless the context otherwise requires, refer to shader language constructions that are to be compiled to a target graphics processor binary code (i.e. are to be expressed in hardware micro instructions). (Such shader language constructions may, depending on the shader language in question, be referred to as “expressions”, “statements”, etc. For convenience, the term “expressions” will be used herein, but this is intended to encompass all equivalent shader language constructions such as “statements” in GLSL.) “Instructions” correspondingly refer to the actual hardware instructions (code) that are emitted to perform an “expression”.
A graphics shader performs processing by running small programs for each graphics “work item” in a graphics output to be generated, such as a render target, e.g. frame. A “work item” in this regard is usually a vertex or a sampling position (e.g. in the case of a fragment shader). Where the graphics processing pipeline is being used for “compute shading” (e.g. under OpenCL or DirectCompute), the graphics work items will be appropriate compute shading work items. This generally enables a high degree of parallelism, in that a typical render output, e.g. frame, contains a large number of graphics work items (e.g. vertices or fragments), each of which can be processed independently.
In graphics shader operation, each “work item” is processed by means of an execution thread which will execute the shader program in question for the graphics “work item” in question.
The Applicants have recognised that many graphics shader programs will include calculations (expressions) that will produce identical values for sets of plural threads to be executed (e.g. for every thread in a draw call).
For example, the OpenGL ES vertex shader:
    uniform mat4 a;
    uniform mat4 b;
    uniform mat4 c;
    attribute vec4 d;
    void main() {
        gl_Position = a * b * c * d;
    }
will produce identical values for the computation of “a*b*c” for each thread (where each thread represents a given vertex), as the data inputs a, b, c are uniform variables, i.e. variables that are defined externally to the shader program and so are constant within the shader program.
FIG. 1 shows schematically the execution of multiple threads in parallel in a shader program. The shader program includes “common” calculations (expressions) 1, i.e. expressions that will produce the same value(s) each time they are executed for some or all threads in a group of threads that are executing the shader program, and whose inputs comprise only uniform variables (variables that are defined externally to the shader program and so are constant within the shader program). These are followed by “non-common” or “per thread” calculations (expressions) 2, i.e. expressions that will (potentially) produce different value(s) for each thread in the group, and whose inputs comprise non-uniform variables or attributes (i.e. inputs that can potentially vary from thread to thread) together with the results of the common calculations 1.
As shown in FIG. 1, when the shader program is executed for multiple threads (thread 0 to thread 3) in parallel, one or more uniform variables are read from a memory 3 in which the uniform variables are stored, and each thread independently executes the common calculations (expressions) 1. The results of the common calculations 1 are stored in each thread's local register or registers 4. One or more other non-uniform variables or attributes are then read for each thread from an attribute memory 5 in main memory, each thread executes the non-common calculations (expressions) 2, and the results are stored in each thread's local register or registers 4. The final result is then written out 6 for each thread to a result memory 7 in main memory.
As the inputs to the common calculations (expressions) 1 comprise only uniform variables, the results of these calculations will be identical for all of the threads. Thus, if the computation of the common calculations (expressions) 1 could be executed once and the result shared between the plural threads, the execution of the shader program could be made more efficient.
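By way of illustration only, the redundancy in the FIG. 1 arrangement can be sketched in Python as follows. This is a simplified model rather than shader or hardware code: the function names (mat_mul, mat_vec, run_per_thread), the use of 2x2 integer matrices in place of mat4 uniforms, and the sample input values are all assumptions made for the example. Each loop iteration stands in for one thread, and a counter records how many times the common calculation “a*b*c” is evaluated.

```python
# Simplified model of FIG. 1: every thread repeats the common calculation.
# All names and values are hypothetical; 2x2 matrices stand in for mat4.

def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_vec(m, v):
    """Multiply a square matrix by a column vector given as a list."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def run_per_thread(uniforms, attributes):
    a, b, c = uniforms
    common_evaluations = 0
    results = []
    for d in attributes:                    # one iteration models one thread
        abc = mat_mul(mat_mul(a, b), c)     # common calculation 1
        common_evaluations += 1
        results.append(mat_vec(abc, d))     # non-common calculation 2
    return results, common_evaluations


UNIFORMS = ([[1, 2], [3, 4]], [[0, 1], [1, 0]], [[2, 0], [0, 2]])
ATTRIBUTES = [[1, 0], [0, 1], [1, 1], [2, 3]]

results, common_evaluations = run_per_thread(UNIFORMS, ATTRIBUTES)
print(common_evaluations, results)
```

In this sketch the common calculation is evaluated once per modelled thread, although every evaluation yields the same matrix.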
The Applicants have previously proposed, in their earlier UK patent application no. GB A 2516358, the use of a “pilot” shader program that executes once those expressions that will produce identical values for a set of plural threads (e.g. for a draw call), followed by a “main” shader program which is executed for each thread and uses the results of the “pilot” shader program instead of recalculating the common expressions each time.
This is illustrated by FIG. 2. As shown in FIG. 2, instead of each thread independently executing the common calculations (expressions) 1, a single “pilot thread” executes the common calculations 1, and the result is stored in main memory 8. This result is then shared between each of the plural threads by loading the result from main memory 8 into each thread's register 4. Each thread can then read one or more other non-uniform variables or attributes as appropriate from the attribute memory 5, execute the non-common calculations 2, and store the results in the thread's register 4. Again, the final results can be written out 6 for each thread to a result memory 7.
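By way of illustration only, the FIG. 2 arrangement can be modelled in Python as follows. This is a sketch under stated assumptions, not the implementation of GB A 2516358: the names (mat_mul, mat_vec, run_with_pilot, the "pilot_result" key), the dictionary standing in for main memory 8, the 2x2 integer matrices in place of mat4 uniforms, and the sample values are all hypothetical.

```python
# Simplified model of FIG. 2: a pilot thread evaluates the common
# calculation once and the main threads load its result.
# All names and values are hypothetical; 2x2 matrices stand in for mat4.

def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_vec(m, v):
    """Multiply a square matrix by a column vector given as a list."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def run_with_pilot(uniforms, attributes):
    a, b, c = uniforms
    main_memory = {}                        # stands in for main memory 8

    # Pilot thread: common calculation 1, executed once.
    main_memory["pilot_result"] = mat_mul(mat_mul(a, b), c)
    common_evaluations = 1

    results = []
    for d in attributes:                    # one iteration models one main thread
        register = main_memory["pilot_result"]   # load into the thread's register 4
        results.append(mat_vec(register, d))     # non-common calculation 2
    return results, common_evaluations


UNIFORMS = ([[1, 2], [3, 4]], [[0, 1], [1, 0]], [[2, 0], [0, 2]])
ATTRIBUTES = [[1, 0], [0, 1], [1, 1], [2, 3]]

results, common_evaluations = run_with_pilot(UNIFORMS, ATTRIBUTES)
print(common_evaluations, results)
```

In this model the per-thread results are unchanged, but the common calculation is evaluated once regardless of the number of main threads, at the cost of one load from main memory per thread.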
FIG. 3 shows the relevant functional units of a data processing system that are used to perform (and thus to act as) various ones of the processing operations described in relation to FIGS. 1 and 2.
As shown in FIG. 3, the data processing system includes a thread group generator 10, a thread generator 11, a thread scheduler 12, a (programmable) synchronous thread execution unit 13, a message passing unit 14, and a load/store unit 15 having an associated queue 16. Also shown in FIG. 3 are the register memory 4 that comprises each thread's register or registers, the main (off-chip) memory 8, and a further memory 3 that is used to store uniform variables to be used by executing threads, together with an associated preload unit 17.
The thread group generator 10 is operable to generate groups of threads for execution by the thread execution unit 13. As part of this operation, as shown in FIG. 3, the thread group generator 10 will cause one or more uniform variables for the thread group to be loaded into the memory 3 from the main memory 8 via the preload unit 17. The thread generator 11 is operable to generate (spawn) individual execution threads of each thread group. The thread scheduler 12 is operable to control the timing of the execution of the threads generated by the thread generator 11 (e.g. in the process of FIG. 2, the thread scheduler ensures that the main graphics work item threads are executed after the execution of the pilot thread has been completed).
The thread execution unit 13 operates to execute shader programs to perform the shader operations of the graphics processing pipeline. To do this, it receives execution threads from the thread scheduler 12 and executes the relevant shader program for those execution threads. As part of this operation, and as shown in FIG. 3, the execution threads can read uniform variables from the memory 3 and can read data from and write data to respective registers 4, in a synchronous manner (i.e. such that the shader program execution for a thread does not continue until the read or write operation has been completed).
The execution threads can also read data from and write data to the main memory 8 in an asynchronous manner (i.e. without the shader program execution for a thread waiting for the read or write operation to complete). This is done by sending requests to the load/store unit 15 via the message passing unit 14 and the queue 16. Data to be stored in main memory 8 is written from the register memory 4. Data read from main memory 8 is loaded into the register memory 4 via the message passing unit 14, from where it can be read synchronously by the execution threads.
Thus, for example, in the process shown in FIG. 2, once the execution unit has executed the pilot thread, the result of the pilot thread is stored in main memory 8 by sending a request to the load/store unit 15 via the message passing unit 14 and the queue 16. The message passing unit 14 also informs the thread scheduler 12 that the execution of the pilot thread has completed, so that the main graphics work item threads can then be executed. As part of their execution, the result from the pilot thread is shared between each of the plural main threads by loading the result from main memory 8 into each main thread's register 4.
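By way of illustration only, the ordering described above can be sketched as a small Python model. The class and method names are hypothetical and merely echo the functional units of FIG. 3; scalar values stand in for the uniform variables and attributes; and the model assumes, as a simplification not stated in the description, that the queued store has been serviced before any main thread loads the pilot result.

```python
from collections import deque


class LoadStoreUnit:
    """Models load/store unit 15 with its queue 16 in front of main memory 8."""

    def __init__(self):
        self.queue = deque()
        self.main_memory = {}

    def drain(self):
        # Service queued store requests in order; stores land in memory here.
        while self.queue:
            address, value = self.queue.popleft()
            self.main_memory[address] = value


class ThreadScheduler:
    """Models thread scheduler 12: holds main threads until the pilot is done."""

    def __init__(self, main_threads):
        self.waiting = list(main_threads)
        self.pilot_done = False

    def release(self):
        if not self.pilot_done:
            return []
        released, self.waiting = self.waiting, []
        return released


class MessagePassingUnit:
    """Models message passing unit 14."""

    def __init__(self, load_store_unit, scheduler):
        self.load_store_unit = load_store_unit
        self.scheduler = scheduler

    def store(self, address, value):
        # Asynchronous: the request is only queued; the caller does not wait.
        self.load_store_unit.queue.append((address, value))

    def pilot_completed(self):
        self.scheduler.pilot_done = True


def simulate(uniforms, attributes):
    lsu = LoadStoreUnit()
    scheduler = ThreadScheduler(attributes)
    mpu = MessagePassingUnit(lsu, scheduler)

    released_early = scheduler.release()    # pilot not finished: nothing runs

    # Pilot thread: evaluate the common expression once and request a store.
    common = 1
    for u in uniforms:
        common *= u
    mpu.store("pilot_result", common)
    lsu.drain()                             # assumption: serviced before the loads
    mpu.pilot_completed()

    results = []
    for d in scheduler.release():           # main threads are now released
        register = lsu.main_memory["pilot_result"]   # load into register 4
        results.append(register * d)        # per-thread calculation
    return released_early, results


released_early, results = simulate(uniforms=[2, 3, 4], attributes=[1, 5, 10])
print(released_early, results)
```

The model makes the dependency explicit: no main thread is released until the scheduler has been informed that the pilot thread has completed, and every main thread then pays for a load of the shared result from main memory.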
Although the arrangements described above and in GB A 2516358 result in more efficient execution of the shader program, the Applicants believe that there remains scope for improved arrangements for graphics processing units that execute shader programs.
Like reference numerals are used for like components where appropriate in the drawings.