The technology described herein relates to data processing systems, and in particular to the operation of graphics processing systems that include one or more programmable processing stages (“shaders”).
As is known in the art, graphics processing is typically carried out in a pipelined fashion, with one or more pipeline stages operating on the data to generate the final render output, e.g. frame that is displayed. Many graphics processing pipelines now include one or more programmable processing stages, commonly referred to as “shaders”. For example, a graphics processing pipeline may include one or more of, and typically all of, a geometry shader, a vertex shader and a fragment (pixel) shader. These shaders are programmable processing stages that execute shader programs on input data values to generate a desired set of output data (e.g. appropriately transformed and lit vertex data in the case of a vertex shader) for processing by the rest of the graphics pipeline and/or for output. The shaders of the graphics processing pipeline may share programmable processing circuitry, or they may each be distinct programmable processing units.
A graphics processing unit (GPU) shader core is thus a processing unit that performs graphics processing by running small programs for each graphics item in a graphics output to be generated such as a render target, e.g. frame (an “item” in this regard is usually a vertex or a fragment (pixel)). This generally enables a high degree of parallelism, in that a typical render output, e.g. frame, features a rather large number of vertices and fragments, each of which can be processed independently.
As is known in the art, a shader program to be executed by a given “shader” of a graphics processing pipeline will be provided by the application that requires the graphics processing using a high-level shader programming language, such as GLSL, HLSL or OpenCL. This shader program will consist of “expressions” indicating desired programming steps defined in the relevant language standards (specifications). The high-level shader program is then translated by a shader language compiler to binary code for the target graphics processing pipeline. This binary code will consist of “instructions” which are specified in the instruction set specification for the given target graphics processing pipeline. The compilation process for converting the shader language expressions to binary code instructions may take place via a number of intermediate representations of the program within the compiler, as is known in the art. Thus the program written in the high-level shader language may be translated into a compiler-specific intermediate representation (and there may be several successive intermediate representations within the compiler), with the final intermediate representation being translated into the binary code instructions for the target graphics processing pipeline.
It is common in modern graphics processing units to have instructions to perform “atomic” memory operations. An “atomic” memory operation is an operation sequence that reads a memory location, performs an arithmetic operation between the memory value and an operand value (typically a register value), and then writes the result of the arithmetic operation back to the same memory location. This sequence of operations is carried out so that to every observer, it looks as if the sequence has either not been performed at all, or been performed in its entirety. It is executed as one indivisible unit, hence the name “atomic”.
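By way of illustration only, the read, arithmetic and write-back sequence described above can be sketched on a CPU using the C++ standard library. This is a minimal sketch and not the instruction set of any particular graphics processing pipeline; the function names add_not_atomic, add_atomic and check_add are hypothetical and are introduced here solely for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// The three steps spelled out without atomicity, for comparison only.
// Another thread could modify `location` between the read and the write.
inline std::uint32_t add_not_atomic(std::uint32_t& location, std::uint32_t operand) {
    const std::uint32_t old_value = location;          // read the memory location
    const std::uint32_t result = old_value + operand;  // arithmetic with the operand
    location = result;                                 // write back to the same location
    return old_value;
}

// The same sequence as one indivisible unit: no observer can see a state
// in which the read has happened but the write has not.
inline std::uint32_t add_atomic(std::atomic<std::uint32_t>& location, std::uint32_t operand) {
    return location.fetch_add(operand);  // returns the value held before the add
}

// Single-threaded check that both forms compute the same values.
inline void check_add() {
    std::uint32_t plain = 123;
    std::atomic<std::uint32_t> shared{123};
    const std::uint32_t from_plain = add_not_atomic(plain, 1);
    const std::uint32_t from_atomic = add_atomic(shared, 1);
    assert(from_plain == from_atomic);
    assert(plain == shared.load());
}
```

The two forms give the same result when only one thread is running; they differ only when several threads target the same location, which is the case that the atomic form exists to handle.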
The arithmetic operation that is performed as part of the atomic is usually a simple operation that is both commutative and associative. Common examples of such operations are: Integer Add; Bitwise AND, OR, XOR; Minimum Value; Maximum Value.
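The practical consequence of the operation being commutative and associative is that the final memory value does not depend on the order in which the individual atomics are performed. The following is a minimal sketch of that property for Integer Add, assuming a CPU model built on std::atomic; the helper names apply_adds, order_independent and check_order are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Applies each operand with an atomic add, in the given order, and returns
// the final memory value.
inline std::uint32_t apply_adds(std::uint32_t initial,
                                const std::vector<std::uint32_t>& operands) {
    std::atomic<std::uint32_t> location{initial};
    for (std::uint32_t operand : operands) location.fetch_add(operand);
    return location.load();
}

// Returns true if every ordering of the operands yields the same final value.
inline bool order_independent(std::uint32_t initial, std::vector<std::uint32_t> operands) {
    std::sort(operands.begin(), operands.end());  // start from the first permutation
    const std::uint32_t expected = apply_adds(initial, operands);
    while (std::next_permutation(operands.begin(), operands.end())) {
        if (apply_adds(initial, operands) != expected) return false;
    }
    return true;
}

// Example use with four operands (24 orderings).
inline void check_order() {
    const bool same = order_independent(123, {1, 2, 3, 4});
    assert(same);
}
```

Only the final memory value is order-independent; the old values returned by “Return” type atomics still depend on the order in which the atomics are performed.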
Other atomic operations that are common but not commutative are:
Swap (also known as Exchange): the register value is written to the memory location and the old value of the memory location is then returned to the shader (this operation is associative);
Compare-And-Swap (Compare and Exchange): two register values are provided, the memory value is compared to the first of them, and if they compare equal, then the second register value is written to the memory location (this operation is associative if the comparison value is held constant).
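The two operations listed above can be sketched using std::atomic as follows. This is an illustrative CPU analogue only; the wrapper names atomic_swap, atomic_cas and check_swap_order are hypothetical, and the choice to have the Compare-And-Swap wrapper return the observed memory value is an assumption made for the example rather than something stated above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Swap (Exchange): store the register value, return the old memory value.
inline std::uint32_t atomic_swap(std::atomic<std::uint32_t>& location, std::uint32_t reg) {
    return location.exchange(reg);
}

// Compare-And-Swap: write `desired` only if the memory value equals `expected`.
// Returns the memory value seen by the comparison (an assumption of this sketch),
// so the write happened exactly when the returned value equals `expected`.
inline std::uint32_t atomic_cas(std::atomic<std::uint32_t>& location,
                                std::uint32_t expected, std::uint32_t desired) {
    // On failure, compare_exchange_strong stores the observed value in `expected`;
    // on success, `expected` is unchanged and already equals the old value.
    location.compare_exchange_strong(expected, desired);
    return expected;
}

// Swap is not commutative: the final value depends on which swap runs last.
inline void check_swap_order() {
    std::atomic<std::uint32_t> first_order{0};
    std::atomic<std::uint32_t> second_order{0};
    atomic_swap(first_order, 1);
    atomic_swap(first_order, 2);
    atomic_swap(second_order, 2);
    atomic_swap(second_order, 1);
    assert(first_order.load() == 2);
    assert(second_order.load() == 1);
}
```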
Atomic memory operations typically occur in two variants: “Return” and “NoReturn”. The “Return” variants return to the shader the datum that was present at the memory location before the “atomic” operation was performed. The “NoReturn” variants do not return any value to the shader. Typically, in a GPU, all of the operations listed above are available in both variants (except Swap, which is not available in the NoReturn variant).
Typical examples of use-cases for GPU atomics include incremental memory allocators (Integer Add of the “Return” type), histograms (Integer Add of the “NoReturn” type) and bounding-box computation (Minimum/Maximum Value of the “NoReturn” type).
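The three use-cases mentioned above can be sketched as follows. This is a hedged CPU illustration using std::atomic, not shader code; the names allocate, count_sample, atomic_min and check_use_cases, the four-bin histogram and the modulo binning rule are all hypothetical choices made for the example. The Minimum Value atomic is emulated with a compare-and-swap loop, which is a substitute technique and not a native minimum instruction.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Incremental allocator: Integer Add of the "Return" type. The returned old
// value is the start offset of the block reserved for the caller.
inline std::uint32_t allocate(std::atomic<std::uint32_t>& next_free, std::uint32_t size) {
    return next_free.fetch_add(size);
}

// Histogram: Integer Add of the "NoReturn" type. The old count is ignored.
inline void count_sample(std::array<std::atomic<std::uint32_t>, 4>& bins,
                         std::uint32_t sample) {
    bins[sample % bins.size()].fetch_add(1);
}

// Bounding box: Minimum Value of the "NoReturn" type, emulated with a
// compare-and-swap loop because std::atomic has no fetch-min operation
// in C++11 through C++23.
inline void atomic_min(std::atomic<std::int32_t>& location, std::int32_t value) {
    std::int32_t seen = location.load();
    while (value < seen && !location.compare_exchange_weak(seen, value)) {
        // A failed exchange reloads `seen`; retry while `value` is still smaller.
    }
}

// Two allocations receive consecutive, non-overlapping offsets.
inline void check_use_cases() {
    std::atomic<std::uint32_t> next_free{0};
    const std::uint32_t first = allocate(next_free, 16);
    const std::uint32_t second = allocate(next_free, 8);
    assert(first == 0);
    assert(second == 16);
    assert(next_free.load() == 24);
}
```

The allocator needs the “Return” variant because each caller must learn which offset it was given, whereas the histogram and bounding-box updates only need the memory location to end up with the right value.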
If multiple atomics are performed on the same memory location, they need to be serialized. This is necessary to maintain the observable semantic that each of them has been performed either in full or not at all. This can then cause throughput issues in systems with multiple data caches and a coherency protocol.
For example, in such an arrangement, a processing core wishing to perform atomic operations to a given memory location will first need to fetch from memory the cache line storing the data for the memory location in question into a line of its cache. Then, when the cache line is present in the cache, the atomic operations can be performed sequentially, one at a time, using the data in the cache line. However, while the cache line is resident in the cache of the core in question, other cores that use different caches and that require the data (the cache line) in question cannot perform their atomic operations, such that the atomic operations for those cores must be stalled until the cache line is released by the cache that currently has it (such that it can then be fetched into the cache of another core).
Thus, while a core is waiting for the cache line (i.e. the cache line is not present in its cache), the atomic operations for that core are stalled and cannot be performed at all. Similarly, when the cache line is travelling from one cache to another, no atomic operations can be performed on the data in the cache line at all.
This all has the effect that throughput of atomic operations to the same memory location in systems that use multiple caches can be relatively slow, as the processing core or cores have to wait for the cache line to travel to their respective caches before the atomic operations can be performed.
FIGS. 1 and 2 illustrate this operation.
FIG. 1 shows an exemplary data processing system that has four processing cores 40, 41, 42, 43 each with a respective level one (L1) cache 44, 45, 46, 47. Each processing core is capable of performing atomic operations, and when an atomic operation is to be performed to a given memory location, a cache line containing the data at that memory location is first loaded from a main memory 48 into a cache line in the respective level one cache of the core in question. As discussed above, if more than one of the processing cores 40, 41, 42, 43 is performing an atomic operation to the same location in the main memory 48, then the cache line from the main memory 48 will need to be loaded into the respective level one caches of the cores that need it in a serial fashion in order to maintain the necessary atomic operation semantics. This then has the effect that if the cache line is in a respective level one cache of one of the processing cores, any other processing core that requires that cache line must stall its atomic operations until the cache line is fetched into the cache for the core in question.
FIGS. 2A and 2B illustrate this operation and show, by way of example, a set 50 of threads A, B, C, D that are to perform an atomic operation (which in this case is an atomic ADD operation) for one of the processing cores 40. At the start of the process, the threads 50 to be executed are ready for execution by the core 40, but it is assumed that the relevant cache line (data) from the main memory has not been fetched into the cache 41 for the processing core 40.
A request 57 is sent to fetch the relevant cache line into the cache 41 for the processing core 40. At the same time, space 58 for the fetched cache line is set aside in the cache 41 (but the cache line has not yet arrived).
Then, some time later, the requested cache line 51 that contains the data at the memory location (address) in question arrives in the data cache 41 for the processing core 40. The first atomic operation 52 (for the thread A) is then executed. As shown in FIG. 2, the atomic operation for the thread A 52 is to increment the value in the memory address by one (+1), and so the value in the cache line 51 for the memory location in question is incremented by one (to “124” in this case).
Thread A also returns the previous value 54 (“123” in this case) for the memory address to the processing core (as in this case it is assumed that the atomic operation is of the return type).
It is then assumed, as shown in FIG. 2A, that a snoop request 59 is made for the data in the cache line 51 (i.e. that another part of the system requests the data in the cache line 51), and that in response to this, the cache line 51 is evicted 33 from the data cache 41 for the processing core 40. The atomic operations for the threads B, C and D must therefore be stalled, as the data cache 41 for the processing core 40 no longer contains the relevant cache line. Thus, as shown in FIG. 2A, a further request 32 to fetch the cache line must be sent.
Then, once the relevant cache line arrives again in the data cache 41 for the processing core 40 in question, the next thread, thread B, executes its atomic add operation 53 to increment the value in the cache line for the memory address by two (+2), and also returns the previous value 56 (in this case “124”) for the memory address to the processing core.
It is then assumed that a snoop request again causes the cache line to be evicted from the data cache 41, and so before the thread C can perform its atomic operation, a further fetch request must be sent and the thread must wait until the cache line arrives in the data cache 41.
Again, once the cache line arrives in the data cache for the processing core, the next thread C can perform its atomic addition operation 55 and return its relevant output value, and so on for the thread D, as shown in FIG. 2B.
At the end of this operation, the cache line in the data cache 41 for the processing core 40 will store the accumulated result of the atomic operations, which can then be returned to main memory and/or passed to the cache for another core for that core to perform its atomic operations on the memory location in question.
It can be seen that in this operation, if a large number of snoop requests are received while the atomic operations are being performed, the throughput of the atomic operations can be considerably reduced.
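The sequence walked through above can be modelled with a short, single-threaded simulation that counts how many times the cache line has to be fetched. This is a simplified sketch only: the type Trace and the functions run_atomics and check_walkthrough are hypothetical, each fetch is counted rather than timed, and the increments used for the threads C and D (+3 and +4) are assumed values, since only the initial value “123” and the increments for the threads A (+1) and B (+2) are given in the description.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Trace {
    std::vector<std::uint32_t> returned;  // old value returned to each thread
    std::uint32_t final_value;            // accumulated value in the cache line
    std::uint32_t fetches;                // number of cache line fetch requests
};

// Runs the queued atomic adds for one core. `evict_after_each` models a snoop
// request evicting the line after every atomic operation, as in the walkthrough.
inline Trace run_atomics(std::uint32_t memory_value,
                         const std::vector<std::uint32_t>& increments,
                         bool evict_after_each) {
    Trace trace{{}, memory_value, 0};
    bool line_present = false;
    for (std::uint32_t inc : increments) {
        if (!line_present) {  // line absent: the atomic stalls until it is fetched
            ++trace.fetches;
            line_present = true;
        }
        trace.returned.push_back(trace.final_value);  // "Return" type atomic
        trace.final_value += inc;
        if (evict_after_each) line_present = false;   // snoop evicts the line
    }
    return trace;
}

// Threads A and B use the values from the description; C and D are assumed.
inline void check_walkthrough() {
    const Trace snooped = run_atomics(123, {1, 2, 3, 4}, true);
    assert(snooped.returned[0] == 123);  // value returned to thread A
    assert(snooped.returned[1] == 124);  // value returned to thread B
    assert(snooped.fetches == 4);        // one fetch per atomic operation
}
```

With an eviction after every atomic, the model needs one fetch per thread, whereas the same four atomics need only a single fetch if the line stays resident (evict_after_each set to false). The accumulated result is the same in both cases; only the number of stalls differs.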
The Applicants believe that there remains scope for improvements to the handling of atomic operations in systems that use multiple, e.g. L1, caches, for example in graphics processing pipelines that include one or more shader stages.
Like reference numerals are used for like components where appropriate in the drawings.