Graphics pipelines may be used to generate pixels for display on a screen. For example, a graphics pipeline may accept a representation of an image as an input and generate pixel representations of the image. In one example, the graphics pipeline is represented as a series of stages, wherein one such stage is a program or circuitry called a pixel shader. A pixel shader may receive vertex data interpolated across “primitives” (e.g., triangles) and output pixel colors based on the interpolated vertex data.
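By way of illustration only, the relationship between interpolated vertex data and the pixel shader output may be sketched as follows. The `Color` type, the `interpolate` and `shadePixel` functions, and the `brightness` parameter are hypothetical names introduced for this sketch and are not taken from any particular graphics API; the sketch assumes that per-vertex colors are blended with barycentric weights before the shader runs.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical color type for illustration; not part of any graphics API.
struct Color {
    float r, g, b;
};

// Blends the three per-vertex colors of a triangle using barycentric
// weights. This stands in for the interpolation that may happen before
// the pixel shader is invoked. The weights are assumed to sum to 1.
Color interpolate(const std::array<Color, 3>& v, const std::array<float, 3>& w) {
    assert(std::fabs(w[0] + w[1] + w[2] - 1.0f) < 1e-4f);
    return Color{
        w[0] * v[0].r + w[1] * v[1].r + w[2] * v[2].r,
        w[0] * v[0].g + w[1] * v[1].g + w[2] * v[2].g,
        w[0] * v[0].b + w[1] * v[1].b + w[2] * v[2].b,
    };
}

// Toy "pixel shader": receives the interpolated color for one pixel and
// outputs the final pixel color, here scaled by an assumed brightness
// factor and clamped to the range [0, 1].
Color shadePixel(const Color& interpolated, float brightness) {
    auto clamp01 = [](float c) { return c < 0.0f ? 0.0f : (c > 1.0f ? 1.0f : c); };
    return Color{
        clamp01(interpolated.r * brightness),
        clamp01(interpolated.g * brightness),
        clamp01(interpolated.b * brightness),
    };
}
```

In this sketch, a pixel located exactly at a vertex (weights 1, 0, 0) would receive that vertex's color, while a pixel elsewhere in the triangle would receive a weighted mix of all three.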
Multiple pixel shader invocations may operate concurrently in order to achieve data parallelism in graphics devices. For example, a single pixel shader invocation might calculate the color (and potentially other attributes) of a single pixel on a screen, wherein all pixels on the screen may be computed in parallel. Moreover, multiple pixel shader invocations operating in parallel may refer to the same screen location (e.g., the same x, y coordinates), for example, when primitives overlap. While each pixel shader invocation may be independent of other pixel shader invocations, graphics devices may guarantee that writes to render target resources are processed in a particular order.
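The ordering guarantee for render target writes may be illustrated with the following sketch, which models only the observable result and not any hardware mechanism. It assumes that the guaranteed order is the order in which primitives were submitted, and that writes are opaque (a later primitive overwrites an earlier one). The `PendingWrite` structure and the `resolveInOrder` function are hypothetical names introduced for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one pixel shader result awaiting commit.
struct PendingWrite {
    std::uint32_t primitiveId;  // assumed submission order of the primitive
    std::uint32_t x, y;         // screen location targeted by the write
    std::uint32_t color;        // packed color value
};

// Commits shader results to the render target in primitive submission
// order, regardless of the order in which the invocations completed.
// `completed` is taken by value so the caller's arrival order is untouched.
void resolveInOrder(std::vector<std::uint32_t>& target, std::uint32_t width,
                    std::vector<PendingWrite> completed) {
    std::stable_sort(completed.begin(), completed.end(),
                     [](const PendingWrite& a, const PendingWrite& b) {
                         return a.primitiveId < b.primitiveId;
                     });
    for (const PendingWrite& w : completed) {
        const std::size_t index = static_cast<std::size_t>(w.y) * width + w.x;
        assert(w.x < width && index < target.size());
        target[index] = w.color;  // opaque write: later primitive wins
    }
}
```

Under these assumptions, two invocations that target the same x, y coordinates yield the same final pixel whether the invocation for the earlier primitive finishes first or last.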
Certain application programming interfaces (APIs) such as DIRECTX (registered trademark of Microsoft Corporation) and OPENGL (registered trademark of Silicon Graphics, Inc.) may provide ordering for render target write operations, but may lack any such order guarantee for other read/write (R/W) resources such as unordered access views (DIRECTX) or images (OPENGL). Traditional ordering techniques may maintain a linked list of objects in external memory and use global atomic operations to guarantee serialized access to the list. Such an approach, however, may introduce significant costs in highly parallelized graphics architectures. For example, global memory atomics may involve the use of atomic counters or synchronization primitives that require atomic operations (e.g., mutual exclusion objects/mutexes), which may significantly impact performance. Additionally, atomic operations themselves may consume memory bandwidth and interfere with other input/output (I/O) requests.
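One possible form of such a linked-list technique may be sketched on a host processor as follows, with `std::atomic` standing in for global memory atomics. The `FragmentLists` class, its members, and the `kEnd` sentinel are hypothetical names introduced for this illustration; the sketch assumes one list per pixel, a fixed-size node pool, and a build phase that completes before any list is read.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel index marking the end of a list (hypothetical convention).
constexpr std::uint32_t kEnd = 0xFFFFFFFFu;

// Hypothetical per-pixel linked lists backed by one shared node pool.
class FragmentLists {
public:
    FragmentLists(std::size_t pixelCount, std::size_t capacity)
        : heads_(pixelCount), nodes_(capacity), counter_(0) {
        for (auto& head : heads_) head.store(kEnd);
    }

    // Appends a color to the list of `pixel`. Returns false when the node
    // pool is exhausted. Every caller contends on the single global
    // counter, which is the serialization cost discussed above.
    bool push(std::size_t pixel, std::uint32_t color) {
        assert(pixel < heads_.size());
        const std::uint32_t index = counter_.fetch_add(1);  // global atomic
        if (index >= nodes_.size()) return false;
        nodes_[index].color = color;
        // Atomically swap in the new head, then link to the previous head.
        nodes_[index].next = heads_[pixel].exchange(index);
        return true;
    }

    // Walks the list of `pixel`, most recently pushed first. Assumed to be
    // called only after all push() calls have completed.
    std::vector<std::uint32_t> collect(std::size_t pixel) const {
        assert(pixel < heads_.size());
        std::vector<std::uint32_t> colors;
        for (std::uint32_t i = heads_[pixel].load(); i != kEnd; i = nodes_[i].next) {
            colors.push_back(nodes_[i].color);
        }
        return colors;
    }

private:
    struct Node {
        std::uint32_t color;
        std::uint32_t next;
    };
    std::vector<std::atomic<std::uint32_t>> heads_;
    std::vector<Node> nodes_;
    std::atomic<std::uint32_t> counter_;
};
```

In this sketch, each insertion performs two atomic operations on shared memory (one on the global counter and one on the per-pixel head), so the number of atomic operations grows with the number of parallel invocations. The atomics serialize access to the list but, under concurrent insertion, do not by themselves fix the order in which entries appear.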