Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed function computational units to process graphics data; however, more recently, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data.
To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In an SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. A general overview of software and hardware for SIMT architectures can be found in Shane Cook, CUDA Programming, Chapter 3, pages 37-51 (2013) and/or Nicholas Wilt, CUDA Handbook: A Comprehensive Guide to GPU Programming, Sections 2.6.2 to 3.1.2 (June 2013).
Conventional techniques process atomic operations serially (e.g., one at a time) when an address match is detected among single instruction, multiple data (SIMD) message slots. In existing graphics data port (GDP) or shared local memory (SLM) controller designs, when multiple SIMD slots of an atomic message are mapped to the same address, the atomic operations are serialized (e.g., processed one per cycle). As a result, an SIMD 16 atomic message (of 16 slots) can take up to 16 cycles to complete all of its atomic operations. Such conventional techniques are inefficient and costly in terms of processing speed and system resources.
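The serialization cost described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of any actual GDP or SLM controller: the function name `serialized_atomic_add`, the dictionary standing in for memory, and the cost model (one cycle per slot among slots that share an address, with slots at distinct addresses assumed to proceed in parallel) are all hypothetical and chosen for illustration.

```python
from collections import Counter


def serialized_atomic_add(memory, addresses, values):
    """Apply one atomic-add message slot by slot.

    Returns (old_values, cycles). Both the dict-based memory and the
    cycle model are illustrative assumptions, not a real controller design.
    """
    old_values = []
    for addr, val in zip(addresses, values):
        current = memory.get(addr, 0)
        old_values.append(current)      # each slot receives the pre-update value
        memory[addr] = current + val    # read-modify-write for this slot
    # Assumed cost model: slots hitting the same address are processed one
    # per cycle; slots with distinct addresses are assumed to run in parallel.
    cycles = max(Counter(addresses).values()) if addresses else 0
    return old_values, cycles


# Worst case: all 16 slots of an SIMD 16 message target the same address.
conflict_memory = {0x40: 0}
conflict_old, conflict_cycles = serialized_atomic_add(
    conflict_memory, [0x40] * 16, [1] * 16
)

# Best case under the same assumed model: 16 distinct addresses.
distinct_memory = {}
distinct_old, distinct_cycles = serialized_atomic_add(
    distinct_memory, list(range(16)), [1] * 16
)
```

Under this assumed model, the fully conflicting message needs 16 cycles while the message with 16 distinct addresses needs one, which is the gap the passage above characterizes as inefficient.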