Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed function computational units to process graphics data. However, more recently, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data.
To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. A general overview of software and hardware for SIMT architectures can be found in Shane Cook, CUDA Programming, Chapter 3, pages 37-51 (2013).
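The lockstep behavior described above can be illustrated with a small toy simulation. The sketch below (in Python, with entirely hypothetical names; it does not model any vendor's actual hardware or API) shows a "warp" of lanes issuing the same instruction stream together: when lanes diverge at a branch, both paths are issued with some lanes masked off, so utilization drops below 100%.

```python
# Toy SIMT model: one "warp" of lanes executes the same instruction
# stream in lockstep. Divergent branches are handled by masking lanes,
# so both paths are issued and utilization drops. All names here are
# illustrative, not a real GPU API.

WARP_SIZE = 4

def run_warp(values):
    """Each lane doubles even inputs and negates odd ones."""
    results = [None] * WARP_SIZE
    issued = 0          # instructions issued by the warp as a whole
    active_lanes = 0    # lane-instructions that did useful work

    # Step 1 (uniform): every lane computes its predicate together.
    mask_even = [v % 2 == 0 for v in values]
    issued += 1
    active_lanes += WARP_SIZE

    # Step 2 (divergent): the warp issues the "even" path with only
    # the even lanes enabled...
    for lane in range(WARP_SIZE):
        if mask_even[lane]:
            results[lane] = values[lane] * 2
    issued += 1
    active_lanes += sum(mask_even)

    # ...then the "odd" path with the remaining lanes enabled.
    for lane in range(WARP_SIZE):
        if not mask_even[lane]:
            results[lane] = -values[lane]
    issued += 1
    active_lanes += WARP_SIZE - sum(mask_even)

    utilization = active_lanes / (issued * WARP_SIZE)
    return results, utilization

res, util = run_warp([1, 2, 3, 4])
print(res)   # [-1, 4, -3, 8]
print(util)  # 2/3: divergence wastes one third of the issue slots
```

This is why SIMT programs try to keep threads within a group on the same control path: every divergent branch serializes the paths and idles masked lanes.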
Increasingly, graphics processors are being implemented within global shared memory (GSM) systems coupled via an interconnect fabric that communicates between multiple processing nodes. Current fabric designs implement synchronous data transfers, which provide more deterministic behavior in small graphics processor designs. However, as graphics processors become larger, and significantly more data has to be transmitted over a large die, the cost of transmitting that data from source to destination is increasing.
The majority of the additional cost is attributed to the need to flop the contents at every clock edge to provide simple and deterministic behavior. However, with large graphics processors and higher clock speeds, the interval between flop stages is reduced while flop counts are increasing. Such costs have a direct impact not only on die area, but also on the power of the GPU, which is spent purely on carrying data around.
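The scaling pressure described above can be made concrete with a rough back-of-the-envelope model. The sketch below is purely illustrative: the function name, the signal-reach constant, and all the numbers are assumptions for demonstration, not figures from any actual design. It shows how the number of pipeline flops on a synchronous on-die link grows with both wire length and clock frequency, since a faster clock shortens the distance a signal can cover per cycle and forces more flop stages, each of which stores the full bus width every cycle.

```python
import math

def flops_for_link(wire_length_mm, clock_ghz, bus_width_bits,
                   reach_mm_per_ns=2.0):
    """Estimate pipeline flops needed on a synchronous on-die link.

    Illustrative model only: assume a signal travels `reach_mm_per_ns`
    millimeters per nanosecond, so the distance coverable in one clock
    period shrinks as frequency rises, the wire must be broken into
    more flop stages, and each stage latches the full bus width.
    """
    period_ns = 1.0 / clock_ghz
    reach_per_cycle_mm = reach_mm_per_ns * period_ns
    stages = math.ceil(wire_length_mm / reach_per_cycle_mm)
    return stages * bus_width_bits

# A short link on a small die at a modest clock versus a long link on
# a large die at a higher clock (all values hypothetical).
small = flops_for_link(wire_length_mm=5, clock_ghz=1.0, bus_width_bits=256)
large = flops_for_link(wire_length_mm=30, clock_ghz=2.0, bus_width_bits=256)
print(small, large)  # 768 vs. 7680 flops: a 10x increase
```

Under these assumed numbers, tripling the clocked distance and doubling the frequency multiplies the flop count tenfold, which is the die-area and power cost the passage attributes to simply carrying data around.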