Technical Field
This disclosure relates generally to graphics processors and more specifically to programmable shader architecture.
Description of the Related Art
Graphics processing often involves executing the same instruction in parallel for different graphics elements (e.g., pixels or vertices). Further, the same group of graphics instructions is often executed multiple times (e.g., to perform a particular function for different graphics elements or for the same graphics elements at different times). Graphics processors (GPUs) are often included in mobile devices such as cellular phones, wearable devices, etc., where power consumption and processor area are important design concerns.
Graphics units typically utilize memory hierarchies in which caches and local memories are dedicated to particular processing elements, while higher-level caches and memories are shared among multiple processing elements. Further, some memories (such as system memory) may be shared with non-graphics processing elements such as a central processing unit (CPU). Some graphics architectures allow out-of-order memory accesses to occur and may cache data at various levels in shared or local caches. Hardware to enforce memory consistency in these architectures may consume considerable power and may restrict performance.
Speaking generally, graphics work may be conceptually divided into three types: vertex tasks, pixel tasks, and compute tasks. Vertex processing involves the use of polygons to represent images, where vertices define the polygons. The output of vertex shading is typically rasterized to generate fragment information, which is operated on by pixel/fragment shaders to generate pixel data for output to a display. Compute processing involves other auxiliary tasks such as generating light lists, generating mipmaps (or other reduction algorithms), etc.
Typically, a graphics render pass (e.g., for pixel shader threads producing pixel data) is performed by a shader core using its local memory. Once the render is done, the results are written to device memory and become available to other cores or processing elements. Therefore, compute tasks are traditionally performed between renders, using data in the device memory. Said another way, device memory is traditionally used to share data between compute tasks and pixel rendering tasks. Access to device memory may be a bottleneck in GPU designs, however, and may consume substantial power.
GPUs are typically expected to perform fragment/pixel operations in a particular order (the order in which the operations were submitted to the GPU by a graphics program). Some GPUs are configured to generate “pass groups” of fragments, within which the order of operations does not matter (e.g., fragments in a given pass group typically do not overlap in screen space, so operations within the group do not affect each other). Pass groups are traditionally ordered such that older pass groups complete before younger pass groups execute. Enforcing pass group ordering may be a performance bottleneck and may consume significant power.
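For illustration only, the pass group concept described above might be sketched as a greedy partition of primitives in submission order. This is a minimal sketch under stated assumptions, not a description of any particular GPU: the function name build_pass_groups, the representation of each primitive as a set of covered pixel coordinates, and the greedy grouping policy are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Pixel = Tuple[int, int]  # hypothetical (x, y) screen-space coordinate


def build_pass_groups(primitives: Iterable[Set[Pixel]]) -> List[List[int]]:
    """Greedily partition primitives (in submission order) into pass groups.

    Each primitive is modeled as the set of pixels it covers (an assumption
    made for this sketch). A new group is started whenever a primitive
    overlaps pixels already covered by the current group, so primitives
    within one group never overlap in screen space.
    """
    groups: List[List[int]] = []
    covered: Set[Pixel] = set()  # pixels touched by the current group
    for index, pixels in enumerate(primitives):
        if not groups or covered & pixels:
            # Overlap with the current group (or no group yet): start a new one.
            groups.append([])
            covered = set()
        groups[-1].append(index)
        covered |= pixels
    return groups


if __name__ == "__main__":
    submitted = [{(0, 0)}, {(1, 0)}, {(0, 0), (2, 0)}, {(3, 0)}]
    print(build_pass_groups(submitted))
```

In this sketch, primitives 0 and 1 touch different pixels and share a group, while primitive 2 overlaps primitive 0 and therefore opens a younger group. Only the order between groups needs to be preserved; the order within a group is free.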