An exemplary embodiment of the disclosed technology relates generally to graphics processing units (GPUs) and particularly to techniques that may reduce the electrical power consumption of a GPU when the GPU processes graphics or image data.
As is known in the art, a graphics processing unit is a dedicated hardware module designed to render 2-dimensional (2D) and/or 3-dimensional (3D) computer-generated images for display on a computer screen or on a display device. GPUs are built with a highly pipelined structure and they may consume less electrical power than typical general purpose central processing units (CPUs) for various computationally intensive applications such as, but not limited to, video games, image visualization, graphics, graphical user interfaces, etc. A GPU may perform various graphics operations to render an image.
As known by those skilled in the art, a preferable way to design a GPU is to rely on a pipeline approach to generate graphics data which can be output, for example, to a display device. A typical graphics pipeline includes a number of stages, with the output from one stage possibly being used at another stage, the output of the latter stage possibly being used by a third stage, and so forth. A typical graphics pipeline comprises a geometry preparation stage, a vertex shader stage, a primitive assembly generation stage, a position transformations stage, a primitive setup stage, a rasterization stage, an attribute setup stage, a hidden pixel rejection stage, a fragment shader stage, and a pixel blending stage.
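By way of illustration only, the staged structure described above can be sketched in Python as a chain in which each stage consumes the output of the preceding stage. The stage names are taken from the list above; the helper names `make_stage` and `run_pipeline`, and the placeholder behavior of each stage (merely recording its name), are hypothetical and do not represent actual GPU processing.

```python
# Stage names as listed in the text; what each stage does here is a placeholder.
PIPELINE_STAGES = [
    "geometry preparation",
    "vertex shader",
    "primitive assembly generation",
    "position transformations",
    "primitive setup",
    "rasterization",
    "attribute setup",
    "hidden pixel rejection",
    "fragment shader",
    "pixel blending",
]


def make_stage(name):
    """Return a hypothetical stage that only records that it ran."""
    def stage(data):
        return data + [name]
    return stage


def run_pipeline(data, stage_names=PIPELINE_STAGES):
    """Feed the output of each stage into the next stage, in order."""
    for name in stage_names:
        data = make_stage(name)(data)
    return data


# The trace lists the stages in the order in which they processed the data.
trace = run_pipeline([])
```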
In one embodiment of a graphics pipeline, the two programmable stages of the graphics pipeline are the vertex and the fragment shader stages. However, different arrangements are also possible, e.g., the rasterization stage may also be programmable, or even the depth and color related operations may be controlled by a programmable hardware engine. The two stages may be programmed in a general purpose software language (such as C or Fortran) or in an application specific graphics language such as HLSL, Cg, or GLSL.
As is known in the art, the vertex and the fragment stages are typically programmed with small custom shading programs (similar to subroutines) that are invoked for each vertex and for each pixel fragment. Those small, but computationally and memory intensive, programs are usually referred to as shaders or shading programs, although other terms can be used. The term shading program will be used hereafter.
An exemplary aspect does not necessarily pertain to a specific arrangement of the programmable stages of the graphics processing pipeline and it is more generally applicable. In particular, the disclosed technology is applicable in specific arrangements of the GPU in which the vertex and the fragment shading programs are executed by the same shader unit or by the same array of shader units (an arrangement known in the art as unified shaders). Furthermore, the disclosed technology is applicable in arrangements of the GPU in which fragment shading programs are executed by a dedicated fragment shader unit (or an array of fragment shader units) and the vertex shading programs are executed by dedicated vertex shader units (or an array of vertex shader units). In addition, the disclosed technology is not restricted to a particular shading or general purpose programming language.
The present technology at least provides a method to bypass the vertex shading unit(s) and/or the fragment shading unit(s) and assign the vertex and/or the fragment shading operations to another unit or units, more preferably to the pixel blender unit. The bypass decisions may be taken by the GPU compiler or the GPU driver. A bypass decision may be based on a code-level analysis of the vertex and fragment shading programs to be executed, and may lead to electrical power reductions even if such an approach may decrease the rate at which the rendered frames are generated and/or stored to the frame buffer. Electrical power reductions may be achieved because a programmable pixel blender is typically a less complex circuit with a significantly smaller instruction set than a typical vertex, fragment, and/or unified shader unit.
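By way of illustration only, one possible form of such a code-level analysis is sketched below in Python. The sketch assumes that a shading program can be summarized as a list of opcode names and that a bypass is allowed only when every opcode is available on the pixel blender and the program is short enough. The opcode names, the set `BLENDER_OPS`, the limit `MAX_BLENDER_PROGRAM_LENGTH`, and the function `can_bypass_shader` are all hypothetical and are not taken from any actual compiler or driver.

```python
# Hypothetical instruction set of a programmable pixel blender (assumption).
BLENDER_OPS = {"mov", "add", "sub", "mul", "mad", "ld", "st"}

# Hypothetical limit on the program length the blender can hold (assumption).
MAX_BLENDER_PROGRAM_LENGTH = 32


def can_bypass_shader(shader_ops):
    """Return True if the shading program could run on the blender.

    The check is purely static: the program must fit within the length
    limit and use only opcodes that the blender is assumed to support.
    """
    if len(shader_ops) > MAX_BLENDER_PROGRAM_LENGTH:
        return False
    return all(op in BLENDER_OPS for op in shader_ops)


# A short program using only blender opcodes is a bypass candidate.
simple_shader = ["ld", "mul", "add", "st"]
# A program using transcendental opcodes stays on the shader unit.
complex_shader = ["ld", "sin", "pow", "mul", "st"]
```

A practical compiler or driver would likely weigh further criteria, such as the expected frame rate decrease against the expected power saving; the sketch shows only the instruction set test.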
In accordance with one or more embodiments, the pixel blender can be, for example, a multi-threaded, multi-format pixel blender as described in U.S. Publication No. 2013/0169658, entitled “Multi-threaded multi-format blending device for computer graphics operations”, the contents of which are incorporated herein by reference in their entirety.
Image blending has been used since the start of motion picture generation (See U.S. Pat. No. 1,262,954). Blending has been part of computer-based image processing since its origins (See U.S. Pat. Nos. 4,384,338, 4,679,040, and 4,827,344).
Original blender implementations were based on image multiplexing at the output to a screen via analog circuitry or on software programs running on standard processors. This method is suitable for applications where high-speed software processing resources are available or where there is no high-speed requirement for the generation of the output images, as is the case with photograph editing.
In order to be able to process blending in real-time systems, a hardware blender is required. Methods that implement blending in hardware have been proposed as described in the following paragraphs:
One of the first architectures of a blending apparatus was suggested in U.S. Pat. No. 5,592,196. This apparatus includes instructions for implementing the blending functions. These instructions are included in tables which form a blending mode, making the method fast but not as flexible as a fully programmable approach.
A hardware implementation of blending targeted explicitly to 3D graphics has been disclosed in U.S. Pat. No. 5,754,185. This method did not include any programmability mechanism but rather defined the blending mode via control signals.
Another hardware implementation is described in U.S. Pat. No. 5,896,136. This description mentions a unit that implements blending equations by using an alpha channel of lower resolution than the RGB channels.
U.S. Pat. No. 7,397,479 discloses a method for providing a programmable combination of pixel characteristics.
Methods for implementing programmable blending were disclosed in U.S. Publication No. US 2006/192788 and U.S. Pat. No. 7,973,797. In both cases, the instructions for blending are provided by a processing unit that loads formula or operation descriptors as a sequence to be executed by the blending hardware.
An apparatus for bypassing the fragment shaders in a GPU and assigning the fragment shading workload to a pixel blender is presented in U.S. Pat. No. 8,325,184 B2. That approach is applicable only in GPUs following the unified shader approach and it requires significant modifications to the existing register file(s) of the shader cores.
Blending in the above referenced cases is defined as the process of generating a target pixel fragment value (T) by combining various inputs: a source pixel fragment (S), a destination pixel fragment (D), and corresponding alpha values (As, Ad) for the source and destination pixels. Depending on the blending mode, a different function (f) is applied in order to calculate the target.
For calculating the target (T=f(S, As, D, Ad)), an arithmetic and logical unit (ALU) is employed that uses the inputs and the blending mode in order to produce the target value. For many blending modes, computing the formula in a single operation requires complex hardware. In order to minimize hardware by using simpler operators, the outputs can re-enter the ALU a second time or more until the formula is calculated.
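By way of illustration only, the relation T=f(S, As, D, Ad) and the multi-pass use of a simple ALU can be sketched in Python for one familiar blending mode, namely non-premultiplied "source over" blending, T = S*As + D*(1 - As). The choice of this mode, the set of ALU operations, and the function names `simple_alu` and `blend_source_over` are assumptions made for the example; in this particular mode Ad is accepted as an input but not used.

```python
def simple_alu(op, x, y):
    """A minimal ALU that performs exactly one simple operation per pass."""
    if op == "mul":
        return x * y
    if op == "add":
        return x + y
    if op == "sub":
        return x - y
    raise ValueError("unsupported operation: " + op)


def blend_source_over(s, a_s, d, a_d):
    """Compute T = S*As + D*(1 - As) by passing through the ALU four times.

    Returns the target value and the number of ALU passes that were needed.
    a_d is part of the general signature f(S, As, D, Ad) but unused here.
    """
    passes = 0
    src_term = simple_alu("mul", s, a_s)       # S * As
    passes += 1
    inv_alpha = simple_alu("sub", 1.0, a_s)    # 1 - As
    passes += 1
    dst_term = simple_alu("mul", d, inv_alpha) # D * (1 - As)
    passes += 1
    target = simple_alu("add", src_term, dst_term)
    passes += 1
    return target, passes


# A half-transparent white source over a black destination.
target, passes = blend_source_over(1.0, 0.5, 0.0, 1.0)
```

Each additional pass keeps the operators simple at the cost of occupying the ALU for more cycles per pixel fragment.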
During this iterative process the blender cannot receive new inputs; thus, complex blending modes result in lower overall throughput of the GPU. One method to achieve higher throughput is to implement the ALU as a pipeline of at least two threads. If the input pixel fragments can be provided in a continuous flow, the pipeline can produce one output per clock cycle.
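By way of illustration only, the throughput difference can be estimated with an idealized cycle count, sketched below in Python. The model assumes that every ALU pass or pipeline stage takes exactly one clock cycle and that the input flow never stalls; the function names are hypothetical, and the pipelined count uses the standard fill-then-stream formula (depth + N - 1) rather than figures from any particular device.

```python
def cycles_iterative(num_fragments, passes_per_fragment):
    """Cycles when the blender is blocked until each fragment finishes all passes."""
    return num_fragments * passes_per_fragment


def cycles_pipelined(num_fragments, pipeline_depth):
    """Cycles for an ideal pipeline fed with a continuous flow of fragments.

    The first result appears after pipeline_depth cycles; after that, one
    result is produced per clock cycle.
    """
    if num_fragments == 0:
        return 0
    return pipeline_depth + num_fragments - 1


# 100 fragments with a four-step blending formula.
iterative = cycles_iterative(100, 4)
pipelined = cycles_pipelined(100, 4)
```

Under these assumptions the pipelined arrangement approaches one fragment per cycle as the number of fragments grows, whereas any gap in the input flow would add idle cycles that this simple model ignores.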
The current state of the art in color blending devices as described above provides fast and programmable functionality. Many different operations, from a predefined set, can be performed on sequences of pixel fragments, where each pixel is represented as a combination of a color (c, usually R, G, B) and an alpha (α) value.
One shortcoming of current implementations is that they are best suited for systems where the locations of subsequent pixel fragments are more or less continuous. In a modern GPU system, shader unit processing and communication to the main memory for reading and writing pixel fragments are a bottleneck. Thus, the system cannot generate a steady flow of continuous pixel fragments.
Another limitation is that most current implementations operate on integer or fixed-point representations. This makes it harder to interface with floating-point pixel sources and frame buffers. Furthermore, this limits the dynamic range of color representation for each pixel fragment.
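By way of illustration only, the loss of dynamic range can be shown in Python with an 8-bit unsigned normalized representation, in which a channel value in [0, 1] is stored as an integer code from 0 to 255. The choice of 8 bits and the helper names `to_unorm8` and `from_unorm8` are assumptions made for the example.

```python
def to_unorm8(value):
    """Quantize a floating-point channel value to an 8-bit code (0..255).

    Values outside [0, 1] are clamped, so brightness above 1.0 is lost.
    """
    clamped = min(max(value, 0.0), 1.0)
    return int(clamped * 255.0 + 0.5)


def from_unorm8(code):
    """Convert an 8-bit code back to a floating-point channel value."""
    return code / 255.0


# A bright value above 1.0 saturates at the maximum code.
bright = to_unorm8(2.5)
# A very dark value falls below the smallest step (1/255) and becomes zero.
dark = to_unorm8(0.001)
```

A floating-point pixel source can carry both of these values, so an integer-only blender would need a conversion step that discards part of the information.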
Yet another limitation of most current solutions is that the programmability is constrained by a few predefined operators. In one case only (U.S. Pat. No. 7,973,797), the operation is guided by two instructions which can be configured by other entities in the GPU. A more flexible approach is required for full programmability, where any sequence of instructions including flow control can be provided as input in the form of a small program for the blender core.
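By way of illustration only, a small blender program with flow control can be sketched in Python as a list of instructions executed by a minimal interpreter. The instruction names (`mov`, `add`, `sub`, `mul`, `jlt`, `end`), the register names, and the function `run_blender_program` are hypothetical and do not correspond to the instruction set of any actual blender core. The example program skips the blending arithmetic when the source alpha is below a threshold and otherwise computes T = S*As + D*(1 - As).

```python
def run_blender_program(program, regs):
    """Execute a hypothetical blender program on a dictionary of registers."""
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "end":
            break
        if op == "jlt":
            # Conditional jump: go to target if regs[a] < regs[b].
            a, b, target = args
            if regs[a] < regs[b]:
                pc = target
                continue
        elif op == "mov":
            dst, a = args
            regs[dst] = regs[a]
        elif op == "add":
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":
            dst, a, b = args
            regs[dst] = regs[a] - regs[b]
        elif op == "mul":
            dst, a, b = args
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError("unknown instruction: " + op)
        pc += 1
    return regs


PROGRAM = [
    ("jlt", "As", "thr", 6),     # 0: nearly transparent source, skip blending
    ("mul", "T", "S", "As"),     # 1: T = S * As
    ("sub", "tmp", "one", "As"), # 2: tmp = 1 - As
    ("mul", "tmp", "D", "tmp"),  # 3: tmp = D * (1 - As)
    ("add", "T", "T", "tmp"),    # 4: T = S*As + D*(1 - As)
    ("end",),                    # 5: done
    ("mov", "T", "D"),           # 6: T = D (destination kept unchanged)
]


def blend(s, a_s, d):
    """Run PROGRAM for one fragment and return the target value T."""
    regs = {"S": s, "As": a_s, "D": d, "one": 1.0, "thr": 0.01,
            "T": 0.0, "tmp": 0.0}
    return run_blender_program(PROGRAM, regs)["T"]
```

For example, `blend(1.0, 0.5, 0.0)` follows the arithmetic path, while `blend(1.0, 0.0, 0.25)` takes the jump and returns the destination value unchanged.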
All existing implementations support the RGBA color scheme that is very common in computer graphics; each pixel fragment is represented by three color channels of Red, Green and Blue (RGB) and an Alpha channel (A). However, if one has to blend non-RGBA pixel fragments (for example pixels in YUVA representation commonly used in video and photography), there needs to be another step of color space conversion, consuming time and bandwidth.
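By way of illustration only, the additional color space conversion step can be sketched in Python as follows. The sketch assumes full-range BT.601 coefficients, a luma value Y in [0, 1], and chroma values U and V centered on zero in [-0.5, 0.5]; other YUV variants use different coefficients and offsets. The function name `yuva_to_rgba` is hypothetical.

```python
def yuva_to_rgba(y, u, v, a):
    """Convert one YUVA pixel fragment to RGBA (full-range BT.601 assumed).

    The alpha channel is passed through unchanged; the color channels
    are clamped to [0, 1] after conversion.
    """
    def clamp(c):
        return min(max(c, 0.0), 1.0)

    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    return (clamp(r), clamp(g), clamp(b), a)


# With zero chroma the result is a neutral gray equal to the luma value.
gray = yuva_to_rgba(0.5, 0.0, 0.0, 1.0)
```

Performing these multiplications and additions for every pixel fragment before blending is the extra time and bandwidth cost referred to above.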
The inventors have found that a multi-threaded, multi-format programmable pixel blender that can be programmed to execute memory operations, i.e., instructions that load or store data or texture data from or to computer memory, is a beneficial circuit for performing blending operations as well as other, non-blending operations. Such a blending device can reduce electrical power consumption by executing a number of vertex and/or fragment shading programs, i.e., the GPU shading units are bypassed. A particularly preferred arrangement of a pixel blender device for these operations is disclosed in U.S. Publication No. 2013/0169658.
All the patents and patent applications referenced above are incorporated herein by reference in their entirety.