The present invention relates in general to computing systems, and in particular to multiprocessor computing systems in which heterogeneous processors cooperate to process data.
Graphics processing subsystems are designed to render realistic animated images in real time, e.g., at 30 or more frames per second. These subsystems are most often implemented on expansion cards that can be inserted into appropriately configured slots on a motherboard of a computer system and generally include a dedicated graphics processing unit (GPU) and dedicated graphics memory. The typical GPU is a highly complex integrated circuit device optimized to perform graphics computations (e.g., matrix transformations, scan-conversion and/or other rasterization techniques, texture blending, etc.), write the resulting pixels to the graphics memory, and deliver the pixels in real time to a display device. The GPU operates in response to commands received from a driver program executing on a central processing unit (CPU) of the system.
As is known in the art, it can be useful to produce a “family” of GPU designs. Within a family, each GPU is designed on a common underlying architecture, but each GPU design in the family has distinct features. For instance, different designs in the same family may provide different degrees of parallelism, allowing end users or other system builders to choose the trade-off between performance and power consumption and/or chip area that best suits a particular intended use.
Different GPU designs in the family usually include some differences that affect the GPU's processing behavior in terms of command sets and/or results. For example, a bug present in an earlier design in the family might be corrected in a later design. Various optimizations can also be introduced, e.g., in the order of operations in a compound operation, that can affect the results of floating-point computations. The internal precision used in some or all of the functional units might also be changed (e.g., to reduce rounding errors or to save chip area). The later designs might also include support for additional commands beyond the command set supported by earlier designs. Any of these or other design differences can result in different GPUs in the same family producing different results, even if the same command sequence is delivered to each GPU. Such differences in results are usually (though not necessarily) minor, appearing in the least significant bits of floating-point numbers and having little or no effect on the final rendered image.
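By way of illustration only, the effect of operation order on floating-point results can be sketched in a few lines. The sketch below uses ordinary 64-bit floating-point arithmetic on a CPU rather than any actual GPU functional unit, and the labels "design A" and "design B" are hypothetical names introduced here, not designs described above.

```python
# Same three operands, summed in two different orders. Floating-point
# addition is not associative, so the two orders can round differently.
a, b, c = 0.1, 0.2, 0.3

design_a_sum = (a + b) + c  # order a hypothetical "design A" might use
design_b_sum = a + (b + c)  # order a hypothetical "design B" might use

# The results disagree, but only in the least significant bits.
results_differ = design_a_sum != design_b_sum
gap = abs(design_a_sum - design_b_sum)

print(design_a_sum, design_b_sum, results_differ, gap)
```

The two sums differ by roughly one unit in the last place, which is the kind of minor discrepancy described above: invisible in most contexts, yet sufficient to defeat a bit-exact comparison.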
Recently, there has been interest in multi-chip graphics processing subsystems that include two or more GPUs operating in parallel. Such systems typically support a “distributed rendering” mode, in which rendering tasks are distributed among the GPUs. Tasks may be distributed in various ways, such as by assigning different GPUs to render different portions of an image or different images in a specific sequence. For example, in a “split frame rendering” mode, each GPU generates pixel data for a different portion of the displayable image, such as a number of lines of a raster-based display. The image is displayed by scanning out the pixel data from each GPU's display buffer and selecting a pixel generated by one or another of the GPUs depending on screen position. As another example, in an “alternate frame rendering” mode, each GPU generates pixel data for a different image in a temporal sequence (e.g., different frames of an animated image such as a 3D video game).
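The two distribution modes just described can be sketched as simple assignment rules. This is a minimal illustration under stated assumptions: the function names, the equal-sized bands of scanlines, and the round-robin frame assignment are hypothetical choices made here, and an actual system may divide the work differently.

```python
def split_frame_owner(y, height, num_gpus):
    """Index of the GPU that renders scanline y in split frame rendering.

    Assumes the image is cut into equal horizontal bands, one per GPU.
    """
    lines_per_gpu = -(-height // num_gpus)  # ceiling division
    return y // lines_per_gpu


def alternate_frame_owner(frame_index, num_gpus):
    """Index of the GPU that renders a given frame in alternate frame
    rendering, assuming simple round-robin assignment."""
    return frame_index % num_gpus


def scan_out(display_buffers, height, width):
    """Assemble the displayed image by selecting, for each screen position,
    the pixel from the display buffer of the GPU that owns that scanline."""
    num_gpus = len(display_buffers)
    return [
        [display_buffers[split_frame_owner(y, height, num_gpus)][y][x]
         for x in range(width)]
        for y in range(height)
    ]
```

For example, with two GPUs and a 480-line display under these assumptions, lines 0 through 239 would come from the first GPU's display buffer and lines 240 through 479 from the second.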
Rendering work can also be divided among GPUs in other ways. For instance, many video game programs implement a “Z-first” rendering approach in which each primitive is processed twice. In the first pass, a depth buffer (also referred to as a Z buffer) is populated with the depth coordinate (computed in viewing space) for the closest visible primitive at each fragment location. In the second pass, the depth coordinate for each primitive is recomputed, and fragment shading algorithms are executed for the primitive only if its depth coordinate matches the stored depth coordinate at a given fragment. In a multiprocessor graphics system, one GPU can perform the first pass, then deliver the populated Z buffer to a second GPU, which performs the Z-match test and fragment shading.
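The two passes of the Z-first approach can be sketched as follows. This is a simplified model, not a description of any particular GPU: fragments are represented as hypothetical `(location, depth, primitive)` tuples, smaller depth values are assumed to be closer to the viewer, and "shading" is reduced to recording which primitive was selected.

```python
def depth_pass(fragments):
    """First pass: populate the Z buffer with the closest depth at each
    fragment location. Smaller depth is assumed to mean closer."""
    z_buffer = {}
    for location, depth, _primitive in fragments:
        if location not in z_buffer or depth < z_buffer[location]:
            z_buffer[location] = depth
    return z_buffer


def shading_pass(fragments, z_buffer):
    """Second pass: shade a fragment only if its depth matches the stored
    depth exactly (the Z-match test)."""
    shaded = {}
    for location, depth, primitive in fragments:
        if depth == z_buffer[location]:
            shaded[location] = primitive  # stand-in for fragment shading
    return shaded


# Two primitives overlap at (0, 0); only the nearer one should be shaded.
fragments = [((0, 0), 0.5, "near"), ((0, 0), 0.9, "far"), ((1, 0), 0.7, "only")]
z_buffer = depth_pass(fragments)
shaded = shading_pass(fragments, z_buffer)
```

In the multiprocessor arrangement described above, `depth_pass` would run on one GPU and `shading_pass` on another, with the populated Z buffer transferred between them.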
To the extent that the GPUs in a distributed rendering system do not produce identical results for the same operations, visible artifacts may occur in the rendered images. For instance, in split-frame rendering, if the final pixel colors are not identical, a seam may be visible along the line where one GPU's portion of the image ends and another GPU's begins. In Z-first rendering, if the depth coordinates calculated by the two GPUs are not identical, primitives that should be used for shading in the second pass might fail the Z-match test, or a primitive that should not be used might pass the Z-match test and be used. Either way, visible artifacts (e.g., incorrectly shaded pixels in small or large regions of the image) can result.
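A Z-match failure of this kind can be reproduced in a small sketch. The assumption made here, purely for illustration, is that one hypothetical GPU holds the depth result at 32-bit precision while the other holds it at 64-bit precision; the function names are likewise hypothetical, and real design differences may be of a different nature.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def depth_on_gpu_a(numerator, denominator):
    # Hypothetical first GPU: depth result held at 32-bit precision.
    return to_float32(numerator / denominator)


def depth_on_gpu_b(numerator, denominator):
    # Hypothetical second GPU: depth result held at 64-bit precision.
    return numerator / denominator


stored = depth_on_gpu_a(1.0, 10.0)      # first pass populates the Z buffer
recomputed = depth_on_gpu_b(1.0, 10.0)  # second pass recomputes the depth

z_match = recomputed == stored                        # fails across GPUs
same_gpu_match = depth_on_gpu_a(1.0, 10.0) == stored  # passes on one GPU
depth_gap = abs(recomputed - stored)
```

The depths agree to about eight decimal places, yet the exact-equality test fails, so the primitive that should have been shaded at that fragment would be skipped.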
The requirement for identical results limits the ability to use heterogeneous GPUs (i.e., GPUs that have at least some design differences) for distributed rendering. Each heterogeneous combination needs to be individually tested to ensure that results from both GPUs in all relevant cases are identical down to the last bit. Such testing can be prohibitively expensive, and consequently, manufacturers generally choose to restrict the use of heterogeneous GPUs for distributed rendering. This restriction, in turn, limits the ability of users to upgrade and reconfigure existing single-chip systems for distributed rendering or to develop a distributed rendering solution optimized for a particular task.
It would therefore be desirable to provide a family of processors within which heterogeneous processors can be used cooperatively for distributed processing tasks, such as distributed rendering, while guaranteeing compatible results.