Since the widespread introduction of graphical user interfaces in computers more than 15 years ago, special-purpose graphics accelerators have been an integral component of desktop computer systems. Recently, 3D applications such as computer gaming, CAD, and visualization have been pushing high-performance 3D acceleration from specialty workstations into mainstream PCs. The demand for increased 3D performance is currently insatiable. Single-chip 3D accelerators today are 50 times faster than those available 3–4 years ago. Each generation substantially improves the quality of the images, and yet another factor of 100 still would not produce truly realistic scenes.
As in all computing systems, achieving such rapid increases in performance requires improving the microarchitecture of each generation to take increasing advantage of parallelism. Over the past few years, 3D graphics engines have moved from 32-bit to 64-bit to 128-bit memory buses in much the same way that microprocessors grew from 4-bit to 8-bit to 16-bit and eventually to 64-bit buses. However, this advancement in 3D graphics engines has diminishing returns, especially as applications move toward a larger number of primitives for each scene to be displayed.
Current high-performance microprocessors all use instruction-level parallelism (ILP) to further increase performance. ILP exploits information about dependences between instructions to allow parallel execution of multiple instructions while maintaining the execution semantics of a sequential program. Many different ILP mechanisms can be effectively employed to improve performance; however, dynamically scheduled, out-of-order, superscalar microprocessors are the commercially dominant microarchitecture at the present time.
This invention includes an approach for using dynamically scheduled, out-of-order, superscalar principles for increasing the available parallelism in a 3D graphics engine while maintaining sequential execution semantics. In the abstract, it would seem that graphics, and particularly 3D graphics, is a massively parallel application that would not need ILP technology for high performance. In fact, in many simple graphics applications, it would be possible to render each pixel independently; however, in practice, graphics applications have very similar characteristics to traditional sequential programs. Standard APIs like Direct3D or OpenGL are used to create graphics applications. These are translated via software drivers to a sequence of graphics primitives to be rendered by a graphics engine that consists of some combination of additional hardware and software. The programming model assumes that these primitives will be executed sequentially and atomically, in much the same manner that it is assumed that instructions in a traditional sequential ISA are executed.
Typically, 3D graphics applications will allow blocks of frame buffer memory to be directly read or written by the main processor. This requires that precise frame buffer state be available whenever a direct access executes. This creates a dependence between previous primitives and a direct read, or between subsequent primitives and a direct write. As in general-purpose computing systems, it is possible to build massively parallel systems that provide excellent performance on a limited set of applications that have been programmed with parallel execution in mind. However, in order to be compatible with a large existing base of software using widely accepted programming interfaces and programming styles, it is necessary to detect the dependences between graphics primitives, extract independent primitives from the instruction stream, and execute them concurrently. Therefore, in order to implement a parallel system for executing this sequence of primitives, the semantics of sequential execution must be maintained. More particularly, several factors cause dependences between graphics primitives that can prevent concurrent or parallel execution of primitives in a 3D graphics engine.
Z-buffering
Realistic 3D graphics usually include hidden surface removal. More specifically, objects that are behind other objects from the perspective of the viewer should not be visible in the final image. Typically, a Z-buffer is used to implement hidden-surface removal. The Z-buffer stores the distance from the viewpoint to a currently drawn pixel so that when drawing a new pixel, it can be determined if the new pixel is in front of or behind the currently drawn pixel. A well-implemented Z-buffering algorithm should produce the same result even if the triangles are drawn in a different order. However, if the primitives are executed concurrently, then two triangles may be drawn concurrently. If each primitive attempts to concurrently read the same Z-buffer value, modify it, and then write the new value to the Z-buffer using common read and write operations, incorrect results can be produced. A special type of dependence thus exists between any two primitives that must access the same Z-buffer value. Although they can execute in either order, there is currently no known process for executing these primitives concurrently.
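The hazard described above can be illustrated with a minimal sketch. The function names, the single-pixel buffers, and the convention that a smaller z value is nearer the viewer are assumptions made for illustration only; they are not taken from any particular graphics engine.

```python
def z_test_write(zbuf, cbuf, idx, z, color):
    """Atomic read-modify-write: keep the fragment nearest the viewer."""
    if z < zbuf[idx]:
        zbuf[idx] = z
        cbuf[idx] = color


def draw_sequential(fragments):
    """Draw fragments one after another into a one-pixel frame buffer."""
    zbuf, cbuf = [float("inf")], [None]
    for z, color in fragments:
        z_test_write(zbuf, cbuf, 0, z, color)
    return zbuf[0], cbuf[0]


def draw_interleaved(frag_a, frag_b):
    """Model two primitives whose reads both occur before either write."""
    zbuf, cbuf = [float("inf")], [None]
    old_a = zbuf[0]  # primitive A reads the stored depth
    old_b = zbuf[0]  # primitive B reads the same, soon stale, depth
    if frag_a[0] < old_a:
        zbuf[0], cbuf[0] = frag_a
    if frag_b[0] < old_b:  # compares against the stale value
        zbuf[0], cbuf[0] = frag_b
    return zbuf[0], cbuf[0]


near = (1.0, "red")   # (depth, color)
far = (5.0, "blue")

either_order_ok = draw_sequential([near, far]) == draw_sequential([far, near])
racy_result = draw_interleaved(near, far)
```

Under these assumptions, both sequential orders leave the nearer red fragment in the buffer, whereas the interleaved case leaves the farther blue fragment visible, which is the incorrect result the paragraph above refers to.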
Alpha-blending
Alpha-blending is an operation that uses a transparency value (alpha) to permit some portion of an occluded object to be visible through a foreground object. Unfortunately, the primitives that execute alpha-blending operations to make objects appear transparent must be executed in order. The foreground and the background must be executed in order to make the transparent effect appear correct on the image. Accordingly, if the 3D graphics engine does not maintain the semantics of the sequence in which primitives are executed, the image will be incorrect.
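The order sensitivity of alpha-blending can be shown with a short sketch. The blend formula used here, in which the result equals alpha times the source color plus (1 - alpha) times the destination color, is a commonly used form and is assumed for illustration; the colors and the name blend_over are hypothetical.

```python
def blend_over(src, alpha, dst):
    """Blend a source color over a destination color with opacity alpha."""
    return tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(src, dst))


background = (0.0, 0.0, 0.0)  # black, as (r, g, b)
red = (1.0, 0.0, 0.0)
blue = (0.0, 0.0, 1.0)

# Same two half-transparent primitives, drawn in the two possible orders.
red_then_blue = blend_over(blue, 0.5, blend_over(red, 0.5, background))
blue_then_red = blend_over(red, 0.5, blend_over(blue, 0.5, background))
```

The two orders yield different pixel colors, (0.25, 0.0, 0.5) versus (0.5, 0.0, 0.25), so swapping the execution order of two overlapping alpha-blended primitives changes the image.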
Dynamic Textures, Procedural Textures, Environment Mapping
In another feature of realistic 3D graphics, an image (called a texture) can be mapped onto another image; for example, objects may be shown reflected off of water displayed in the image. Often these textures have limited lifetimes. Procedural textures are created on-the-fly by program code. Dynamic textures are loaded into the graphics system memory space from some backing store for a limited time. Environment mapping is a technique for reflections in which the 3D object to be reflected is drawn and then copied as an image to be mapped onto a reflective surface. In each of these cases, there is a dependence between the primitives that create the texture and the primitives that render the reflective surface or polygon upon which the texture is projected. Once again, if the 3D graphics engine does not maintain the semantics of the sequence in which primitives are executed, the image will be incorrect.
2D BLITs
Often in 3D graphics, it is advantageous to be able to mix 2D block copy and drawing operations with 3D rendering. If overlapping 2D objects are read or written out of order, the resulting image is incorrect.
Direct Frame Buffer Access
Common graphics APIs allow blocks of frame buffer memory to be directly read from or written to at the same time by the processor. This requires that the precise state of the frame buffer be available and known whenever an access to the frame buffer memory executes. This creates a dependence between any previously executed primitives and a direct read, or between any future executed primitives and a direct write.
Generally, a 3D application creates a series of frames. A 3D graphics engine then identifies each of the objects in a frame and breaks the surface of the object down into a collection of triangles for processing. Typically, the processing and drawing of the pixels within these triangles are represented by a series of executable instructions, referred to as primitives, which are processed individually. Each triangle or primitive is specified by three vertices in a 3D space, one or more surface normals, and a description of how to draw the triangle's surface, i.e., texturing, alpha blending parameters, etc. Accordingly, from the point of view of a 3D graphics engine, a frame consists of a collection of triangles or primitives which are all processed and executed separately, thereby rendering the entire frame or image. The 3D graphics engine is responsible for processing each triangle or primitive and converting each of them into pixels, which when displayed render the entire 3D frame.
FIG. 1 illustrates a block diagram which shows a prior art 3D processing pipeline resident in a prior art 3D graphics engine. Generally, the graphics engine identifies the triangular coordinates for each primitive within the shared world space of the entire image, applies lighting to the triangles or primitives, transforms each triangle or primitive from the 3D space used by the application into 2D screen coordinates, and draws the appropriate pixels into the frame buffer (applying any shading, z-buffering, alpha blending, etc.).
Referring now more specifically to FIG. 1, a first stage in the pipeline is a world transform stage 105, in which the graphics engine converts the vertices and normals of the triangle from the real world object space, which may be different for each object in the scene, to the shared world space, which is the space shared by all of the objects to be rendered in the entire scene. This transform consists of a matrix-vector multiplication for each vertex and each normal. In a second stage of the pipeline, a lighting stage 110, the graphics engine takes the triangle's color and surface normal(s) and computes the effect of one or more light sources. The result is a color at each vertex. At the next stage in the pipeline, a view transform stage 115, the graphics engine converts the vertices from the world space to a camera space, with the viewer (or camera) at the center or origin and all vertices then mapped relative to that origin. Additionally, in the view transform stage 115, the graphics engine applies a matrix-vector multiplication to each vertex calculated for the camera space.
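The per-vertex matrix-vector multiplication used by the transform stages can be sketched as follows. The use of 4x4 matrices with homogeneous coordinates (x, y, z, 1) is a common convention assumed here for illustration, and the helper names and the translation matrix are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def translation(tx, ty, tz):
    """Build a 4x4 matrix that moves a point by (tx, ty, tz)."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_triangle(matrix, vertices):
    """Apply one matrix-vector multiplication to each vertex."""
    return [mat_vec(matrix, v) for v in vertices]


# A triangle in its own object space, as homogeneous points (x, y, z, 1).
triangle = [
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
]
object_to_world = translation(10.0, 0.0, -5.0)
world_triangle = transform_triangle(object_to_world, triangle)
```

In this sketch, each vertex is shifted by (10, 0, -5) into the shared world space. A view transform would be applied in the same manner with a different matrix.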
As further shown in FIG. 1, the next stage in the pipeline is a projection transform stage 120. At the projection transform stage 120, the graphics engine maps the vertices for the camera space to the actual view space. This includes the perspective transformation from 3D to 2D. Accordingly, at this point in the pipeline, the vertices are effectively two-dimensional points to which perspective effects (e.g., depth foreshortening) have been applied, and the third (z) coordinate is only needed to indicate the relative front-to-back ordering of the vertices when the objects are rendered or drawn within the view space. Like the other two transform stages in the pipeline, the projection transform stage requires the application of a matrix-vector multiplication for each vertex. In a clipping stage 125, the graphics engine clips the triangles or primitives to fit within the view space. Accordingly, the triangles or primitives which lie entirely off the side of the screen or behind the viewer are removed. Meanwhile, triangles or primitives which are only partially out of bounds are trimmed. This generally requires splitting the resulting polygon into multiple additional triangles or primitives and processing each one of these additional triangles or primitives separately. Finally, in a rasterization stage 130, the graphics engine converts those triangles to be displayed within the view space into pixels and computes the color value to be displayed at each pixel. This includes visible-surface determination (dropping pixels which are obscured by a triangle closer to the viewer), texture mapping, and alpha blending (transparency effects).
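The perspective transformation and the removal of triangles lying entirely outside the view space can be sketched as follows. This is a simplification under stated assumptions: the projection matrix is collapsed into a direct division by depth, the camera is assumed to look along the positive z axis, the visible region is assumed to span -1 to 1 in x and y, and the names project and trivially_outside are hypothetical. Trimming of partially visible triangles is not shown.

```python
def project(vertex, focal=1.0):
    """Map a camera-space point to 2D, keeping z for depth ordering."""
    x, y, z = vertex
    return (focal * x / z, focal * y / z, z)


def trivially_outside(triangle, half_width=1.0, half_height=1.0):
    """Return True if the triangle can be removed without trimming."""
    if all(v[2] <= 0.0 for v in triangle):
        return True  # entirely behind the viewer
    if any(v[2] <= 0.0 for v in triangle):
        return False  # straddles the viewer plane; needs real clipping
    pts = [project(v) for v in triangle]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (max(xs) < -half_width or min(xs) > half_width
            or max(ys) < -half_height or min(ys) > half_height)


near_point = project((2.0, 1.0, 4.0))
far_point = project((2.0, 1.0, 8.0))   # same x, y but twice as deep

off_screen = [(8.0, 0.0, 2.0), (10.0, 0.0, 2.0), (8.0, 2.0, 2.0)]
on_screen = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0)]
```

In this sketch the deeper point lands closer to the screen center, which is the foreshortening effect, and only the first of the two triangles is rejected outright.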
FIG. 2 further illustrates how a prior art rasterizer stage 130 in a 3D graphics engine operates. First, the rasterizer calculates the centers for each pixel in the triangle or primitive and assigns x and y values to these centers. The rasterizer stage then converts each triangle or primitive into a series of horizontal spans, with one span generated for each integer y value that falls inside the triangle. For each horizontal span, the rasterizer computes the two endpoints, i.e. the points where the horizontal span crosses the edges or boundaries of the triangle or primitive. The rasterizer will also interpolate color values and perspective-corrected texture coordinates for the endpoints. Next, the rasterizer generates the series of pixels along the span, again interpolating color and texture coordinates for each pixel between the two endpoints of the horizontal span. Several operations are then performed at each pixel. First each pixel has its z (depth) value compared to the z (depth) value for the currently displayed pixel in the same location. The currently displayed pixel has its z (depth) value stored in a z buffer. If the comparison indicates that this new pixel is behind the old one, the new pixel is discarded. If the comparison indicates that this new pixel is in front of the old pixel then the z test succeeds and the new pixel color is computed. This can include texture mapping and alpha blending. Accordingly, prior art 3D graphics are computed serially because each z (depth) value must be compared to the previously displayed z (depth) value.
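The span-based rasterization and per-pixel z test described above can be sketched as follows. The sketch assumes that pixel centers lie at integer coordinates and that a smaller z value is nearer the viewer; it interpolates only depth and omits color and texture interpolation. The names edge_cross and rasterize, and the dictionary-based buffers, are hypothetical.

```python
import math


def edge_cross(p, q, y):
    """Return (x, z) where edge p-q crosses scanline y, else None."""
    (x0, y0, z0), (x1, y1, z1) = p, q
    if y0 == y1 or not (min(y0, y1) <= y <= max(y0, y1)):
        return None  # horizontal edge, or scanline misses this edge
    t = (y - y0) / (y1 - y0)
    return (x0 + t * (x1 - x0), z0 + t * (z1 - z0))


def rasterize(tri, color, zbuf, cbuf):
    """Draw one triangle of (x, y, z) vertices as horizontal spans."""
    ys = [v[1] for v in tri]
    for y in range(math.ceil(min(ys)), math.floor(max(ys)) + 1):
        hits = [edge_cross(tri[i], tri[(i + 1) % 3], y) for i in range(3)]
        hits = [h for h in hits if h is not None]
        if len(hits) < 2:
            continue
        (xl, zl), (xr, zr) = min(hits), max(hits)  # span endpoints
        for x in range(math.ceil(xl), math.floor(xr) + 1):
            t = 0.0 if xr == xl else (x - xl) / (xr - xl)
            z = zl + t * (zr - zl)  # depth interpolated along the span
            if z < zbuf.get((x, y), float("inf")):  # z test
                zbuf[(x, y)] = z
                cbuf[(x, y)] = color


zbuf, cbuf = {}, {}
rasterize(((0, 0, 1.0), (4, 0, 1.0), (0, 4, 1.0)), "A", zbuf, cbuf)
rasterize(((0, 0, 2.0), (4, 0, 2.0), (0, 4, 2.0)), "B", zbuf, cbuf)  # behind A
rasterize(((0, 0, 0.5), (2, 0, 0.5), (0, 2, 0.5)), "C", zbuf, cbuf)  # in front
```

In this usage, triangle B fails the z test at every pixel and is discarded, while the smaller triangle C passes and overwrites the pixels it covers.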
Generally, a prior art 3D graphics engine will serially perform these steps on each triangle or primitive one at a time, such that the triangles or primitives are processed in an orderly fashion one after the other in series. One reason this is done is to avoid any dependencies which may occur between the primitives or triangles as they are each executed. As explained earlier, as each primitive is executed for processing, the z value for each pixel location in the triangle or primitive is compared with the z value previously displayed in that same location on the two-dimensional screen in order to determine whether the new pixel should overwrite (appear in front of) the old value or be ignored (appear behind). If the triangles are not executed in order, then the z-value test results will be faulty.
However, the present invention is directed toward a method, apparatus and computer program product for parallel execution of primitives in 3D graphics engines. It includes detection and preservation of dependences between graphics primitives. Accordingly, the present invention has the ability to execute multiple independent primitives concurrently while preserving their ordering because the architecture of the graphics engine for the present invention further provides concurrent resources for parallel execution. In a first preferred embodiment, primitives are executed in parallel using an in-order dispatch unit capable of detecting dependencies between primitives. In a second preferred embodiment, an out-of-order dispatch unit is used such that not only are primitives executed concurrently, but the primitives may also be executed in any order when dependencies are detected.
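The difference between in-order and out-of-order dispatch can be illustrated with a simplified scheduling model. The text above does not state how dependencies are detected; the sketch assumes, purely for illustration, that two primitives are dependent when their screen-space bounding boxes overlap, and it groups primitives into waves that execute concurrently with no limit on execution resources. All names are hypothetical.

```python
def overlaps(a, b):
    """Boxes are (x0, y0, x1, y1); True if they share any pixel."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def in_order_waves(boxes):
    """Issue in program order; a conflict with the current wave stalls."""
    waves = []
    for i, box in enumerate(boxes):
        if waves and not any(overlaps(box, boxes[j]) for j in waves[-1]):
            waves[-1].append(i)
        else:
            waves.append([i])  # wait for the current wave to finish
    return waves


def out_of_order_waves(boxes):
    """Place each primitive just after its last conflicting predecessor."""
    waves, wave_of = [], []
    for i, box in enumerate(boxes):
        earliest = 0
        for j in range(i):
            if overlaps(box, boxes[j]):
                earliest = max(earliest, wave_of[j] + 1)
        while len(waves) <= earliest:
            waves.append([])
        waves[earliest].append(i)
        wave_of.append(earliest)
    return waves


boxes = [
    (0, 0, 10, 10),    # primitive 0
    (5, 5, 15, 15),    # primitive 1, overlaps primitive 0
    (20, 20, 30, 30),  # primitive 2, independent of 0 and 1
    (25, 25, 35, 35),  # primitive 3, overlaps primitive 2
]
in_order = in_order_waves(boxes)
out_of_order = out_of_order_waves(boxes)
```

Under this model, in-order dispatch needs three waves, [[0], [1, 2], [3]], while out-of-order dispatch needs two, [[0, 2], [1, 3]]. In both cases every pair of overlapping primitives still executes in its original relative order.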