Parallel Processing. Using multiple CPUs simultaneously to solve a single problem or execute a single program, and thereby reducing the time required, is an old and well-studied idea. In fact, parallel processing is an entire sub-discipline of computer science. Any system for parallel solution of a problem or parallel execution of a program has two components: a ‘problem decomposition’ strategy (a scheme, a method, or a combination of methods), and an execution vehicle (a machine or system). In other words, the problem must be broken down into multiple parts, and these parts must then be distributed to and executed by the multiple CPUs. Problems can sometimes be broken down into independent parts, which may be pursued completely in parallel, with no interaction between the sub-programs executed on the CPUs and no specific ordering required among them. Other problem decompositions have inter-dependent parts, where the dependencies are either implicit in the problem or created by the decomposition.
Problem decomposition methods can be sorted into two large categories: decomposition by domain, where the function to be performed remains the same and the data to be processed is distributed to multiple CPUs, and decomposition by function, where the work to be done on each datum is broken up into sub-functions and each CPU is responsible for performing its sub-function on all the data. Both types of decomposition can be achieved by two major means: implicit, problem-aware, ad hoc means built into the system, or ‘algorithmic decomposition’. In algorithmic decomposition, the original program, or a representation of that program, which encapsulates the single-CPU, sequential semantics of a solution to the problem, is decomposed into multiple programs. Most interesting problem decompositions combine both types of decomposition, using elements of both means. The resulting CPU sub-programs may be completely independent, or ‘perfectly parallel’; they may be organized into successive, overlapping, sub-functional stages, as in an assembly line or ‘pipeline’; or they may have any number of dependencies and independences, forming an arbitrary dependency graph.
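The two categories of decomposition described above can be sketched as follows. This is a minimal illustration, not a method taken from any particular system: the sub-functions `scale` and `offset`, and the helpers `split`, `decompose_by_domain` and `decompose_by_function`, are hypothetical names chosen for the example. Threads are used only to show the structure; in standard CPython builds they would not speed up CPU-bound work of this kind.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def scale(x):          # hypothetical sub-function 1
    return x * 2


def offset(x):         # hypothetical sub-function 2
    return x + 1


def full_function(x):  # the complete work to be done on one datum
    return offset(scale(x))


def split(data, parts):
    """Cut data into at most `parts` contiguous chunks."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def decompose_by_domain(data, workers=4):
    """Same function on every worker; the data is distributed."""
    def process(chunk):
        return [full_function(x) for x in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process, split(data, workers)))
    return [y for chunk in results for y in chunk]


def decompose_by_function(data, stages=(scale, offset)):
    """One worker per sub-function; every datum flows through all workers."""
    done = object()  # sentinel marking the end of the stream
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def run_stage(fn, inbox, outbox):
        while True:
            item = inbox.get()
            if item is done:
                outbox.put(done)
                return
            outbox.put(fn(item))

    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for x in data:
        queues[0].put(x)
    queues[0].put(done)

    out = []
    while True:
        item = queues[-1].get()
        if item is done:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out
```

Both functions return the same results as applying `full_function` to each datum in sequence; the first is a ‘perfectly parallel’ arrangement, and the second is a two-stage pipeline.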
Systems for parallel execution of the sub-programs can be classified by their similarity to two opposing models: those that have a central, master unit directing the flow of work, and those that are modeled as a de-centralized network of independent processors. Of course, many systems lie somewhere between these two extremes.
As stated above, the field of parallel processing is rich in research, and there is much prior art. However, there is as yet no general solution for all problems, and every parallel processing system is better at some sorts of problems than others. Many problems still have unexploited potential for parallelism, and many improvements may be made to parallel processing systems for different classes of problems.
Dynamic Code Generation. ‘Dynamic code generation’ is a technique whereby code is compiled or prepared for execution dynamically, by a program which will need to call or invoke it. This code is often created at the last possible moment, or ‘just-in-time’. If the code is created only when it is about to be used, it will not be generated if it is never used, and this can save compilation time and program space. After compilation, the new routine can be retained, or cached, in case it is needed again. The required routine may be called under a particular set of prevailing conditions, or with specific arguments, that suggest a simpler, more efficient, custom compilation unique to that invocation or set of conditions. In that case, the dynamic compiler might create a special version of the code to be used only under those conditions or with a similar invocation. Dynamic compilation may also allow superior general-purpose optimizations, based on facts that were unknown when the program in question was specified but are known at the time of execution.
Dynamic code generation has often been used in environments where there is no obvious ‘program’ to be compiled, where a fixed function is replaced by a run-time generated, run-time specialized and optimized routine, in order to gain improved performance over statically compiled, necessarily general code. Because the ‘program’ is often not represented in formal semantic terms, or is represented only by the previously compiled machine code for the function to be replaced, and because new code must be produced quickly in a run-time environment, dynamic code generators and optimizers are frequently simple affairs, exploiting high-leverage, problem-aware ad hoc methods or tricks to achieve their ends. In this case, the more high-leverage, informal or implicit, problem-specific information that can be imparted to these code generators, the better they can potentially perform.
One application in which parallel processing and dynamic code generation may be combined is a three-dimensional graphical image rendering system, or ‘graphics pipeline’.
Definition of Graphics Pipeline. Three-dimensional (3D) computer graphics display programs simulate, on a two-dimensional display, the effect that the display is a window into a three-dimensional scene. This scene can contain multiple 3D objects, at different apparent distances from the window, and the window has a viewpoint or camera angle with respect to the scene and its objects. Objects can be colored and textured, and the objects can seem to be illuminated by light sources of different types and colors.
A software program that models and displays 3D objects can be divided into two parts: an ‘application program’ which relies on a set of high-level functions to manipulate and display graphical data, and a graphics software library that provides these functions.
3D objects consist of geometric shapes, at certain positions in the 3D world, with certain properties or attributes. These objects are defined and maintained by the application program as a collection of geometric primitives. These primitives are then described to the graphics library, which draws, or renders, them onto the two-dimensional (2D) display, with all necessary positioning, orientation, perspective scaling, coloring, texturing, lighting, or shading effects performed on each primitive as it appears in the window view. This represents a series of processing steps on geometric primitives and their component data, as they progress from spatial coordinate and attribute definition to final 2D picture element (pixel) form on the screen. A software and hardware system that accomplishes this drawing of geometric primitives is called an image renderer, or a rendering ‘engine’, and the series of processing stages used is termed the ‘graphics pipeline’.
Definition of terms, description of pipeline processing stages. FIG. 1 shows a generic graphics pipeline 100 for a rendering engine according to the prior art. Different renderers support different options and features, and use various techniques to perform the required processing at each stage. Operations and stages can also be, explicitly or implicitly, performed in different orders in different implementations, while preserving the same apparent rendering model. Stages or portions of stages may be performed to varying degrees by either software or hardware. There are also many different groupings or organizations of the component operations into pipeline stages for the purposes of exposition, and the terminology in the art is not uniform from one implementation to another.
The following definitions are used in the descriptions of the graphics pipelines below:
Primitive: a collection of points in 3D space forming a point, a line, a triangle, or other polygon, with associated properties.
Vertex: one of the points defining a primitive.
Object: a collection of primitives.
Normal: for a point on the surface of a primitive, a vector defined to be normal or perpendicular to the surface of the primitive at that point.
Model space: a 3D coordinate space in which an individual object is defined, apart from a 3D scene in which it may be placed.
World space: the coordinate space of the 3D scene.
Viewport or Camera: the window, with its associated orientation, position and perspective relative to the scene, through which the 3D scene is apparently being viewed.
View space: the coordinate space of the 3D scene, as seen from the viewpoint of the camera.
Face: a planar polygon in an object, either front-facing (toward the camera), or back-facing (away from the camera).
Model Transformation: scaling and placing an object in the scene, transforming its vertex coordinates from model space to world space.
Viewing transformation: translating (moving, positioning), and rotating (orienting) vertices to account for viewing position and orientation with respect to the scene, transforming vertex coordinates from world space to view space.
Material: light reflectivity properties.
Texture, or texture map: an image, which may be designed to visually mimic the surface properties of a physical material.
Lighting: the interaction of light sources of different types and colors, with colors and materials and textures, at vertices.
Primitive assembly: determining primitives as defined by the application, and gathering their component vertex coordinates and attributes, in preparation for further processing.
Clipping: removing primitives or portions of primitives which are not visible, or fall ‘outside’ the field and depth of view of the viewport.
Projection Transformation: creating the 2D projection of points in view space, onto the plane of the viewport or “film” of the camera, transforming spatial coordinates of vertices to 2D display locations and depths.
Culling: removing (deciding not to render) a face of a polygon.
Vertex Processing: vertex coordinate transformations, and lighting of vertices.
Frame buffer: a 2D memory array containing bit patterns encoded in a form which directly represents the colored dots or rectangles on the computer's hardware display screen.
Pixel: a single colored picture element (dot or rectangle) in the frame buffer.
Fragment or pre-pixel: a single colored picture element, located in a 2D image corresponding to the frame buffer, before it is written to the display frame buffer.
Rasterize: to choose the fragments in the 2D projected image that correspond to the outline and/or interior of a primitive.
Shading, or Fragment Shading: determining the color of a fragment, taking into account vertex colors, lighting, and textures.
Buffer or Raster operations: raster (pixel) operations done on fragments after shading, as they are written to pixels in the frame buffer, or to determine whether or not they should be written, according to a number of tests.
Fragment processing: fragment shading and buffer operations, starting with fragments and yielding pixels.
A detailed description of the stages in the pipeline of FIG. 1 follows:
Transform 102: All vertices are transformed from model space to world space, and then transformed to view space, i.e., translated and rotated correctly in order to account for the viewpoint.
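The model and viewing transformations of this stage are commonly expressed as 4x4 matrices applied to vertices in homogeneous coordinates. The following sketch assumes that representation, with column vectors and matrices stored as nested lists; the helper names (`mat_mul`, `mat_vec`, `translation`, `scaling`, `rotation_y`) are chosen for this example and are not taken from any particular library.

```python
import math


def mat_mul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_vec(m, v):
    """Apply a 4x4 matrix to a homogeneous column vector [x, y, z, 1]."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]


def rotation_y(angle):
    """Rotation about the y axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]


# Model transformation: scale the object by 2, then place it at x = 10.
model = mat_mul(translation(10, 0, 0), scaling(2))
# Viewing transformation for a camera at (0, 0, 5) with no rotation:
# the scene is moved by the inverse of the camera position.
view = translation(0, 0, -5)
# Both transformations can be combined once and applied to every vertex.
model_view = mat_mul(view, model)
```

With these matrices, the model-space vertex `[1, 1, 1, 1]` maps to `[12, 2, 2, 1]` in world space and to `[12, 2, -3, 1]` in view space.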
Light 104: Vertices are lighted from different sources, and the resulting color is dependent on the source color and intensity, incidence angle of a directional source with the vertex's normal, distance of the source, the reflectivity of an associated material, and the original vertex color. If the primitive is a polygon, and a texture is to be applied to the face, texture map coordinates are assigned to the vertices.
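The dependencies listed for this stage can be illustrated with a simple diffuse lighting calculation for a single point light. This is a sketch under stated assumptions rather than the lighting model of any specific renderer: the cosine (Lambertian) term and the attenuation formula `intensity / (1 + distance**2)` are choices made for the example, as is the function name `light_vertex`. The light is assumed not to coincide with the vertex.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def light_vertex(position, normal, vertex_color, reflectivity,
                 light_position, light_color, light_intensity):
    """Return the lit RGB color of one vertex for one point light."""
    to_light = [l - p for l, p in zip(light_position, position)]
    distance = math.sqrt(sum(c * c for c in to_light))
    direction = [c / distance for c in to_light]
    # Incidence angle: cosine between the normal and the light direction,
    # clamped so that surfaces facing away from the light receive nothing.
    cos_incidence = max(
        0.0, sum(a * b for a, b in zip(normalize(normal), direction)))
    # Assumed attenuation with distance; real renderers offer several forms.
    attenuation = light_intensity / (1.0 + distance * distance)
    return [min(1.0, vc * mr * lc * cos_incidence * attenuation)
            for vc, mr, lc in zip(vertex_color, reflectivity, light_color)]
```

Contributions from several light sources would typically be computed in the same way and summed per color channel.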
Assemble 106: Vertices are assembled into primitives, as they have been defined by the application program.
Project 108: Primitives are clipped to conform to the field and depth of view, the ‘viewing volume’. They are then projected, possibly with perspective, onto the plane of the viewport, yielding a 2D image, with each vertex position now represented as a 2D display location and a depth. Polygon faces to be culled are discarded, and not processed further.
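A perspective projection of a single view-space vertex can be sketched as below. The conventions are assumptions made for this example: the camera looks down the negative z axis, `focal_length` controls the field of view, and the viewing volume is bounded by `near` and `far` depths. For brevity the sketch rejects a whole vertex that falls outside the viewing volume; a real clipper would instead cut the primitive at the volume boundary and create new vertices there.

```python
def project_vertex(vertex, focal_length, width, height, near=0.1, far=100.0):
    """Project a view-space point to (screen_x, screen_y, depth).

    Returns None if the point lies outside the viewing volume.
    """
    x, y, z = vertex
    depth = -z  # distance in front of the camera
    if not (near <= depth <= far):
        return None
    # Perspective divide: farther points move toward the image center.
    image_x = focal_length * x / depth
    image_y = focal_length * y / depth
    if abs(image_x) > 1.0 or abs(image_y) > 1.0:
        return None  # outside the field of view
    # Map the image plane range [-1, 1] to display locations, with
    # screen y increasing downward.
    screen_x = (image_x + 1.0) * 0.5 * (width - 1)
    screen_y = (1.0 - image_y) * 0.5 * (height - 1)
    return (screen_x, screen_y, depth)
```

For a 101 by 101 display, a point straight ahead of the camera at depth 2 lands at the center location (50, 50), and the depth is carried along for later hidden surface removal.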
Rasterize 110: Primitive fragments corresponding to outlines and interiors are identified in the 2D image. ‘Anti-aliasing’, or modification of fragment colors at outlines of primitives in order to make the outline appear smoother, is done at this stage.
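One common way to identify the fragments in the interior of a projected triangle is to test each candidate pixel center against the three edges of the triangle. The sketch below uses that edge-function approach, which is an assumption of this example and not necessarily the technique of any given renderer; anti-aliasing is omitted, and the function names are hypothetical.

```python
import math


def edge(a, b, p):
    """Signed area term: which side of the line a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_triangle(v0, v1, v2):
    """Return the (x, y) fragments whose centers lie inside the triangle."""
    area = edge(v0, v1, v2)
    if area == 0:
        return []  # degenerate triangle covers no fragments
    xs = [v0[0], v1[0], v2[0]]
    ys = [v0[1], v1[1], v2[1]]
    fragments = []
    # Only fragments inside the bounding box can be covered.
    for y in range(math.floor(min(ys)), math.ceil(max(ys)) + 1):
        for x in range(math.floor(min(xs)), math.ceil(max(xs)) + 1):
            center = (x + 0.5, y + 0.5)
            w0 = edge(v1, v2, center)
            w1 = edge(v2, v0, center)
            w2 = edge(v0, v1, center)
            # Accept either winding order by comparing against the sign
            # of the triangle's own area.
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                fragments.append((x, y))
    return fragments
```

Fragments whose centers fall exactly on an edge are accepted here; production rasterizers apply a tie-breaking rule so that two triangles sharing an edge do not both produce the same fragment.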
Shade 112: Primitive fragments are shaded, or colored, according to one of several possible methods, by either interpolating the colors at the vertices of the enclosing primitive or by interpolating from vertex normals and re-lighting the fragments individually. If a texture is to be applied, texture map coordinates are interpolated and assigned to each fragment, and the indicated texture color is mixed in to yield the shaded fragment color.
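The first shading method mentioned, interpolating the colors at the vertices of the enclosing primitive, can be sketched with barycentric weights for a triangle. The names below are chosen for this example, and multiplying the interpolated color by the texture color is only one assumed way of mixing in a texture.

```python
def edge(a, b, p):
    """Signed area term for the line a->b and the point p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def barycentric(v0, v1, v2, p):
    """Weights of the three vertices at point p; they sum to 1."""
    area = edge(v0, v1, v2)  # assumed non-zero (non-degenerate triangle)
    return (edge(v1, v2, p) / area,
            edge(v2, v0, p) / area,
            edge(v0, v1, p) / area)


def shade_fragment(vertices, colors, p, texel=None):
    """Interpolate vertex colors at p, optionally modulated by a texel."""
    w = barycentric(vertices[0], vertices[1], vertices[2], p)
    color = [sum(w[i] * colors[i][c] for i in range(3)) for c in range(3)]
    if texel is not None:
        # Assumed mixing rule: per-channel multiplication.
        color = [a * b for a, b in zip(color, texel)]
    return color
```

The same weights can be used to interpolate texture map coordinates or vertex normals across the primitive.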
Buffer 114: As fragments are converted to pixels and written to the frame buffer, several tests are performed to determine whether or not they should be written, for example to allow displaying the image inside a stencil, window, or rectangle. Hidden surface removal may also be done by recording the depth, or ‘z’ value, of a pixel in a ‘z-buffer’ as the pixel is written to the 2D frame buffer. As new pixels are written to the frame buffer, their depth or z value is compared to the z-buffer value of the pixel previously written at that 2D location. If the new pixel is closer to the viewport, it is written; if it is farther away than (behind) the old pixel, it is not written.
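The depth comparison described above can be sketched as a small frame buffer class. The class and method names are hypothetical; smaller depth values are assumed to be closer to the viewport, and every location starts at an infinite depth so that the first write always succeeds.

```python
class FrameBuffer:
    """A 2D color buffer with a matching z-buffer."""

    def __init__(self, width, height, clear_color=(0, 0, 0)):
        self.color = [[clear_color] * width for _ in range(height)]
        self.depth = [[float("inf")] * width for _ in range(height)]

    def write(self, x, y, depth, color):
        """Write the pixel only if it is closer than what is already there."""
        if depth < self.depth[y][x]:
            self.depth[y][x] = depth
            self.color[y][x] = color
            return True
        return False  # hidden behind a previously written pixel
```

Because the test is made per pixel, primitives can be drawn in any order and the nearest surface still ends up visible.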
Pixel colors may also be blended with the color of pixels already in the frame buffer, depending on the opacity of those colors, in order to simulate transparency of nearer surfaces. Pixel colors may be ‘dithered’ or modified based on their near neighbors as a way of smoothing color transitions or simulating shades. Finally, source and destination pixels in the frame buffer may be combined according to one of several logical operations performed as part of the block transfer (BLT) to the frame buffer.
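Two of these operations, blending and logical combination, can be sketched as follows. The blend shown is the usual weighted average controlled by the opacity (alpha) of the incoming color, and the logical operations act on colors packed into integers; the function names and the particular set of operations offered are assumptions of this example.

```python
def blend(source, destination, alpha):
    """Mix an incoming color over an existing one.

    alpha is the opacity of the source: 1.0 replaces the destination,
    0.0 leaves it unchanged.
    """
    return tuple(alpha * s + (1.0 - alpha) * d
                 for s, d in zip(source, destination))


# A few logical raster operations on packed integer pixel values.
RASTER_OPS = {
    "copy": lambda s, d: s,
    "and": lambda s, d: s & d,
    "or": lambda s, d: s | d,
    "xor": lambda s, d: s ^ d,
}


def raster_op(source, destination, op):
    """Combine source and destination pixels with a named logical operation."""
    return RASTER_OPS[op](source, destination)
```

For instance, blending a half-transparent red over blue with `blend((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.5)` yields an even mix of the two.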
Another view of a graphics pipeline according to the prior art is seen in FIG. 2. In this pipeline 200, there are just three stages: ‘Process Vertices’ 202, ‘Process Primitives’ 204, and ‘Process Fragments’ 206. FIG. 1 ‘Transform’ (model and view transformations) 102, and FIG. 1 ‘Light’ 104 (lighting) are collapsed into FIG. 2 ‘Process Vertices’ 202, yielding lighted, 3D position-transformed vertices. FIG. 2 ‘Process Primitives’ 204 combines FIG. 1 ‘Assemble’ 106 (primitive assembly), FIG. 1 ‘Project’ 108 (clipping, projection, and culling), and FIG. 1 ‘Rasterize’ 110 (rasterization) yielding visible fragments within the 2D image corresponding to primitive outlines and/or interiors. FIG. 2 ‘Process Fragments’ 206 incorporates FIG. 1 ‘Shade’ 112 (fragment shading and texture application to color fragments), and FIG. 1 ‘Buffer’ 114 (raster or buffer operations), finally yielding pixels 116 in the frame buffer.
In typical practice, aspects of the ‘Project’ 108 computation may be split across vertex processing and primitive processing. All vertex position transformations, including those due to projection onto multiple depth 2D planes, can be done in ‘Process Vertices’, while those aspects of projection necessary for clipping and final mapping to the viewport are done in ‘Process Primitives’. This may be done in order to group all like position transformations, involving matrix arithmetic on vertex vectors, into one phase. Exactly which stage carries out each part of the logical graphics computation is not of primary importance. More important is that each of the three large stages is concerned with processing associated with one major data type: vertices, primitives, or fragments.
Existing practice in graphics pipelines.
SIMD CPU instructions. Many computer CPUs now incorporate SIMD (single-instruction-multiple-data) instructions, which can perform certain single operations on multiple data at once. These instructions have been geared toward common low-level operations in the graphics pipeline, and software graphics library implementations can show dramatically improved performance through their use. It is important, however, that the library organizes its computations so that data is available and staged appropriately, to take best advantage of these SIMD capabilities.
Multi-core CPUs. CPUs are now available with multiple instruction-processing cores, which may run independently of each other. If tasks in the graphics pipeline can be divided and scheduled so that many different operations are done in parallel, in independent threads of execution, this can provide a multiplicative speed increase, ideally approaching the number of cores, over a single program that must perform all the operations in sequence. Multi-core techniques have heretofore seen limited application in software graphics pipeline implementations.
Hardware GPU functions. Many of the functions of a graphics pipeline can be performed by the hardware graphics processing unit, or GPU. GPUs support many fixed-functionality operations, and many also have the capability of running programs locally, independent of the computer CPU. Hardware GPU functions or GPU programs may be considerably faster than their main CPU software counterparts.
Shader Programs. ‘Vertex shaders’, or ‘vertex programs’, can optionally be supplied to the graphics library to perform some or all of the functions of vertex processing. Likewise, ‘fragment shaders’, or ‘pixel shaders’, can take over much of the job of fragment processing. These programs can be executed by the computer's CPU, or they may run in part or entirely on the hardware GPU. Several standards and languages exist for these vertex and fragment shader programs, which are then compiled for execution on the CPU and/or the GPU.
Programmable vertex and fragment processing allow flexibility and specialization in the performance of these operations, allowing new functionality, or higher performance. Support for programmable shaders is a required feature in several graphics library definitions, and many compatible implementations exist. However, the compilation of the shader program, the quality of the resulting code, and the use of CPU and GPU resources and their effects on performance, differ considerably from one implementation to another.
Dynamic code generation. Dynamic code generation is used in various ways in many aspects of existing fixed-function and programmable graphics pipelines, but generation and caching policies, language translation techniques and optimizations, and effectiveness and scope of utility vary with the implementation.
For example, in some graphics libraries, dynamic code generation is limited to the compilation of application-provided vertex and fragment programs. Or, if dynamic code is also used to accelerate fixed graphics pipeline functions, there may be some elements of the graphics pipeline implementation which must be implemented in a static fashion, or by separate dynamically created functions, to leave those stages ‘open’ for replacement by either application-provided or GPU-supported functions. The ideal case is to have all functions of the graphics pipeline supported by dynamically created code optimized for the specific CPU and GPU capabilities of the computer system.