Background: 3D Computer Graphics
One of the driving features in the performance of most single-user computers is computer graphics. This is particularly important in computer games and workstations, but is generally very important across the personal computer market.
For some years, the most critical area of graphics development has been in three-dimensional (“3D”) graphics. The peculiar demands of 3D graphics are driven by the need to present a realistic view, on a computer monitor, of a three-dimensional scene. The pattern written onto the two-dimensional screen must, therefore, be derived from the three-dimensional geometries in such a way that the user can easily “see” the three-dimensional scene (as if the screen were merely a window into a real three-dimensional scene). This requires extensive computation to obtain the correct image for display, taking account of surface textures, lighting, shadowing, and other characteristics.
The starting point (for the aspects of computer graphics considered in the present application) is a three-dimensional scene, with specified viewpoint and lighting (etc.). The elements of a 3D scene are normally defined by sets of polygons (typically triangles), each having attributes such as color, reflectivity, and spatial location. (For example, a walking human, at a given instant, might be translated into a few hundred triangles which map out the surface of the human's body.) Textures are “applied” onto the polygons, to provide detail in the scene. (For example, a flat, carpeted floor will look far more realistic if a simple repeating texture pattern is applied onto it.) Designers use specialized modelling software tools, such as 3D Studio, to build textured polygonal models.
The 3D graphics pipeline consists of two major stages, or subsystems, referred to as geometry and rendering. The geometry stage is responsible for managing all polygon activities and for converting three-dimensional spatial data into a two-dimensional representation of the viewed scene, with properly-transformed polygons. The polygons in the three-dimensional scene, with their applied textures, must then be transformed to obtain their correct appearance from the viewpoint of the moment; this transformation requires calculation of lighting (and apparent brightness), foreshortening, obstruction, etc.
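As a rough illustration of the foreshortening step performed by the geometry stage, the following sketch projects a camera-space point onto a two-dimensional screen using a simple pinhole model. The function name, the focal length parameter, and the screen dimensions are assumptions made for this example and are not taken from the present application.

```python
def project(point, focal_length, width, height):
    """Project a camera-space point (x, y, z) to screen coordinates.

    Hypothetical pinhole model: the camera looks along +z, and the
    screen origin is the top-left corner. Returns None for points at
    or behind the camera plane.
    """
    x, y, z = point
    if z <= 0:
        return None
    # Dividing by depth is what produces foreshortening: distant
    # points are drawn closer to the screen centre.
    screen_x = width / 2 + focal_length * x / z
    screen_y = height / 2 - focal_length * y / z
    return (screen_x, screen_y)


near = project((1.0, 1.0, 2.0), 100.0, 640, 480)
far = project((1.0, 1.0, 4.0), 100.0, 640, 480)
```

Doubling the depth of the point halves its offset from the screen centre, which is the foreshortening effect referred to above.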
However, even after these transformations and extensive calculations have been done, there is still a large amount of data manipulation to be done: the correct values for EACH PIXEL of the transformed polygons must be derived from the two-dimensional representation. (This requires not only interpolation of pixel values within a polygon, but also correct application of properly oriented texture maps.) The rendering stage is responsible for these activities: it “renders” the two-dimensional data from the geometry stage to produce correct values for all pixels of each frame of the image sequence.
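The interpolation of pixel values within a polygon can be sketched with barycentric weights over a triangle. This is only one common way of doing it, shown here under the assumption of a single scalar attribute per vertex; the function names are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area term: twice the signed area of triangle (a, b, p)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def interpolate(tri, values, px, py):
    """Interpolate per-vertex values at pixel (px, py).

    tri is three (x, y) screen-space vertices; values holds one
    attribute per vertex (e.g. a brightness). Returns None when the
    pixel lies outside the triangle or the triangle is degenerate.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return None
    # Each weight is the sub-triangle area opposite a vertex,
    # normalised by the full triangle area.
    w0 = edge(x1, y1, x2, y2, px, py) / area
    w1 = edge(x2, y2, x0, y0, px, py) / area
    w2 = edge(x0, y0, x1, y1, px, py) / area
    if w0 < 0 or w1 < 0 or w2 < 0:
        return None
    return w0 * values[0] + w1 * values[1] + w2 * values[2]


triangle = [(0, 0), (4, 0), (0, 4)]
brightness = [0.0, 100.0, 200.0]
inside = interpolate(triangle, brightness, 1, 1)
```

A renderer repeats this kind of evaluation for every covered pixel of every polygon in every frame, which is why the per-pixel workload is so large.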
The most challenging 3D graphics applications are dynamic rather than static. In addition to changing objects in the scene, many applications also seek to convey an illusion of movement by changing the scene in response to the user's input. Whenever a change in the orientation or position of the camera is desired, every object in a scene must be recalculated relative to the new view. As can be imagined, a fast-paced game needing to maintain a high frame rate will require many calculations and many memory accesses.
Background: Texturing
There are different ways to add complexity to a 3D scene. Creating more and more detailed models, consisting of a greater number of polygons, is one way to add visual interest to a scene. However, adding polygons necessitates paying the price of having to manipulate more geometry. 3D systems have what is known as a “polygon budget”, an approximate number of polygons that can be manipulated without unacceptable performance degradation. In general, fewer polygons yield higher frame rates.
The visual appeal of computer graphics rendering is greatly enhanced by the use of “textures”. A texture is a two-dimensional image which is mapped onto the data to be rendered. Textures provide a very efficient way to generate the level of minor surface detail which makes synthetic images realistic, without requiring transfer of immense amounts of data. Texture patterns provide realistic detail at the sub-polygon level, so the higher-level tasks of polygon-processing are not overloaded. See Foley et al., Computer Graphics: Principles and Practice (2d ed. 1990, corr. 1995), especially at pages 741-744; Paul S. Heckbert, “Fundamentals of Texture Mapping and Image Warping,” Thesis submitted to Dept. of EE and Computer Science, University of California, Berkeley, Jun. 17, 1994; Heckbert, “Survey of Computer Graphics,” IEEE Computer Graphics, November 1986, pp. 56; all of which are hereby incorporated by reference. Game programmers have also found that texture mapping is generally a very efficient way to achieve very dynamic images without requiring a hugely increased memory bandwidth for data handling.
A typical graphics system reads data from a texture map, processes it, and writes color data to display memory. The processing may include mipmap filtering which requires access to several maps. The texture map need not be limited to colors, but can hold other information that can be applied to a surface to affect its appearance; this could include height perturbation to give the effect of roughness. The individual elements of a texture map are called “texels”.
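To make the texel reads concrete, the following sketch performs a bilinear fetch from a single texture map, blending the four texels nearest to the sample position. It is an illustrative assumption, not the filtering used by any particular system: the texture is a plain list of rows holding one scalar per texel, coordinates are in texel units, and edges are clamped. Mipmap filtering would typically repeat a fetch of this kind on more than one map and blend the results.

```python
import math


def bilinear(texture, u, v):
    """Sample a texture (list of rows of scalars) at texel coords (u, v).

    Texel centres are assumed to sit at integer coordinates, and
    coordinates outside the map are clamped to its edge.
    """
    height = len(texture)
    width = len(texture[0])
    u = min(max(u, 0.0), width - 1)
    v = min(max(v, 0.0), height - 1)
    x0 = int(math.floor(u))
    y0 = int(math.floor(v))
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    fx = u - x0
    fy = v - y0
    # Four texel reads per sample: two on each of two adjacent rows.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


texels = [[0.0, 10.0],
          [20.0, 30.0]]
centre = bilinear(texels, 0.5, 0.5)
```

Even this simple filter needs four reads per output pixel, which gives a sense of how quickly texture traffic grows.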
Awkward side-effects of texture mapping occur unless the renderer can apply texture maps with correct perspective. Perspective-corrected texture mapping involves an algorithm that translates “texels” (pixels from the bitmap texture image) into display pixels in accordance with the spatial orientation of the surface. Since the surfaces are transformed (by the host or geometry engine) to produce a 2D view, the textures will need to be similarly transformed by a corresponding transform (normally projective or “affine”). (In conventional terminology, the coordinates of the object surface, i.e. the primitive being rendered, are referred to as an (s,t) coordinate space, and the map of the stored texture is referred to as a (u,v) coordinate space.) The transformation in the resulting mapping means that a horizontal line in the (x,y) display space is very likely to correspond to a slanted line in the (u,v) space of the texture map, and hence many additional reads will occur, due to the texturing operation, as rendering walks along a horizontal line of pixels.
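The difference between affine and perspective-corrected mapping can be sketched along a single span between two projected vertices. The example below assumes each vertex carries a texture coordinate u and a homogeneous depth w; the function names are hypothetical. The perspective-correct version interpolates u/w and 1/w linearly in screen space and divides at each pixel, which is the standard formulation.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def affine_u(u0, u1, t):
    """Screen-space linear interpolation of u; ignores depth."""
    return lerp(u0, u1, t)


def perspective_u(u0, w0, u1, w1, t):
    """Perspective-correct interpolation of u.

    u/w and 1/w vary linearly across the screen, so both are
    interpolated and then divided to recover u at the pixel.
    """
    u_over_w = lerp(u0 / w0, u1 / w1, t)
    one_over_w = lerp(1.0 / w0, 1.0 / w1, t)
    return u_over_w / one_over_w


# Span from a near vertex (w = 1) to a far vertex (w = 3), sampled
# at the screen-space midpoint.
flat = affine_u(0.0, 1.0, 0.5)
correct = perspective_u(0.0, 1.0, 1.0, 3.0, 0.5)
```

At the screen-space midpoint the affine result is 0.5 while the perspective-correct result is 0.25: the nearer part of the surface occupies more of the screen, so the midpoint pixel maps to a texel closer to the near vertex.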
One of the requirements of many 3-D graphics applications (especially gaming applications) is high fill and texturing rates. Gaming and DCC (digital content creation) applications use complex textures, and may often use multiple textures with a single primitive. (CAD and similar workstation applications, by contrast, make much less use of textures, and typically use smaller polygons but more of them.) Achieving an adequately high rate of texturing and fill operations requires a very large memory bandwidth.
Background: Conventional Single Instruction, Multiple Data (SIMD) Processor
A conventional SIMD processor or microcode CPU is designed with the sequencer and arithmetic logic unit (ALU) running in lock step. The sequencer is responsible for calculating the address of the next instruction and fetching it. Fields in the instruction will control how the address for the next instruction is calculated, and other fields will define the ALU operation. The sequencer and ALU will operate in a fixed-phase relationship depending on the degree of pipelining. Sequencer operations other than simple increment can cause stalls and prevent the ALUs from running at maximum efficiency.
Very long instruction word (VLIW) instructions explicitly specify several independent operations. Because VLIW instructions explicitly specify parallelism, it is not necessary to reconstruct parallelism from a serial instruction stream. VLIW architectures aim to reduce stalls by having the microcode instruction wide enough so that sequencer and ALU operations can be expressed in the same instruction, but this does not help with other sources of delay.
With a SIMD ALU (called Fragment Processor in FIG. 1), the cost of a missed ALU cycle is much greater, because every processing element (PE) in the SIMD array is idle for that cycle.
In addition to avoiding sequencer stalls, it is also desirable to hide the latency of memory accesses: global and instruction cache misses, but more importantly texture accesses that can take 100+ cycles to resolve. One way to hide this latency is to run the programs multi-threaded. In graphics, the same program is run on many pixels, so if the program stalls because of a memory read for one set of pixels, the program can be switched to run on another set of pixels, and so on. With enough threads, memory latency can be hidden. The sequencer detects if the current thread is about to access a register with an outstanding load from memory (or an instruction or global cache miss) and will switch to running another thread that can do useful work. Thread switching must be extremely lightweight, i.e. computationally cheap, because a switch is expected every few cycles (typically every 4 to 8 cycles, although this depends purely on the program that has been loaded). The cost should ideally be zero sequencer cycles, but in practice it is 2 cycles because of the pipelining. Accordingly, the system cannot afford to have the SIMD array idle while switching threads.
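The effect of thread count on latency hiding can be sketched with a small cycle-level model. The model is an assumption made for illustration only: every thread does a fixed number of ALU cycles and then issues a load with a fixed latency, threads are chosen round-robin, and switching costs zero cycles (the ideal case mentioned above). The figures of 4 work cycles and 100 cycles of latency are taken from the ranges quoted in the text; the function name is hypothetical.

```python
def alu_utilization(num_threads, work_cycles, latency, total_cycles):
    """Fraction of cycles in which some thread keeps the ALU busy."""
    ready_at = [0] * num_threads          # cycle when each thread's load returns
    remaining = [work_cycles] * num_threads
    current = 0
    busy = 0
    for cycle in range(total_cycles):
        # Prefer the current thread, otherwise the next runnable one.
        chosen = None
        for k in range(num_threads):
            t = (current + k) % num_threads
            if ready_at[t] <= cycle:
                chosen = t
                break
        if chosen is None:
            continue                      # every thread is waiting on memory
        current = chosen
        busy += 1
        remaining[chosen] -= 1
        if remaining[chosen] == 0:
            # The thread issues its load and cannot run until it returns.
            ready_at[chosen] = cycle + 1 + latency
            remaining[chosen] = work_cycles
    return busy / total_cycles


single = alu_utilization(1, 4, 100, 1040)
half = alu_utilization(13, 4, 100, 1040)
full = alu_utilization(26, 4, 100, 1040)
```

Under these assumptions a single thread keeps the ALU busy for only 4 cycles out of every 104, 13 threads reach about half utilization, and 26 threads are enough to cover the latency completely.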
Sequencer with Async SIMD Array
The present application describes a 3D graphics architecture in which a FIFO buffer is placed between the sequencer and the processing elements (PEs). The sequencer and PEs are not designed to run in lock step: instead the sequencer and PEs are decoupled to allow the PEs, which form the SIMD array, to run at 100% efficiency even when the sequencer is switching between threads and performing other flow control operations. Thus, the rate of instruction processing in the PE is not coupled to the rate of instruction processing in the sequencer. The PE and sequencer are therefore logically asynchronous, i.e. decoupled, in that they do not have to work on the same instruction at the same time. As an implementation convenience, however, they run in the same clock domain, so technically they are physically synchronous.
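The benefit of placing a FIFO between the sequencer and the PEs can be sketched with a toy cycle model that compares a lock-step arrangement with a decoupled one. Everything in the model is an assumption made for illustration: the sequencer issues at most one instruction per cycle, it stalls for 2 cycles after every 4 instructions (figures taken from the thread-switch discussion above), and each instruction occupies the PE array for 2 cycles (a made-up value chosen so that the sequencer can, on average, issue faster than the PEs consume). Without that last property no buffer depth could keep the PEs fully busy.

```python
from collections import deque


def lockstep_utilization(burst, switch_cost, pe_cycles, num_instructions):
    """PE utilization when every sequencer stall idles the PE array."""
    busy = 0
    idle = 0
    for i in range(1, num_instructions + 1):
        busy += pe_cycles
        if i % burst == 0:
            idle += switch_cost       # PEs wait out the thread switch
    return busy / (busy + idle)


def decoupled_utilization(fifo_depth, burst, switch_cost, pe_cycles,
                          total_cycles):
    """PE utilization when instructions pass through a bounded FIFO."""
    fifo = deque()
    issued_in_burst = 0
    stall = 0
    pe_left = 0
    busy = 0
    for _ in range(total_cycles):
        # PE side: start the next queued instruction when idle.
        if pe_left == 0 and fifo:
            pe_left = fifo.popleft()
        if pe_left > 0:
            pe_left -= 1
            busy += 1
        # Sequencer side: stall for a thread switch, or issue if room.
        if stall > 0:
            stall -= 1
        elif len(fifo) < fifo_depth:
            fifo.append(pe_cycles)
            issued_in_burst += 1
            if issued_in_burst == burst:
                issued_in_burst = 0
                stall = switch_cost
    return busy / total_cycles


lockstep = lockstep_utilization(4, 2, 2, 400)
decoupled = decoupled_utilization(4, 4, 2, 2, 1000)
```

In this model the lock-step arrangement loses 2 cycles in every 10 to thread switches, whereas the decoupled arrangement is idle only for the first cycle while the FIFO receives its first instruction; the sequencer runs ahead during normal issue and the buffered instructions cover its stalls.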
The disclosed innovations, in various embodiments, provide one or more of at least the following advantages:
    Avoids sequencer stalls.
    Hides the latency of memory accesses.
    Increased speed.
    Increased efficiency.