The present invention relates to 3D computer graphics, and particularly to maximizing processor throughput in 3D graphics systems.
Background: 3D Computer Graphics
One of the driving features in the performance of most single-user computers is computer graphics. This is particularly important in computer games and workstations, but is generally very important across the personal computer market.
For some years the most critical area of graphics development has been in three-dimensional (“3D”) graphics. The peculiar demands of 3D graphics are driven by the need to present a realistic view, on a computer monitor, of a three-dimensional scene. The pattern written onto the two-dimensional screen must therefore be derived from the three-dimensional geometries in such a way that the user can easily “see” the three-dimensional scene (as if the screen were merely a window into a real three-dimensional scene). This requires extensive computation to obtain the correct image for display, taking account of surface textures, lighting, shadowing, and other characteristics.
The starting point (for the aspects of computer graphics considered in the present application) is a three-dimensional scene, with specified viewpoint and lighting (etc.). The elements of a 3D scene are normally defined by sets of polygons (typically triangles), each having attributes such as color, reflectivity, and spatial location. (For example, a walking human, at a given instant, might be translated into a few hundred triangles which map out the surface of the human's body.) Textures are “applied” onto the polygons, to provide detail in the scene. (For example, a flat carpeted floor will look far more realistic if a simple repeating texture pattern is applied onto it.) Designers use specialized modelling software tools, such as 3D Studio, to build textured polygonal models.
The 3D graphics pipeline consists of two major stages, or subsystems, referred to as geometry and rendering. The geometry stage is responsible for managing all polygon activities and for converting three-dimensional spatial data into a two-dimensional representation of the viewed scene, with properly-transformed polygons. The polygons in the three-dimensional scene, with their applied textures, must then be transformed to obtain their correct appearance from the viewpoint of the moment; this transformation requires calculation of lighting (and apparent brightness), foreshortening, obstruction, etc.
However, even after these transformations and extensive calculations have been done, there is still a large amount of data manipulation to be done: the correct values for EACH PIXEL of the transformed polygons must be derived from the two-dimensional representation. (This requires not only interpolation of pixel values within a polygon, but also correct application of properly oriented texture maps.) The rendering stage is responsible for these activities: it “renders” the two-dimensional data from the geometry stage to produce correct values for all pixels of each frame of the image sequence.
FIG. 2 shows a high-level overview of the processes performed in the overall 3D graphics pipeline. However, this is a very general overview, which ignores the crucial issues of what hardware performs which operations.
In 3D graphics many of the geometric and pixel processing operations are done on a mixture of scalars and vectors (mainly 3 or 4 component vectors, but sometimes only 2 components).
Many architectures for high-speed computing have taken different approaches to optimizing both scalar and vector performance. However, 3D graphics is distinguished by the mixture of scalars with short vectors.
One approach to maximizing throughput would be to provide the capability to do all processing on 4-component vectors, using input swizzling and output masking to implement fewer-component-vector or scalar operations. However, this would be very wasteful in compute resources, since in many operations one or more ALUs will be idle.
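By way of illustration only, the idle-ALU cost of the all-4-component approach can be sketched as follows. The function names (vec4_op, lane_utilization), the register contents, and the sample workload are hypothetical and are not taken from any particular hardware; the sketch assumes every operation occupies one cycle of a 4-lane ALU regardless of how many lanes are written.

```python
import operator

LANES = 4  # assumed width of the vector ALU


def vec4_op(op, a, b, swizzle_b=(0, 1, 2, 3), write_mask=(True,) * LANES):
    """One cycle of a hypothetical 4-lane ALU.

    Input b is swizzled, every lane computes a result, and only the
    lanes enabled in write_mask are stored to the destination.
    """
    dest = [0] * LANES
    b_swizzled = [b[i] for i in swizzle_b]
    for lane in range(LANES):
        result = op(a[lane], b_swizzled[lane])  # all lanes do the work
        if write_mask[lane]:
            dest[lane] = result  # only masked lanes are kept
    return dest


def lane_utilization(component_counts, lanes=LANES):
    """Fraction of lanes doing useful work over a stream of operations."""
    return sum(component_counts) / (lanes * len(component_counts))


# A scalar multiply (a.x * b.z) expressed as a masked 4-lane operation:
# lane 0 is useful, lanes 1 to 3 are computed and discarded.
scalar_product = vec4_op(
    operator.mul,
    [2, 0, 0, 0],
    [0, 0, 5, 0],
    swizzle_b=(2, 2, 2, 2),
    write_mask=(True, False, False, False),
)

# Hypothetical mix of scalar, 2-, 3- and 4-component operations.
workload = [1, 3, 4, 1, 2, 3]
utilization = lane_utilization(workload)
```

For this assumed workload only 14 of the 24 issued lane-cycles produce a stored result; the remainder correspond to the idle ALUs described above.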
Vector Instruction Set
The present application discloses an architecture in which vector operations are performed on a scalar ALU, taking up to 4 cycles to process a 4-component vector. Each instruction states its own component count (scalar, 2-vector, 3-vector, or possibly 4-vector). Preferably the sequencer expands these instructions on-the-fly to produce the correct sequence of scalar instructions, and thus the scalar instruction count in memory does not increase over the vector instruction count.
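The on-the-fly expansion described above might be modelled, purely as a sketch, as shown below. The instruction fields, the register naming (r0.x and so on), and the expand function are illustrative assumptions; the actual instruction encoding is not specified here.

```python
from dataclasses import dataclass

COMPONENTS = "xyzw"


@dataclass(frozen=True)
class VectorInstr:
    """Hypothetical stored instruction carrying its own component count."""
    opcode: str
    count: int  # 1 = scalar, 2..4 = vector length
    dest: str
    src_a: str
    src_b: str


def expand(instr):
    """Yield one scalar operation per component, as a sequencer might."""
    if not 1 <= instr.count <= len(COMPONENTS):
        raise ValueError("component count must be between 1 and 4")
    for c in COMPONENTS[:instr.count]:
        yield (instr.opcode, f"{instr.dest}.{c}",
               f"{instr.src_a}.{c}", f"{instr.src_b}.{c}")


# Two instructions are stored in memory...
program = [
    VectorInstr("mul", 3, "r0", "r1", "r2"),  # 3-vector multiply
    VectorInstr("add", 1, "r3", "r0", "r4"),  # scalar add
]

# ...but four scalar operations reach the ALU, one per cycle.
scalar_stream = [op for instr in program for op in expand(instr)]
```

In this sketch the stored program remains two instructions long, while the scalar ALU receives exactly as many operations as there are components to process.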
This guarantees no wasted cycles in the ALU and gives more scope for stall removal between successive vector operations which are dependent (the ALUs are inevitably pipelined for speed).
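The scope for stall removal can be illustrated with a simplified timing model. The model is an assumption made for illustration: operations issue in order, one per cycle, and a result becomes available a fixed number of cycles (latency) after issue. The names count_stalls and dependent_pair are hypothetical.

```python
def count_stalls(ops, latency):
    """Count stall cycles for an in-order, single-issue pipelined ALU.

    ops is a list of (dest, sources) pairs. A result is assumed to be
    readable `latency` cycles after its operation issues.
    """
    ready = {}  # register name -> cycle at which its value is available
    cycle = 0
    stalls = 0
    for dest, sources in ops:
        earliest = max((ready.get(s, 0) for s in sources), default=0)
        if earliest > cycle:
            stalls += earliest - cycle  # wait for the operand
            cycle = earliest
        ready[dest] = cycle + latency
        cycle += 1
    return stalls


def dependent_pair(count):
    """Two expanded operations where b.c depends only on a.c."""
    comps = "xyzw"[:count]
    first = [(f"a.{c}", [f"in.{c}"]) for c in comps]
    second = [(f"b.{c}", [f"a.{c}"]) for c in comps]
    return first + second


stalls_scalar = count_stalls(dependent_pair(1), latency=4)
stalls_vec2 = count_stalls(dependent_pair(2), latency=4)
stalls_vec4 = count_stalls(dependent_pair(4), latency=4)
```

Under these assumptions, with a latency of 4 cycles, a dependent pair of scalars stalls for 3 cycles and a pair of 2-vectors for 2 cycles, while a pair of 4-vectors does not stall at all: by the time the first component of the second operation issues, the first component of the preceding operation has already left the pipeline.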
Replacing vector instructions with a sequence of scalar instructions is common practice on CPUs (where there is no choice), but the encoding of the instruction's vector length with its opcode provides an additional optimization which is not burdensome within the context of graphics computing, and provides a surprising increase in hardware utilization.