1. Field of the Invention
The present invention generally relates to graphics vector processors and more particularly to a graphics processor with scalar arithmetic logic units (ALUs) capable of processing graphics vector data.
2. Description of the Prior Art
Graphics data can be represented in a vector format with components of geometry information (XYZW) or pixel value information (RGBA). Typically, the geometry engines used with these vectors process all of the components at once, leading to a complicated internal architecture and relatively high latency between data input and data output. The geometry engine is an important front-end part of any modern graphics accelerator, and the speed of its geometry data processing affects the overall efficiency of the graphics accelerator.
Recent graphics API developments require the support of particular instruction sets and define the hardware capabilities to process geometry and pixel value vectors. Because of these high performance requirements, current graphics engines are implemented as units that process all vector components in parallel with complicated input data and internal data crossbars. Furthermore, in order to meet these performance requirements, the graphics engines use multiple vector units in SIMD (Single Instruction, Multiple Data) or MIMD (Multiple Instruction, Multiple Data) architectures with additional hardware and time overhead. This leads to a VLIW (Very Long Instruction Word) architecture with complex control and synchronization units supporting multithreaded execution of programs.
Referring to FIG. 1, a data flow 10 for a prior art vector processing unit is shown. A graphics vector 12 having components Xi, Yi, Zi, and Wi is input into a buffer memory 14. Each graphics vector 12 is read sequentially from the buffer memory 14 into a vector ALU 16. The single vector ALU 16 operates on each component of the vector 12 at the same time in parallel. The vector ALU 16 includes a special function unit 18 for performing special operations. The internal structure of the ALU 16 is large and complicated in order to perform operations on all four components (i.e., Xi, Yi, Zi, and Wi) of the vector 12. Furthermore, the internal protocols and communication of the ALU 16 are complicated due to the parallel nature of the operations being performed. A final output vector 20 having components Xout, Yout, Zout, and Wout is generated by the vector ALU 16. The architecture of the prior art vector processing unit can be considered a parallel (full vector or horizontal) vector component flow because the components of each vector 12 are processed concurrently.
Referring to FIG. 2, a datapath representation for processing one set of data with the prior art vector processing unit is shown. In the example shown in FIG. 2, the function is:
vector Normalized_Difference (vector V1, vector V2)
V1 -> r0.xyzw
V2 -> r1.xyzw
(xyzw - components of graphics data)
The corresponding instructions for this function are:
SUB r2, r0, r1 // subtraction of all components
DP3 r3.x, r2, r2 // dot product of 3 components (x, y, z) with result in x-component
RSQ r3.x, r3.x // reciprocal square root of result in x-component
MUL r2, r2, r3.x // scaling all components with RSQ result
Referring to FIG. 2, the first instruction cycle (1) performs the subtraction between r0 and r1 and generates output vector r2 for each of the components x, y, z, and w. Next, in the second instruction cycle (2), the dot product of r2 with itself is performed, with the result placed only in the x component such that r3.x is generated. The reciprocal square root of r3.x is computed in the third instruction cycle (3). As seen in FIG. 2, during the third instruction cycle (3), only the x component is being operated upon. Next, in the fourth instruction cycle (4), all of the r2 components are scaled by the x component alone (i.e., r3.x) to generate the normalized vector difference r2. To process four sets of data, this sequence is repeated four times, for a total of sixteen instruction cycles.
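By way of illustration only, the four-instruction sequence described above may be modeled in software as follows. This is a minimal sketch rather than a description of any actual hardware: the Python helper names (sub, dp3, rsq, mul, normalized_difference, process_batch) are hypothetical, and the model assumes that each instruction occupies exactly one instruction cycle, as in the description of FIG. 2.

```python
import math

# Hypothetical software model of the prior art full-vector flow.
# Assumption: every instruction takes exactly one instruction cycle.

def sub(a, b):
    """SUB: component-wise subtraction of all four components."""
    return [x - y for x, y in zip(a, b)]

def dp3(a, b):
    """DP3: dot product of the first three components (x, y, z)."""
    return sum(x * y for x, y in zip(a[:3], b[:3]))

def rsq(s):
    """RSQ: reciprocal square root of a scalar (undefined for s == 0)."""
    return 1.0 / math.sqrt(s)

def mul(a, s):
    """MUL: scale all four components by a scalar."""
    return [x * s for x in a]

def normalized_difference(v1, v2):
    """Run SUB, DP3, RSQ, MUL on one pair of vectors; return (result, cycles)."""
    r0, r1 = list(v1), list(v2)
    cycles = 0
    r2 = sub(r0, r1)      # cycle 1: all components active
    cycles += 1
    r3_x = dp3(r2, r2)    # cycle 2: result only in the x component
    cycles += 1
    r3_x = rsq(r3_x)      # cycle 3: only the x component is operated upon
    cycles += 1
    r2 = mul(r2, r3_x)    # cycle 4: all components scaled by r3.x
    cycles += 1
    return r2, cycles

def process_batch(pairs):
    """Process each pair of vectors in turn, accumulating the cycle count."""
    results, total_cycles = [], 0
    for v1, v2 in pairs:
        out, cycles = normalized_difference(v1, v2)
        results.append(out)
        total_cycles += cycles
    return results, total_cycles

if __name__ == "__main__":
    pairs = [([4.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0])] * 4
    results, total_cycles = process_batch(pairs)
    print(results[0], total_cycles)
```

In this model, a single pair of vectors consumes four cycles and four pairs consume sixteen, matching the count given above. During the RSQ step only one of the four component lanes performs useful work, which is the inefficiency the full-vector arrangement exhibits.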
It can be seen that the prior art vector processing unit can be very complex due to the parallel processing of vector components. Accordingly, latency becomes an issue during processing. Furthermore, the prior art vector processing unit needs a large instruction format with multiple bits to control the vector component routing and processing, as well as a complex input data bus to support the required graphics API functionality. In addition, data dependency detection by hardware or software is required when using the prior art vector processing unit.
The present invention addresses the deficiencies in the above-mentioned prior art vector processing units by providing a vector processing unit that uses scalar ALUs. Accordingly, the present invention provides a SIMD scalar processing unit which is less complex and smaller in size than the prior art units. Furthermore, the present invention provides a system whereby the instruction set is simpler than the prior art vector processing unit and latency is greatly reduced.