FIG. 1 is a block diagram of a pipeline configuration of a conventional graphic processor unit. The conventional graphic processor unit 100 mainly includes a triangle setup unit 102, a pixel processing unit 104 and a depth processing unit 106. The pixel processing unit 104 has a pixel shader 108, a texture unit 110 and a color interpolator 112, with the texture unit 110 and the color interpolator 112 both connected to the pixel shader 108.
The surface of a three-dimensional (3D) object is divided into a plurality of triangles of arbitrary size that are arranged two-dimensionally according to their neighboring relationships. Each of the triangles has three vertices, which are forwarded to the triangle setup unit 102. The triangle setup unit 102 outputs the parameters of the pixels, such as the positions of the pixels in the triangles and the texture coordinates of the vertices of the corresponding triangles, to the pixel processing unit 104. In the pixel processing unit 104, based on the positions of the pixels and the texture coordinates of the vertices, the texture unit 110 interpolates the texture coordinates for all the pixels. The interpolated texture coordinates of the pixels are input to and processed in the pixel shader 108 (in DirectX terms, or the fragment processor in OpenGL terms). Next, the pixel shader 108 executes a texture load instruction to return the processed texture coordinates to the texture unit 110. Based on the unprocessed texture coordinates and the processed texture coordinates, the texture unit 110 samples the texture colors of the pixels in a texture map and outputs the texture colors to the pixel shader 108. Meanwhile, based on the positions of the pixels and texture coordinates of the vertices, the color interpolator 112 interpolates the vertex colors for all the pixels and outputs the vertex colors of the pixels to the pixel shader 108. The pixel shader 108 then processes the texture colors and the vertex colors of the pixels and outputs color values and depth values of the pixels to the depth processing unit 106, where the final pixel colors are obtained. The final pixel colors are then available for drawing the whole frame.
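The per-pixel interpolation of vertex attributes described above can be illustrated with a minimal sketch. It assumes plain barycentric weighting over the two-dimensional pixel position and ignores perspective correction; the text does not state which interpolation scheme the texture unit 110 or the color interpolator 112 actually uses, and the function names `barycentric` and `interpolate` are hypothetical.

```python
def barycentric(p, a, b, c):
    """Return the weights of point p relative to triangle (a, b, c).

    Assumption: plain 2D barycentric weights, no perspective correction.
    """
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    if area == 0:
        raise ValueError("degenerate triangle")
    wa = ((bx - px) * (cy - py) - (cx - px) * (by - py)) / area
    wb = ((cx - px) * (ay - py) - (ax - px) * (cy - py)) / area
    return wa, wb, 1.0 - wa - wb


def interpolate(weights, attr_a, attr_b, attr_c):
    """Blend one per-vertex attribute (texture coordinates or colors)."""
    wa, wb, wc = weights
    return tuple(wa * x + wb * y + wc * z
                 for x, y, z in zip(attr_a, attr_b, attr_c))


# Hypothetical triangle in screen space with texture coordinates per vertex.
weights = barycentric((1.0, 1.0), (0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
uv = interpolate(weights, (0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```

The same `interpolate` helper would serve for vertex colors, since both attributes are blended with the same per-pixel weights in this sketch.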
FIG. 2 is a block diagram of an example program in a pixel shader of the conventional graphic processor unit. The pixel shader 108 usually includes five kinds of registers: temporary registers rn for storing temporary data, texture coordinate registers tn, texture numbering registers sn, vertex color registers vn, and output registers ocn for transferring the final pixel colors to the depth processing unit 106.
The process of the pixel shader 108 normally has four stages: a coordinate calculation stage, a texture processing stage, a color blending stage and an issue out stage. The interpolated texture coordinates of the pixels from the texture unit 110 are stored in the texture coordinate registers tn. In the coordinate calculation stage, arithmetic on the interpolated texture coordinates of the pixels from the texture unit 110 is performed using the texture coordinate registers tn and the temporary registers rn; the arithmetic results, i.e. the processed texture coordinates, are stored in the temporary registers rn. In the texture processing stage, based on the texture coordinates in the registers tn and rn, the pixel shader 108 executes texture load instructions to request the texture unit 110 to sample the texture colors of the pixels in a texture map. The texture map is designated by the texture numbering registers sn. The sampled texture colors are transferred to the temporary registers rn. In the color blending stage, the pixel shader 108 blends the texture colors stored in the temporary registers rn with the vertex colors from the color interpolator 112, and the blending result is stored in the vertex color registers vn. In the issue out stage, the pixel shader 108 outputs the color and depth values of the pixels to the depth processing unit 106. It should be noted that the coordinate calculation stage, the texture processing stage and the color blending stage may each be repeated or omitted.
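The four stages can be traced for a single pixel with the following minimal sketch. It is an illustration under stated assumptions, not the actual shader program of FIG. 2: the `Registers` class, the `sample` nearest-texel lookup and the halving operation in the coordinate calculation stage are all hypothetical, and only the color output of the issue out stage is modeled (depth is omitted).

```python
from dataclasses import dataclass, field


@dataclass
class Registers:
    t: dict = field(default_factory=dict)   # texture coordinate registers tn
    r: dict = field(default_factory=dict)   # temporary registers rn
    s: dict = field(default_factory=dict)   # texture numbering registers sn
    v: dict = field(default_factory=dict)   # vertex color registers vn
    oc: dict = field(default_factory=dict)  # output registers ocn


def sample(texture_map, coord):
    """Hypothetical nearest-texel lookup standing in for texture unit 110."""
    height, width = len(texture_map), len(texture_map[0])
    u = min(max(coord[0], 0.0), 1.0)
    v = min(max(coord[1], 0.0), 1.0)
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture_map[y][x]


def run_pixel(regs, texture_maps):
    # Coordinate calculation stage: arithmetic on t0, result kept in r0.
    # Halving the coordinates is an arbitrary illustrative operation.
    regs.r[0] = tuple(0.5 * c for c in regs.t[0])
    # Texture processing stage: texture load with r0; s0 names the map.
    regs.r[1] = sample(texture_maps[regs.s[0]], regs.r[0])
    # Color blending stage: modulate the texture color by the vertex color.
    # Following the text above, the result is stored in v0.
    regs.v[0] = tuple(a * b for a, b in zip(regs.r[1], regs.v[0]))
    # Issue out stage: move the blended color to output register oc0.
    regs.oc[0] = regs.v[0]
    return regs.oc[0]


# Hypothetical 2x2 texture map and register contents for one pixel.
RED, WHITE = (1.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 1.0)
regs = Registers(t={0: (1.0, 1.0, 0.0, 1.0)}, s={0: 0},
                 v={0: (0.5, 0.25, 1.0, 1.0)})
final_color = run_pixel(regs, [[[RED, RED], [RED, WHITE]]])
```

Each stage here runs exactly once; repeating or omitting a stage, as the text allows, would correspond to adding or dropping the matching statements in `run_pixel`.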
Each of the registers is composed of four components, e.g. (x, y, z, w) or (r, g, b, a), which form a so-called four-wide vector in floating-point data format. In the coordinate calculation and texture processing stages, the four components (x, y, z, w) represent coordinates in a three-dimensional (3D) space or coordinates of different texture formats. In the color blending and issue out stages, the four components (r, g, b, a) represent the three primary colors of red, green and blue, and transparency. The components of the source and target registers are specified in the instructions that read or write them. For example, r0.w indicates that an instruction reads or writes component “w” of register “r0”.
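Component addressing such as r0.w can be sketched as follows. This is a minimal model assuming that a register is a tuple of four floating-point values and that (r, g, b, a) are aliases for (x, y, z, w); the helper names `read` and `write` are hypothetical.

```python
COMPONENTS = "xyzw"
# Assumption: color names alias the coordinate names position by position.
COLOR_ALIASES = {"r": "x", "g": "y", "b": "z", "a": "w"}


def _index(name):
    """Map a component letter to its position in the four-wide vector."""
    return COMPONENTS.index(COLOR_ALIASES.get(name, name))


def read(register, select="xyzw"):
    """Read the selected components of a source register."""
    return tuple(register[_index(c)] for c in select)


def write(register, mask, values):
    """Return a copy of the target register with only the masked
    components replaced; the other components are left unchanged."""
    out = list(register)
    for component, value in zip(mask, values):
        out[_index(component)] = value
    return tuple(out)


r0 = (1.0, 2.0, 3.0, 4.0)
w_only = read(r0, "w")             # reads component "w" of r0
r0_new = write(r0, "w", (9.0,))    # writes component "w" of r0
```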
Since the processing steps of the color components “r”, “g”, and “b” are considerably different from those of the transparency component “a”, two independent pipelines are needed to process these different kinds of components. When representing coordinates, “x”, “y” and “z” are likewise considerably different from the perspective component “w”. In the DirectX standard, the instructions for the two independent pipelines are paired and issued concurrently by placing a plus sign “+” before the second instruction of the pair; this is defined as instruction pairing, or co-issue, and has a component ratio of 3 to 1, as shown in FIG. 3A. However, the number of operator decoders, pipelines, register write ports and register read ports for the instructions is at least doubled. Moreover, it is necessary to provide additional complicated functions, such as component selection, format transformation, source modification, and instruction modification, in the pixel shader so that the instructions can process operands located in the source and target registers. As a result, the hardware cost of performing these functions increases substantially.
FIG. 3B is a ratio diagram showing a component ratio of 2 to 2 for the instructions in a conventional pixel shader program. Of these two independent instructions, one is used to write the color components “r” and “g”, and the other is used to write the color component “b” and the transparency component “a”. Although the probability of instruction pairing or co-issue is increased, this approach requires a more complicated architecture and a higher hardware cost for the pixel shader. The nVidia Corporation began to implement such complicated co-issue in its GeForce6 Series GPUs.
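The 3-to-1 and 2-to-2 pairings can be compared with a minimal sketch of a write-mask check. The rule encoded here is an assumption for illustration only: two instructions are treated as pairable when their write masks fall on opposite sides of an allowed component split. Real co-issue rules impose further conditions that are not modeled, and the names `can_co_issue`, `THREE_TO_ONE` and `TWO_TO_TWO` are hypothetical.

```python
# Allowed component splits (assumed encoding of FIG. 3A and FIG. 3B).
THREE_TO_ONE = (("rgb", "a"),)   # color pipeline plus transparency pipeline
TWO_TO_TWO = (("rg", "ba"),)     # two components per instruction


def can_co_issue(mask_a, mask_b, splits=THREE_TO_ONE):
    """Return True if the two write masks fit opposite halves of a split."""
    a, b = set(mask_a), set(mask_b)
    if not a or not b:
        return False
    for first, second in splits:
        f, s = set(first), set(second)
        if (a <= f and b <= s) or (a <= s and b <= f):
            return True
    return False


pair_3_1 = can_co_issue("rgb", "a")                      # fits 3-to-1
pair_2_2_rejected = can_co_issue("rg", "ba")             # not 3-to-1
pair_2_2 = can_co_issue("rg", "ba", splits=TWO_TO_TWO)   # fits 2-to-2
```

Passing both splits, e.g. `splits=THREE_TO_ONE + TWO_TO_TWO`, accepts more instruction pairs, which mirrors the increased pairing probability noted above at the price of a more complicated check.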
FIG. 4 shows a conventional pixel shader with a co-issue mechanism. The fetcher 400 reads out two instructions from the instruction queue 402 according to the program counter (PC). A pair of decoders (404a, 404b) decodes control signals from the respective fetched instructions to control the pipeline operation of the arithmetic logic units (ALUs) (406a, 406b). The pair of ALUs (406a, 406b) processes four vector components in parallel and consumes a pair of register ports (408a, 408b). Each of the register ports (408a, 408b) includes three register read ports and one register write port. Furthermore, it is necessary to use a source modifier and an instruction modifier for each register port to process the component selection and format transformation of the source and target operands in the instruction.
Therefore, the co-issue mechanism requires an additional check mechanism to determine when the co-issue rule is satisfied. Furthermore, since the source and target registers of the two co-issued instructions are different, the consumption of register read ports and register write ports is at least doubled. The number of source modifiers and instruction modifiers is also at least doubled.
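The doubling can be made concrete with a short resource count. The sketch below assumes the figures given for FIG. 4 (three register read ports and one register write port per instruction, with one decoder, one ALU, one source modifier and one instruction modifier per issued instruction); it gives only the lower bound implied by "at least doubled", and the `IssueCost` and `issue_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueCost:
    decoders: int
    alus: int
    read_ports: int
    write_ports: int
    source_modifiers: int
    instruction_modifiers: int


def issue_cost(width, reads_per_instr=3, writes_per_instr=1):
    """Lower-bound resource count for issuing `width` instructions at once."""
    return IssueCost(
        decoders=width,
        alus=width,
        read_ports=width * reads_per_instr,
        write_ports=width * writes_per_instr,
        source_modifiers=width,
        instruction_modifiers=width,
    )


single = issue_cost(1)    # one instruction per issue
co_issue = issue_cost(2)  # the paired arrangement of FIG. 4
```

The additional check mechanism for the co-issue rule is not counted here, so the actual overhead of co-issue would be higher than this estimate.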
Consequently, there is a need to develop a pixel processing system having an instruction folding mechanism for reducing the hardware cost and increasing the performance of the graphic processor unit.