1. Field of the Invention
The present invention relates to a register-collecting mechanism, a method for performing the register-collecting mechanism, and a pixel processing system employing the register-collecting mechanism, and more particularly to such a mechanism, method and pixel processing system used in a three-dimensional graphic processor unit (GPU).
2. Description of the Prior Art
Referring to FIG. 1, a conventional graphic processor unit 2 mainly comprises a triangle setup unit 23, a pixel processing unit 24 and a depth processing unit 25. The pixel processing unit 24 comprises a pixel shader 20, and a texture unit 241 and a color interpolator 242, both connected to the pixel shader 20.
The surface of a three-dimensional (3D) object is divided into a plurality of triangles of arbitrary size, arranged two-dimensionally according to their neighboring relationships. Each triangle has three vertices, which are forwarded to the triangle setup unit 23. The triangle setup unit 23 outputs the parameters of the pixels, such as the positions of the pixels in the triangles and the texture coordinates of the vertices of the corresponding triangles, to the pixel processing unit 24. In the pixel processing unit 24, the texture unit 241 interpolates the texture coordinates for all the pixels based on the positions of the pixels and the texture coordinates of the vertices. The interpolated texture coordinates of the pixels are input to and processed in the pixel shader 20 (the DirectX term; the equivalent unit is called a fragment processor in OpenGL). Next, the pixel shader 20 executes a texture load instruction, such as the DirectX texld instruction, to return the processed texture coordinates to the texture unit 241. Based on the unprocessed and processed texture coordinates, the texture unit 241 samples the texture colors of the pixels in a texture map and outputs the texture colors to the pixel shader 20. Meanwhile, the color interpolator 242 interpolates the vertex colors for all the pixels based on the positions of the pixels and the texture coordinates of the vertices, and outputs the vertex colors of the pixels to the pixel shader 20. The pixel shader 20 processes the texture colors and the vertex colors of the pixels and outputs the color values and depth values of the pixels to the depth processing unit 25, whereby the final pixel colors are obtained. The final pixel colors are then available for drawing the whole frame.
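The interpolation of per-vertex texture coordinates across a triangle can be illustrated with a minimal sketch. It assumes plain screen-space barycentric interpolation without perspective correction; the text does not say which interpolation scheme the texture unit 241 uses, and the function names below are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]
UV = Tuple[float, float]


def barycentric(p: Point, a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Barycentric weights of pixel position p with respect to triangle abc."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    # Twice the signed area of the whole triangle.
    area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    if area == 0:
        raise ValueError("degenerate triangle")
    # Each weight is the signed area of the sub-triangle opposite that vertex.
    w_a = ((bx - px) * (cy - py) - (cx - px) * (by - py)) / area
    w_b = ((cx - px) * (ay - py) - (ax - px) * (cy - py)) / area
    w_c = 1.0 - w_a - w_b
    return w_a, w_b, w_c


def interpolate_uv(p: Point, verts: Tuple[Point, Point, Point],
                   uvs: Tuple[UV, UV, UV]) -> UV:
    """Blend the three vertex texture coordinates with the pixel's weights."""
    weights = barycentric(p, *verts)
    u = sum(w * uv[0] for w, uv in zip(weights, uvs))
    v = sum(w * uv[1] for w, uv in zip(weights, uvs))
    return u, v


if __name__ == "__main__":
    tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
    tex = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
    print(interpolate_uv((1.0, 1.0), tri, tex))  # (0.25, 0.25)
```

Hardware usually divides by the interpolated 1/w to obtain perspective-correct coordinates; that step is left out here to keep the weight computation visible.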
Referring to FIG. 2, the pixel shader 20 usually comprises five kinds of registers: temporary registers rn for storing temporary data, texture coordinate registers tn, texture numbering registers sn, vertex color registers vn, and output registers ocn for transferring the final pixel colors to the depth processing unit 25.
The process of the pixel shader 20 normally comprises four stages: a coordinate calculation stage, a texture processing stage, a color blending stage and an issue out stage. The interpolated texture coordinates of the pixels from the texture unit 241 are stored in the texture coordinate registers tn. In the coordinate calculation stage, arithmetic operations are performed on these interpolated texture coordinates using the texture coordinate registers tn and the temporary registers rn, and the results, i.e., the processed texture coordinates, are stored in the temporary registers rn. In the texture processing stage, based on the texture coordinates in the registers tn and rn, the pixel shader 20 executes texture load instructions that request the texture unit 241 to sample the texture colors of the pixels in a texture map, which is designated by the texture numbering registers sn. The sampled texture colors are transferred to the temporary registers rn. In the color blending stage, the pixel shader 20 blends the texture colors stored in the temporary registers rn with the vertex colors from the color interpolator 242, and the blending result is stored in the vertex color registers vn. In the issue out stage, the pixel shader 20 outputs the color values and depth values of the pixels to the depth processing unit 25. It should be noted that each of the coordinate calculation stage, the texture processing stage and the color blending stage may be repeated or omitted.
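The four stages can be traced for a single pixel with a toy register model. This is a sketch under several assumptions: registers are held in a Python dict keyed by names such as t0, r0, s0, v0 and oc0; the coordinate arithmetic is a hypothetical constant offset; and the texture unit is replaced by a nearest-neighbour lookup. None of these specifics come from the description above.

```python
from typing import Dict, List, Tuple

Vec4 = Tuple[float, ...]          # (x, y, z, w) or (r, g, b, a)
Texture = List[List[Vec4]]        # texture[row][col] -> RGBA


def texld(texture: Texture, coord: Vec4) -> Vec4:
    """Stand-in for the texture unit: nearest-neighbour sample at (u, v), clamped."""
    rows, cols = len(texture), len(texture[0])
    col = min(cols - 1, max(0, int(coord[0] * cols)))
    row = min(rows - 1, max(0, int(coord[1] * rows)))
    return texture[row][col]


def run_pixel(regs: Dict[str, Vec4], samplers: Dict[str, Texture],
              uv_offset: Vec4, depth: float) -> Tuple[Vec4, float]:
    # Stage 1, coordinate calculation: processed coordinates go to r0.
    regs["r0"] = tuple(a + b for a, b in zip(regs["t0"], uv_offset))
    # Stage 2, texture processing: sample the map selected by s0 into r1.
    regs["r1"] = texld(samplers["s0"], regs["r0"])
    # Stage 3, color blending: modulate the texture color by the vertex color.
    # The description above stores the blend in vn, so this sketch writes v0.
    regs["v0"] = tuple(a * b for a, b in zip(regs["r1"], regs["v0"]))
    # Stage 4, issue out: move the result to the output register oc0.
    regs["oc0"] = regs["v0"]
    return regs["oc0"], depth


if __name__ == "__main__":
    tex: Texture = [[(1, 0, 0, 1), (0, 1, 0, 1)],
                    [(0, 0, 1, 1), (1, 1, 1, 1)]]
    regs: Dict[str, Vec4] = {"t0": (0.1, 0.1, 0.0, 0.0), "v0": (0.5, 0.5, 0.5, 1.0)}
    print(run_pixel(regs, {"s0": tex}, (0.5, 0.0, 0.0, 0.0), 0.75))
```

Every stage reads or writes the same per-pixel register dict, which is why a batch of N pixels in flight needs N such register sets.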
It is well known that when a second instruction has a data dependency on a first instruction, the two instructions cannot be executed during the same cycle. That is, when the second instruction uses the result of the first instruction, it can be executed only after the first instruction is completed. In a pixel shader program, the execution latency of the texture load instruction is extremely long because it involves several address transfers, memory accesses and color interpolations, and this long latency is the most critical performance problem. To hide it, N pixels can be executed as a batch through the pipeline. If N is equal to or larger than the latency multiplied by the pipeline throughput, the following dependent instruction can be executed without a stall. However, the full register sets of the N pixels in flight have to be temporarily stored in N register sets while they wait for the execution of subsequent instructions. Therefore, the pixel shader 20 in the conventional GPU 2 has to provide additional registers for temporarily storing the pixels in flight, and the cost of these additional registers is so high that N is never large enough to hide the long texture load latency.
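The batching condition N ≥ latency × throughput, and the register cost that comes with it, can be worked through numerically. The sketch below assumes a simple issue model in which one instruction is issued for all N pixels at a fixed rate, and the first pixel's texture result returns a fixed number of cycles after it was issued; the latency, register count and register width in the example are hypothetical figures, not values from the text.

```python
import math


def min_batch_size(latency_cycles: int, pixels_per_cycle: float) -> int:
    """Smallest N satisfying N >= latency * throughput."""
    return math.ceil(latency_cycles * pixels_per_cycle)


def stall_cycles(batch_size: int, latency_cycles: int, pixels_per_cycle: float) -> float:
    """Cycles the first dependent instruction waits after the batch is issued.

    Issuing one instruction for all N pixels takes N / throughput cycles,
    while the first pixel's texture result arrives latency_cycles after issue.
    """
    issue_time = batch_size / pixels_per_cycle
    return max(0.0, latency_cycles - issue_time)


def register_storage_bytes(batch_size: int, regs_per_pixel: int, bytes_per_reg: int) -> int:
    """Storage needed to keep a full register set for every pixel in flight."""
    return batch_size * regs_per_pixel * bytes_per_reg


if __name__ == "__main__":
    n = min_batch_size(200, 1.0)               # hypothetical 200-cycle texture load
    print(n)                                   # 200
    print(stall_cycles(100, 200, 1.0))         # 100.0, half-sized batch stalls
    print(register_storage_bytes(n, 32, 16))   # 102400 bytes for 32 x 128-bit regs
```

The last line shows the trade-off described above: hiding the latency fully scales the register storage linearly with N.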
To solve the above-mentioned problem, U.S. Pat. No. 5,652,774 discloses a central processing unit (CPU) having a rename register file comprising a plurality of rename registers to reduce the number of cycles required to execute instructions. The data processing method of the CPU includes a step of loading, in response to executing a first load instruction, data into the rename register file from a cache. The method further includes the steps of executing a second load instruction having a source register, and determining, during the execution of the second load instruction, that the requested data reside in a rename register of the rename register file. The method also includes the step of substituting the source register with the rename register containing the requested data. Rename register files have typically been used to allow conventional CPUs to execute instructions out of order, thereby reducing the cycle times of subsequent instructions. However, the CPU generally maintains the association between the rename register and the second instruction for a long period, so the rename register cannot be freed in time for a new instruction. As a result, the conventional CPU requires a substantially larger number of registers. Furthermore, reading data from and writing data into the cache also incurs a long latency.
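The load-then-substitute sequence can be modelled with a small class. This is an illustrative sketch only: the class and method names are hypothetical, the free list is an assumed implementation detail, and it is not the structure claimed in U.S. Pat. No. 5,652,774. It deliberately never releases a mapping, to show how long-lived associations exhaust the rename registers.

```python
from typing import Dict, List, Optional


class RenameFile:
    """Toy model of loading into rename registers and substituting source registers."""

    def __init__(self, num_rename: int) -> None:
        self.values: List[Optional[int]] = [None] * num_rename
        self.free: List[int] = list(range(num_rename))
        self.mapping: Dict[str, int] = {}  # architectural register -> rename register

    def load(self, dest: str, data: int) -> int:
        """First load instruction: place the loaded data in a free rename register."""
        if not self.free:
            raise RuntimeError("no free rename register; issue must stall")
        idx = self.free.pop(0)
        self.values[idx] = data
        self.mapping[dest] = idx
        return idx

    def substitute(self, source: str) -> Optional[int]:
        """Second instruction: find the rename register holding its source data, if any."""
        return self.mapping.get(source)

    def read(self, source: str, arch_regs: Dict[str, int]) -> int:
        """Read through the rename register when substituted, else the architectural one."""
        idx = self.substitute(source)
        if idx is not None:
            return self.values[idx]  # type: ignore[return-value]
        return arch_regs[source]


if __name__ == "__main__":
    rf = RenameFile(2)
    rf.load("r1", 42)
    print(rf.read("r1", {"r1": 0}))  # 42, served from the rename register
    rf.load("r2", 7)
    # A third load now raises, because neither association was ever released.
```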
To solve the above-mentioned problem, U.S. Pat. No. 6,314,511 discloses an improved processor with a rename register file comprising a plurality of rename registers. The processor employs an indicator to free a rename register from its association with an old instruction in a timely manner, so that the rename register becomes available to another instruction. However, the processor relies on complicated out-of-order register-renaming mechanisms. In other words, after instructions are fetched and decoded, register renaming is performed dynamically to map the registers to indices of re-order buffers, which exist only in out-of-order designs. The register-renaming mechanism of such an out-of-order processor is therefore more complicated than that required by an in-order processor.
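The effect of a release indicator can be sketched by attaching a flag to each read. Here the hypothetical `last_use` argument stands in for the indicator; how U.S. Pat. No. 6,314,511 actually encodes and evaluates its indicator, and its re-order buffer bookkeeping, are not modelled, so this only shows why early release lets a small rename file keep up.

```python
from typing import Dict, List


class IndicatorRenameFile:
    """Toy rename file whose registers are freed as soon as a flagged read occurs."""

    def __init__(self, num_rename: int) -> None:
        self.values: List[int] = [0] * num_rename
        self.free: List[int] = list(range(num_rename))
        self.mapping: Dict[str, int] = {}  # architectural register -> rename register

    def allocate(self, dest: str, data: int) -> int:
        """Bind dest to a free rename register holding data."""
        if not self.free:
            raise RuntimeError("no free rename register; issue must stall")
        idx = self.free.pop(0)
        self.values[idx] = data
        self.mapping[dest] = idx
        return idx

    def read(self, source: str, last_use: bool) -> int:
        """Read a renamed source; if the indicator is set, release the register now."""
        idx = self.mapping[source]
        value = self.values[idx]
        if last_use:
            # Indicator set: drop the old association so a new instruction can reuse it.
            del self.mapping[source]
            self.free.append(idx)
        return value


if __name__ == "__main__":
    rf = IndicatorRenameFile(1)
    rf.allocate("r1", 5)
    print(rf.read("r1", last_use=True))  # 5, and the only register is freed
    rf.allocate("r2", 9)                 # succeeds with a one-entry rename file
```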
Hence, an improved pixel processing system is desired to overcome the above-mentioned shortcomings.