In recent years, graphic LSIs for performing 3D computer graphics by hardware at a high speed have spread remarkably. In particular, in game systems and personal computers (PC), such graphic LSIs are often mounted as standard equipment.
Further, graphic LSI technology has been advancing rapidly. Functions such as the “Vertex Shader” and “Pixel Shader” employed in “DirectX” have continued to expand, and performance has been improving at a pace faster than that of CPUs.
In order to improve the performance of a graphic LSI, it is effective not only to raise the operating frequency of the LSI, but also to utilize the techniques of parallel processing. The techniques of parallel processing may be roughly classified as follows.
First is a parallel processing method by area division, second is a parallel processing method at a primitive level, and third is a parallel processing method at a pixel level.
The above classification is based on the granularity (particle size) of the parallel processing. The granularity of the area division parallel processing is the coarsest, while the granularity of the pixel level parallel processing is the finest. These techniques will be summarized below.
Parallel Processing by Area Division
This is a technique of dividing a screen into a plurality of rectangular areas and performing parallel processing by assigning each area to one of a plurality of processing units.
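As a minimal sketch of area division, the following Python fragment divides a screen into rectangular tiles and assigns each tile to a processing unit. The function names, the tile size, and the round-robin assignment policy are assumptions made for illustration and are not taken from the text.

```python
def assign_tiles(width, height, tile_w, tile_h, num_units):
    """Return {(tile_x, tile_y): unit_index} for every rectangular tile."""
    tiles_x = (width + tile_w - 1) // tile_w    # round up at the right edge
    tiles_y = (height + tile_h - 1) // tile_h   # round up at the bottom edge
    assignment = {}
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            # Assumed policy: hand out tiles round-robin in raster order.
            assignment[(tx, ty)] = (ty * tiles_x + tx) % num_units
    return assignment


def unit_for_pixel(x, y, tile_w, tile_h, assignment):
    """Look up which processing unit owns the tile containing pixel (x, y)."""
    return assignment[(x // tile_w, y // tile_h)]


# Hypothetical 640x480 screen, 160x120 tiles, four processing units.
tiles = assign_tiles(640, 480, 160, 120, num_units=4)
owner = unit_for_pixel(639, 479, 160, 120, tiles)
```

Before drawing, each object would have to be sorted into the tiles it overlaps, which is the scene analysis cost associated with this technique.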
Parallel Processing at Primitive Level
This is a technique of imparting different primitives (for example triangles) to a plurality of processing units and making them operate in parallel.
A view conceptually showing parallel processing at the primitive level is shown in FIG. 1.
In FIG. 1, PM0 to PMn−1 indicate different primitives, PU0 to PUn−1 indicate processing units, and MM0 to MMn−1 indicate memory modules.
When primitives PM0 to PMn−1 of roughly equal size are given to the processing units PU0 to PUn−1, the loads on the processing units PU0 to PUn−1 are balanced and efficient parallel processing can be carried out.
Parallel Processing at Pixel Level
This is the parallel processing technique with the finest granularity.
FIG. 2 is a view conceptually showing parallel processing at the primitive level based on the technique of parallel processing at the pixel level.
As shown in FIG. 2, in the technique of parallel processing at the pixel level, when rasterizing triangles, pixels are generated in units of rectangular areas referred to as “pixel stamps PS” comprised of pixels arrayed in a 2×8 matrix.
In the example of FIG. 2, a total of eight pixel stamps, from the pixel stamp PS0 to the pixel stamp PS7, are generated. The pixels included in each of these pixel stamps PS0 to PS7, a maximum of 16 per stamp, are processed simultaneously.
This technique is more efficient in parallel processing than the other techniques, owing to its finer granularity.
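The grouping of rasterized pixels into pixel stamps can be sketched as follows. This is only an illustration: the stamp orientation (8 pixels wide by 2 pixels high), the edge-function coverage test at pixel centers, and all names are assumptions, since the text specifies only that a stamp is a 2×8 matrix of pixels.

```python
from collections import defaultdict

STAMP_W, STAMP_H = 8, 2  # assumed orientation of the 2x8 pixel stamp


def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def covered_pixels(tri, width, height):
    """Return pixels whose centers lie inside the triangle (either winding)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    pixels = []
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            inside_ccw = w0 >= 0 and w1 >= 0 and w2 >= 0
            inside_cw = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside_ccw or inside_cw:
                pixels.append((x, y))
    return pixels


def group_into_stamps(pixels):
    """Bucket pixels by the stamp-aligned rectangle that contains them."""
    stamps = defaultdict(list)
    for x, y in pixels:
        stamps[(x // STAMP_W, y // STAMP_H)].append((x, y))
    return dict(stamps)


# Hypothetical triangle on a 16x4 pixel grid.
pixels = covered_pixels(((0, 0), (16, 0), (0, 4)), 16, 4)
stamps = group_into_stamps(pixels)
```

Each entry of `stamps` holds at most 2×8 = 16 pixels, which is the unit of work that would be handed to the parallel pixel processors at once; stamps along the triangle edges are only partly filled.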
In the case of parallel processing by area division explained above, however, in order to make the processing units operate in parallel efficiently, it is necessary to classify objects to be drawn in the areas in advance, so the load of the scene data analysis is heavy.
Further, when drawing is not deferred until one frame's worth of scene data is all present, but is instead performed in the so-called immediate mode, in which drawing starts as soon as object data is given, parallelism cannot be achieved.
Further, in the case of parallel processing at the primitive level, in actuality, the sizes of the primitives PM0 to PMn−1 composing an object vary, so the time required to process one primitive differs among the processing units PU0 to PUn−1. When this difference becomes large, the areas in which the processing units draw also differ greatly and the locality of the data is lost, so that, for example, the DRAM comprising the memory modules frequently incurs page misses and the performance is lowered.
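The effect of primitive size on load balance can be illustrated with a small sketch. It assumes, purely for illustration, that the processing cost of a triangle is proportional to its screen area and that primitives are handed to the processing units in round-robin order; neither assumption comes from the text.

```python
def triangle_area(tri):
    """Screen-space area of a triangle given as three (x, y) vertices."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2


def unit_loads(triangles, num_units):
    """Sum the assumed cost (area) of the primitives given to each unit."""
    loads = [0.0] * num_units
    for i, tri in enumerate(triangles):
        loads[i % num_units] += triangle_area(tri)  # round-robin hand-out
    return loads


# Four primitives of equal size: every unit receives the same load.
equal = [((0, 0), (8, 0), (0, 8))] * 4
balanced = unit_loads(equal, 4)

# Four primitives of very different sizes: one unit dominates.
mixed = [
    ((0, 0), (8, 0), (0, 8)),     # area 32
    ((0, 0), (2, 0), (0, 2)),     # area 2
    ((0, 0), (64, 0), (0, 64)),   # area 2048
    ((0, 0), (4, 0), (0, 4)),     # area 8
]
skewed = unit_loads(mixed, 4)
imbalance = max(skewed) / min(skewed)
```

In the second case the unit holding the large triangle is still drawing long after the others have moved on to primitives elsewhere on the screen, which is the loss of locality described above.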
Further, in the case of this technique, there is also the problem of a high interconnect cost. In general, in hardware for graphics processing, in order to broaden the bandwidth of the memory, a plurality of memory modules is used for memory interleaving.
At this time, as shown in FIG. 1, it is necessary to connect all processing units PU0 to PUn−1 and the built-in memory modules MM0 to MMn−1.
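Memory interleaving and the resulting all-to-all connection can be sketched as follows. The 32-byte interleave block and the function names are hypothetical; the text states only that a plurality of memory modules is interleaved and that every processing unit must reach every module.

```python
def module_for_address(addr, num_modules, block_bytes=32):
    """Map a byte address to a memory module (block size is an assumption)."""
    return (addr // block_bytes) % num_modules


def full_connection_links(num_units, num_modules):
    """Number of unit-to-module paths when every unit reaches every module."""
    return num_units * num_modules


# Consecutive 32-byte blocks rotate through four modules.
modules = [module_for_address(a, 4) for a in range(0, 256, 32)]
links = full_connection_links(4, 4)
```

Because consecutive blocks land in different modules, any processing unit may need any module at any time, so the number of paths grows with the product of the two counts.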
On the other hand, in the case of the parallel processing at the pixel level, as described above, there is the advantage that the efficiency of the parallel processing improves owing to the finer granularity. The actual processing, including filtering, is performed by the routine shown in FIG. 3.
That is, the DDA (digital differential analyzer) parameters required for rasterization, for example, the inclinations of the various types of data (Z, texture coordinates, colors, etc.), are calculated (ST1).
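The calculation of an inclination and its use in a DDA can be sketched as follows. This is a generic plane-equation setup, not necessarily the method of the apparatus described here; the function names and the vertex layout `(x, y, a)` are assumptions for illustration.

```python
def attribute_gradients(v0, v1, v2):
    """Return (da/dx, da/dy) for an attribute a given at three (x, y, a) vertices."""
    x0, y0, a0 = v0
    x1, y1, a1 = v1
    x2, y2, a2 = v2
    dx1, dy1, da1 = x1 - x0, y1 - y0, a1 - a0
    dx2, dy2, da2 = x2 - x0, y2 - y0, a2 - a0
    det = dx1 * dy2 - dx2 * dy1          # twice the signed triangle area
    if det == 0:
        raise ValueError("degenerate triangle")
    dadx = (da1 * dy2 - da2 * dy1) / det
    dady = (dx1 * da2 - dx2 * da1) / det
    return dadx, dady


def step_scanline(start_value, dadx, count):
    """DDA: produce successive values by repeated addition of the x slope."""
    values = []
    value = start_value
    for _ in range(count):
        values.append(value)
        value += dadx
    return values


# Hypothetical triangle with Z = 0, 8 and 4 at its three vertices.
dzdx, dzdy = attribute_gradients((0, 0, 0.0), (4, 0, 8.0), (0, 4, 4.0))
row = step_scanline(0.0 + dzdy * 1, dzdx, 3)   # Z at x = 0, 1, 2 on row y = 1
```

The same setup would be repeated for each interpolated quantity (texture coordinates, colors, and so on), after which rasterization needs only additions per pixel.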
Next, texture data is read out from a memory (ST2), sub-word reallocation is performed (ST3), then the data is globally distributed to the processing units by a crossbar circuit (ST4).
Next, texture filtering is performed (ST5). In this case, the processing units PU0 to PU3 perform four-neighbor interpolation or other filtering by using the read texture data and the fractional part obtained when calculating a (u, v) address.
Next, processing at the pixel level (per-pixel operation) is performed. Specifically, the texture data after the filtering and the various types of data after rasterization are used for operations in units of pixels (ST6).
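Four-neighbor interpolation can be sketched as ordinary bilinear filtering, in which the integer part of the (u, v) address selects the four texels and the fractional part supplies the blend weights. The texture layout (a list of rows of scalar texels), the clamping at the border, and the function name are assumptions for illustration.

```python
import math


def bilinear_sample(texture, u, v):
    """Blend the four texels around (u, v), given in texel coordinates."""
    height, width = len(texture), len(texture[0])
    i, j = math.floor(u), math.floor(v)   # integer part: texel address
    fu, fv = u - i, v - j                 # fractional part: blend weights

    def texel(x, y):
        # Assumed border handling: clamp to the texture edge.
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        return texture[y][x]

    top = (1 - fu) * texel(i, j) + fu * texel(i + 1, j)
    bottom = (1 - fu) * texel(i, j + 1) + fu * texel(i + 1, j + 1)
    return (1 - fv) * top + fv * bottom


# Hypothetical 2x2 single-channel texture.
texture = [[0.0, 10.0],
           [20.0, 30.0]]
center = bilinear_sample(texture, 0.5, 0.5)
```

Note that four texels are consumed to produce one filtered value, so the point in the pipeline at which filtering occurs determines how much data has to be moved beforehand.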
Further, pixel data that passes the various types of tests in the processing at the pixel level is drawn in a frame buffer and a Z-buffer on the memory modules MM0 to MM3.
Memory access of the texture read system differs from memory access of the graphics generation system, and therefore it is necessary to read data from a memory belonging to another module.
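One of the per-pixel tests, the depth test, and the subsequent write to the frame buffer and Z-buffer can be sketched as follows. The "nearer wins" comparison, the buffer layout as nested lists, and the names are assumptions for illustration; the text does not specify which tests are applied.

```python
def make_buffers(width, height, clear_color=0, far_z=float("inf")):
    """Create a frame buffer and a Z-buffer cleared to the far plane."""
    frame = [[clear_color] * width for _ in range(height)]
    zbuf = [[far_z] * width for _ in range(height)]
    return frame, zbuf


def draw_pixel(frame, zbuf, x, y, color, z):
    """Write the pixel only if it is nearer than what is already stored."""
    if z >= zbuf[y][x]:      # assumed test: smaller Z means nearer
        return False         # test failed, nothing is drawn
    zbuf[y][x] = z
    frame[y][x] = color
    return True


frame, zbuf = make_buffers(4, 4)
first = draw_pixel(frame, zbuf, 1, 2, color=0xFF0000, z=0.5)   # passes
second = draw_pixel(frame, zbuf, 1, 2, color=0x00FF00, z=0.8)  # hidden behind
```

Both buffers are read and written at the pixel's own screen position, which is why this access pattern can stay local to one memory module.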
Therefore, for memory access of the texture read system, an interconnect such as a crossbar circuit as described above is necessary.
However, a related image processing apparatus, as described above, globally distributes data to the processing units, then performs texture filtering, so there are the disadvantages that the amount of data globally distributed is large (for example, 4 Tbps), the crossbar circuit serving as the global bus becomes large in size, and an increase in the speed of processing is obstructed from the viewpoint of interconnect delay etc.