In recent years, graphic LSIs for executing 3D computer graphics by hardware at a high speed have spread remarkably. In particular, in game systems and personal computers (PC), such graphic LSIs are often mounted as standard equipment.
Further, graphic LSI technology has been advancing rapidly. Functions such as the “Vertex Shader” and “Pixel Shader” employed in “DirectX” have continued to expand, and performance has improved at a pace faster than that of CPUs.
In order to improve the performance of a graphic LSI, it is effective not only to raise the operating frequency of the LSI, but also to utilize the techniques of parallel processing. The techniques of parallel processing may be roughly classified as follows.
First is a parallel processing method by area division, second is a parallel processing method at a primitive level, and third is a parallel processing method at a pixel level.
The above classification is based on the granularity of the parallel processing. Parallel processing by area division is the coarsest-grained, while parallel processing at the pixel level is the finest-grained. These techniques are summarized below.
Parallel Processing by Area Division
This technique divides a screen into a plurality of rectangular areas and performs parallel processing by assigning each of a plurality of processing units the areas it is in charge of.
Parallel Processing at Primitive Level
This technique assigns different primitives (for example, triangles) to a plurality of processing units and makes them operate in parallel.
A view conceptually showing parallel processing at the primitive level is shown in FIG. 1.
In FIG. 1, PM0 to PMn−1 indicate different primitives, PU0 to PUn−1 indicate processing units, and MM0 to MMn−1 indicate memory modules.
When primitives PM0 to PMn−1 of roughly equal size are given to the processing units PU0 to PUn−1, the loads on the processing units PU0 to PUn−1 are balanced and efficient parallel processing can be carried out.
Parallel Processing at Pixel Level
This is the parallel processing technique with the finest granularity.
FIG. 2 is a view conceptually showing parallel processing at the primitive level based on the technique of parallel processing at the pixel level.
As shown in FIG. 2, in the technique of parallel processing at the pixel level, when rasterizing triangles, pixels are generated in units of rectangular areas referred to as “pixel stamps PS” comprised of pixels arrayed in a 2×8 matrix.
In the example of FIG. 2, a total of eight pixel stamps from the pixel stamp PS0 to the pixel stamp PS7 are generated. A maximum of 16 pixels included in these pixel stamps PS0 to PS7 are simultaneously processed.
Because of its finer granularity, this technique achieves better parallel processing efficiency than the other techniques.
In the case of parallel processing by area division explained above, however, in order to make the processing units operate in parallel efficiently, it is necessary to classify objects to be drawn in the areas in advance, so the load of the scene data analysis is heavy.
Further, when drawing in the so-called immediate mode, in which drawing starts as soon as object data is given rather than after one frame's worth of scene data is all present, parallelism cannot be achieved.
Further, in the case of parallel processing at the primitive level, in actuality, the sizes of the primitives PM0 to PMn−1 composing an object vary, so the time taken to process one primitive differs among the processing units PU0 to PUn−1. When this difference becomes large, the areas in which the processing units draw also diverge widely and the locality of the data is lost, so, for example, the DRAM comprising the memory modules frequently incurs page misses and the performance is lowered.
Further, in the case of this technique, there is also the problem of a high interconnect cost. In general, in hardware for graphics processing, a plurality of memory modules is used for memory interleaving in order to broaden the memory bandwidth.
At this time, as shown in FIG. 1, it is necessary to connect all processing units PU0 to PUn−1 and the built-in memory modules MM0 to MMn−1.
Further, in the case of parallel processing at the pixel level, the primitives processed in graphics tend to become smaller, so even if the pixel stamp is made larger, the only result is that the number of invalid pixels increases. The execution efficiency does not rise.
Accordingly, there is a limit to the number of pixels which can be parallel processed by this technique.
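The effect of stamp size on execution efficiency can be illustrated with a minimal sketch. The function names, the edge-function coverage test sampled at pixel centers, and the example triangle are all assumptions made for illustration; they are not taken from the apparatus described here.

```python
from math import floor, ceil

def edge(ax, ay, bx, by, px, py):
    """Signed-area test: which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def covered(tri, x, y):
    """True if the center of pixel (x, y) lies inside the triangle (assumed rule)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    px, py = x + 0.5, y + 0.5
    w0 = edge(x0, y0, x1, y1, px, py)
    w1 = edge(x1, y1, x2, y2, px, py)
    w2 = edge(x2, y2, x0, y0, px, py)
    # Accept either winding order.
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)

def stamp_efficiency(tri, stamp_w, stamp_h):
    """Return (valid_pixels, stamps_touched, efficiency) for one stamp size.

    Efficiency is valid pixels divided by the pixel slots of all stamps
    that contain at least one valid pixel.
    """
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    valid = 0
    stamps = set()
    for y in range(floor(min(ys)), ceil(max(ys))):
        for x in range(floor(min(xs)), ceil(max(xs))):
            if covered(tri, x, y):
                valid += 1
                stamps.add((x // stamp_w, y // stamp_h))
    slots = len(stamps) * stamp_w * stamp_h
    return valid, len(stamps), (valid / slots if slots else 0.0)

if __name__ == "__main__":
    small_triangle = [(1, 1), (6, 2), (3, 5)]  # hypothetical small primitive
    for size in (1, 2, 4, 8):
        print(size, stamp_efficiency(small_triangle, size, size))
```

For a small primitive, the fraction of useful pixel slots can only stay the same or fall as the stamp edge doubles, which is the limit on pixel-level parallelism noted above.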
Therefore, it can be considered to employ the following parallel processing technique based on the characteristics of the above parallel processing techniques.
Namely, from the viewpoint of the granularity of the parallel processing, parallel processing at the pixel level is employed.
However, since there is a limit to the degree of parallelism in parallel processing at the pixel level, a plurality of primitives are simultaneously processed.
Further, the processing is distributed to enable the processing units and the memory modules to be directly connected so as to lower the interconnect cost.
FIG. 3 is a view conceptually showing a configuration of the above image processing apparatus employing parallel processing at the pixel level, simultaneously processing a plurality of primitives, and distributing processing to enable the processing units and the memory modules to be directly connected.
An image processing apparatus 1 has, as shown in FIG. 3, a streaming data controller (SDC) 2 and n number of modules 3-0 to 3-n−1 connected in parallel with respect to the SDC 2.
The modules 3-0 to 3-n−1 include processing units PU (0 to n−1) and memory modules MM (0 to n−1).
In FIG. 3, it is assumed that the memory modules MM0 to MMn−1 are interleaved.
Further, the SDC 2 controls transfer of data with the CPU and an external memory and generates parameters required for the rasterization in the processing units PU0 to PUn−1. These parameters are supplied to all processing units PU0 to PUn−1.
The processing units PU0 to PUn−1 have corresponding memory interleaves determined in advance and generate only pixels which they are in charge of.
FIG. 4 is a view for explaining pixel generation in the image processing apparatus of FIG. 3.
In the example of FIG. 4, the memory is interleaved in units of 4×4 rectangular areas. These rectangular areas will be referred to as “Lstamps (Large stamps)”. Here, the number of modules configured by the processing units PU and the memory modules MM shown in FIG. 3 is assumed to be four (3-0 to 3-3). Further, the numerals of the modules and the numerals given to the rectangular areas in FIG. 4 correspond to each other.
Accordingly, in the case of the triangle shown in FIG. 4, the module 3-0 is in charge of two Lstamps, while the modules 3-1 to 3-3 are respectively in charge of three Lstamps.
As will also be understood from FIG. 4, an Lstamp includes many invalid pixels. Therefore, in the image processing apparatus 1 of FIG. 3, it is also possible to further divide the Lstamp into four 2×2 rectangular areas (Small stamps: Sstamps) and process only the Sstamps including valid pixels at a later stage.
The technique for achieving parallelism in FIG. 3 distributes processing so that the processing units and the memory modules can be directly connected in the drawing system of graphics processing, and can thereby reduce the interconnect cost.
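The assignment of Lstamps to modules and the later filtering down to Sstamps can be sketched as follows. The 2×2 repeating numbering pattern in `owner_module` is an assumption: the actual numbering of FIG. 4 is not reproduced in the text, and all names here are hypothetical.

```python
LSTAMP = 4     # Lstamp edge length in pixels (4x4, as in FIG. 4)
SSTAMP = 2     # Sstamp edge length in pixels (2x2)
N_MODULES = 4  # modules 3-0 to 3-3

def owner_module(x, y):
    """Module in charge of pixel (x, y).

    ASSUMPTION: Lstamps are numbered in a repeating 2x2 pattern
    (0 1 / 2 3); the real interleave pattern may differ.
    """
    lx, ly = x // LSTAMP, y // LSTAMP
    return (lx % 2) + 2 * (ly % 2)

def lstamps_per_module(valid_pixels):
    """Group the Lstamps touched by valid pixels by the owning module."""
    owned = {m: set() for m in range(N_MODULES)}
    for x, y in valid_pixels:
        owned[owner_module(x, y)].add((x // LSTAMP, y // LSTAMP))
    return owned

def valid_sstamps(valid_pixels):
    """Sstamps (2x2 areas) that contain at least one valid pixel."""
    return {(x // SSTAMP, y // SSTAMP) for x, y in valid_pixels}

if __name__ == "__main__":
    pixels = [(0, 0), (5, 0), (0, 5), (5, 5), (9, 1)]  # hypothetical valid pixels
    print(lstamps_per_module(pixels))
    print(sorted(valid_sstamps(pixels)))
```

Each module only ever touches the Lstamps in its own set, which is what allows a processing unit to be wired directly to its own memory module for drawing.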
However, the memory access of the texture read system is different from the memory access of the drawing system, so it becomes necessary to read from a memory belonging to another module.
Accordingly, an interconnect such as some sort of a crossbar is needed for memory access of the texture read system, but when realizing this, the following problems of overlapping access and scalability must be solved.
Overlapping Access
In general, when projecting rectangular areas on a screen into a texture space, they occupy areas having different shapes and sizes.
FIG. 5A and FIG. 5B are views conceptually showing a situation of projecting rectangular areas on a screen into texture space.
Here, it is assumed that the texture data is equally stored in the interleaved memory in the same way as the data of the frame buffer and the z-buffer.
Namely, 4×4 rectangular areas are stored in the memories of the modules corresponding to the attached numerals. The figures show a situation wherein four 4×4 rectangular areas adjacent in the screen space shown in FIG. 5A are projected into the texture space shown in FIG. 5B. For example, the area described as the “Area for Module 3-0” in FIG. 5B indicates the texture data mapped with respect to 4×4 rectangular areas on the screen space in the charge of the module 3-0.
In this case, as shown in FIG. 5A and FIG. 5B, the texture data stored in a certain module is accessed by a plurality of other modules. In addition, the timings of these accesses differ.
Accordingly, when the unit of transfer among modules is a 4×4 texel unit, the data indicated by the hatching in the texture space of FIG. 5B will be transferred four times.
Transferring the same data a plurality of times in this way wastes memory bandwidth, which is constrained by electrical and physical conditions, and invites a drop in performance.
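The repeated transfer can be made concrete with a small counting sketch. It assumes, purely for simplicity, that each module's texture footprint is an axis-aligned texel rectangle (the projected areas in general have other shapes), and the function names and footprint coordinates are hypothetical.

```python
from collections import Counter

TILE = 4  # unit of transfer among modules: 4x4 texels

def tiles_for_rect(u0, v0, u1, v1):
    """Tiles overlapped by the half-open texel rectangle [u0, u1) x [v0, v1)."""
    return {(tu, tv)
            for tu in range(u0 // TILE, (u1 - 1) // TILE + 1)
            for tv in range(v0 // TILE, (v1 - 1) // TILE + 1)}

def transfer_counts(footprints):
    """Count how many modules request each 4x4 texel tile."""
    counts = Counter()
    for rect in footprints.values():
        counts.update(tiles_for_rect(*rect))
    return counts

if __name__ == "__main__":
    # Four footprints whose shared corner falls inside tile (1, 1).
    footprints = {
        "3-0": (2, 2, 6, 6),
        "3-1": (6, 2, 10, 6),
        "3-2": (2, 6, 6, 10),
        "3-3": (6, 6, 10, 10),
    }
    counts = transfer_counts(footprints)
    print("tile (1, 1) requested", counts[(1, 1)], "times")
    print("total transfers:", sum(counts.values()), "unique tiles:", len(counts))
```

In this example the tile containing the shared corner is requested by all four modules, and 16 tile transfers are issued to fetch only 9 distinct tiles.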
Scalability
FIG. 6 is a view of an example wherein a crossbar which becomes necessary for memory access of a texture read system is realized by the most direct form in a system for parallel processing by a plurality of processing units.
In FIG. 6, a global access bus indicated by “GAB” corresponds to the crossbar for the texture access.
When the number of bits of data read out from one memory module is M bits and the number of parallel processed modules is n, the number of data interconnections required for the texture access is M×n.
When realizing the system shown in FIG. 6 by an LSI, not only are the above (M×n) number of data lines laid in a lateral direction across all modules, but also interconnects in the vertical direction become necessary for connection with the processing units.
A crossbar of this form has disadvantages for realizing scalability in terms of the processing unit interfaces, interconnect costs, and interconnect delay.
Processing Unit Interfaces
As indicated by the broken line ovals in FIG. 6, the processing units PU0 to PUn−1 in the modules 3-0 to 3-n−1 require interfaces for receiving the data read out from n number of memory modules MM0 to MMn−1 and selecting required data.
Such interfaces depend upon the number of integrated modules, so must be re-designed whenever the degree of parallelism is changed.
Interconnect Costs
Here, the minimum required number of interconnects of the global access bus GAB is calculated for an ideal state in which 100% of the read texture data is effectively utilized, without considering the above overlapping access or the increase in bandwidth caused by filtering.
Now, it is assumed that each processing unit processes 4 pixels per cycle and can simultaneously access four textures.
At this time, even in a state where only one texel is required per pixel on average, a bandwidth of 4×4×32=512 bits is required per module. Accordingly, the interconnect width WB of the global access bus GAB in the case of a system integrating n number of modules is found as follows:
WB = (512 × n × WP) / NL
Here, WP is the interconnect pitch, and NL is the number of layers used for the interconnect. For example, as a reference value in an assumed 0.07 μm process, when WP = 0.9 μm and NL = 2 are applied, the interconnect width when changing the number of modules becomes as follows:
4 modules: 0.9 mm,
8 modules: 1.8 mm, and
16 modules: 3.6 mm.
Assuming that the chip size is 10 mm square, in the case of 8 modules, roughly 20% of the chip size is taken up by the lateral interconnects of the global access bus GAB alone.
In addition, these values assume the minimum required interconnects in the ideal state. In practice, at least double that number is needed for adequate performance.
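The arithmetic behind these interconnect widths can be checked with a short sketch. The function and parameter names are hypothetical; the default values are the ones quoted in the text (4 pixels per cycle, 4 textures, 32 bits per texel, a 0.9 μm pitch, and 2 layers).

```python
def bits_per_module(pixels_per_cycle=4, textures=4, texel_bits=32):
    """Texture read bandwidth per module in bits per cycle (ideal case)."""
    return pixels_per_cycle * textures * texel_bits

def gab_width_mm(n_modules, pitch_um=0.9, layers=2):
    """Bus width: bits x n_modules x pitch / layers, converted from um to mm."""
    width_um = bits_per_module() * n_modules * pitch_um / layers
    return width_um / 1000.0

if __name__ == "__main__":
    for n in (4, 8, 16):
        print(f"{n:2d} modules: {gab_width_mm(n):.2f} mm")
```

With these inputs the widths come out at approximately 0.92 mm, 1.84 mm, and 3.69 mm, in line with the rounded figures listed above. Doubling the interconnect count, as practical performance is said to require, doubles each width.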
Interconnect Delay
When the process is miniaturized and the operating frequency rises, the delay due to the interconnects becomes dominant. For example, in the case of 1 GHz operation, the length of interconnect that can be laid is said to be about 1 mm to 2 mm.
Accordingly, in the case of the global access bus GAB shown in FIG. 6, sometimes the length from one end of the bus to the other end exceeds 10 mm. As a result, a pipeline register is required for the interconnects and the read latency increases.
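A rough feel for the added latency can be had from a deliberately simple model: the bus is cut into segments no longer than the distance a signal can travel in one cycle, with one pipeline register between adjacent segments. This model and the function name are assumptions for illustration, not a statement of how the bus would actually be pipelined.

```python
import math

def pipeline_registers(bus_length_mm, reach_mm):
    """Registers needed so that no wire segment exceeds reach_mm (ASSUMED model)."""
    segments = math.ceil(bus_length_mm / reach_mm)
    return max(segments - 1, 0)

if __name__ == "__main__":
    for reach in (1, 2):
        print(f"reach {reach} mm: {pipeline_registers(10, reach)} registers on a 10 mm bus")
```

Under this model, a 10 mm bus needs between 4 and 9 registers for a reach of 2 mm down to 1 mm, and each register adds a cycle of read latency on the longest path.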
In addition, since the interconnect length differs according to the combination of processing unit and memory module, there is also the problem that the timing must be optimized for each combination, which makes design difficult.