In digital signal processing, a processor such as a DSP, which operates through programming, must be able to handle various algorithms. In recent years the processing volume has increased significantly for tasks such as graphics processing. For that reason, devices with a plurality of processors, such as GPUs (graphics processing units), have come to replace DSPs.
A prior-art example will be described using a GPU. FIG. 2 shows an implementation of a GPU as disclosed in Japanese Unexamined Patent Application Publication No. 2009-37593. This GPU consists of an external memory unit (EMU) 201, external memory 202, a vector processing engine (VPE) 205, a vector control unit (VCU) 206, and a vector processing unit (VPU) 207.
The vector processing unit (VPU) 207 consists of a plurality of computing units which form the core of a multiprocessor. A higher-level control unit, the vector control unit (VCU) 206, forms a set with the VPU to make up the vector processing engine (VPE) 205. A plurality of such vector processing engines (VPE) 205 are provided and are crossbar-connected through the external memory unit (EMU) 201, which is mutually accessible by all the engines. They are further connected to the external memory 202 so that memory access is possible.
Data, as well as commands, which are the units of a program, are stored in the L1 cache of the vector processing unit (VPU) 207 (the lowest-level cache, or temporary storage unit) or the L2 cache of the vector processing engine (VPE) 205 (the upper-level cache, or temporary storage unit). Memory access thus follows a tiered flow through this hierarchy.
The more vector processing engines (VPE) 205 are implemented, the more performance rises. However, the vector processing engine (VPE) 205 is basically of the SIMD (single instruction, multiple data) type, executing the same instruction at the same time, so as the number of implemented engines increases, memory accesses become concentrated at the same moment, and performance deteriorates due to the bandwidth restrictions of the external memory unit (EMU) 201 or the external memory 202. Therefore it is essential to limit the number of vector processing units (VPU) 207 implemented per engine and instead increase the number of vector processing engines (VPE) 205. By having the vector processing engines (VPE) 205 execute different programs, or by staggering the time frames in which they execute a program, the problem of accesses concentrating at the same time can be avoided.
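The effect of staggering the engines' execution can be sketched as follows. This is an illustrative model only, not part of the disclosed implementation: the bandwidth, engine count, and request period are assumed values chosen to make the contrast visible.

```python
# Illustrative sketch (assumed parameters, not from the disclosure): model how
# staggering SIMD engines' start times spreads out their memory accesses.

NUM_ENGINES = 8        # number of vector processing engines (assumed)
PERIOD = 4             # each engine issues one memory request every PERIOD cycles

def peak_requests(offsets, cycles=16):
    """Return the worst-case number of simultaneous requests in any cycle."""
    per_cycle = [0] * cycles
    for off in offsets:
        for t in range(off, cycles, PERIOD):
            per_cycle[t] += 1
    return max(per_cycle)

# All engines start together: their accesses pile up in the same cycles.
aligned = peak_requests([0] * NUM_ENGINES)
# Start times staggered across the period: the accesses are spread out.
staggered = peak_requests([i % PERIOD for i in range(NUM_ENGINES)])

print(aligned, staggered)   # → 8 2
```

With aligned start times all eight engines hit the external memory in the same cycle; staggering reduces the peak to two simultaneous requests, which is far easier for a bandwidth-limited memory path to absorb.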
On the other hand, each vector processing engine (VPE) 205 is coupled through the external memory unit (EMU) 201 and requires a mechanism to exchange data efficiently. Data exchange is required because a large program is broken up into smaller ones (to increase the efficiency of the vector processing unit (VPU) 207). To avoid complicating the programming, these data exchanges are performed automatically, independently of the program.
The method for automatically performing data exchange is as follows. First, a memory access request, a transfer source, and a transfer destination message must be defined. Each device issues or receives each of the above and processes it. Multiple messages are arbitrated at each device and processed in parallel while honoring their order. Messaging operations are mainly processed by a DMA (direct memory access) device in each vector processing engine (VPE). Through this arrangement, data transfer between the external memory 202 and the L1 cache of the vector processing unit (VPU) 207 or the L2 cache of the vector processing engine (VPE) 205 is accomplished automatically. This process is controlled separately and does not require the program's attention.
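The message flow above can be sketched as follows. The message fields, memory names, and addresses here are hypothetical placeholders for illustration; the disclosure does not specify this structure.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical sketch of the message-based transfer scheme: a memory-access
# request carries a transfer source and a transfer destination, and each
# engine's DMA device drains its own queue in arrival order. All names,
# addresses, and the memory layout below are illustrative assumptions.

@dataclass
class Message:
    source: str        # e.g. external memory (assumed label)
    destination: str   # e.g. a VPE's L2 or a VPU's L1 (assumed labels)
    address: int

memories = {"external_memory": {0x100: b"payload"},
            "vpe0.l2": {}, "vpu3.l1": {}}

def dma_process(queue):
    """Process queued messages in order, copying data source -> destination."""
    while queue:
        msg = queue.popleft()
        memories[msg.destination][msg.address] = memories[msg.source][msg.address]

# External memory -> L2 -> L1, with no involvement from the running program.
q = deque([Message("external_memory", "vpe0.l2", 0x100),
           Message("vpe0.l2", "vpu3.l1", 0x100)])
dma_process(q)
print(memories["vpu3.l1"][0x100])   # → b'payload'
```

The key property mirrored here is that the transfer is driven entirely by queued messages processed in order, so the program itself never orchestrates the copy.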
Next, the vector processing unit (VPU) 207 will be described.
For a SIMD-type architecture, it is preferable to keep the number of circuits in check by compressing them and simplifying the circuit scale of every unit. Furthermore, a simplified circuit can operate at a higher frequency. Thus, each unit has a simple pipeline structure and, unlike a general-purpose processor, does not feature high-level functions. For example, techniques that are costly to implement, such as superscalar execution, are not used.
When the structure of a processor is simplified, flow dependency becomes a problem. For example, when a command reads from register A, performs pipelined processing, and writes to register B, the next command that reads register B must wait until the write to register B is complete. In a large enough program, this waiting penalty can be avoided by scheduling the order of commands, but when the program is smaller because it has been split up and distributed, such scheduling is difficult.
Therefore, it becomes necessary to stall the next command's read of register B until the previous command's write to register B has completed (a hazard occurs). FIG. 3 shows the occurrence of hazards during pipeline operation as a function of time, illustrating the hazard that occurs when the write of command 1 and the read of command 2 use the same register.
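The stall described above can be sketched numerically. The 3-cycle write latency below is an assumed value for illustration, and the register names are hypothetical.

```python
# Illustrative sketch of a read-after-write (flow dependency) hazard on a
# simple pipeline. WRITE_LATENCY models the cycles between a command issuing
# and its register write completing (an assumed value, not from the disclosure).

WRITE_LATENCY = 3

def issue_cycles(commands):
    """Return the cycle at which each (reads, writes) command issues,
    stalling whenever a source register's write is still pending."""
    ready = {}          # register -> cycle at which its pending write completes
    cycle, schedule = 0, []
    for reads, writes in commands:
        # Stall until every source register's pending write has completed.
        cycle = max([cycle] + [ready[r] for r in reads if r in ready])
        schedule.append(cycle)
        for w in writes:
            ready[w] = cycle + WRITE_LATENCY
        cycle += 1
    return schedule

# Command 1 writes register B; command 2 reads B and must wait (hazard).
print(issue_cycles([(["A"], ["B"]), (["B"], ["C"])]))   # → [0, 3]
```

Command 2 would issue at cycle 1 in the absence of the dependency; the hazard forces it back to cycle 3, when register B's write completes.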
To solve this in traditional processors, it was necessary to ensure that no flow dependency arises by running different programs with completely separate contexts, as shown in FIG. 4. In FIG. 4, four programs A, B, C, and D are prepared, and their commands are executed one after another in turn. Since A, B, C, and D each have their own registers, even when program A writes to a register, the registers read by programs B, C, and D do not overlap it and can be accessed. Furthermore, there is a time difference of 3 between command A1 and command A2 of program A, so even though a flow dependency exists, no hazard occurs.
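The interleaving of FIG. 4 can be sketched as follows. The 3-cycle write latency and the per-program register names are assumed values for illustration; the point is only that round-robin interleaving of independent programs hides the flow-dependency latency.

```python
# Illustrative sketch of interleaving four independent programs (A, B, C, D)
# round-robin, as in FIG. 4. With consecutive commands of the same program
# separated by the other programs' commands, an assumed 3-cycle write latency
# is fully hidden and no hazard stall occurs.

WRITE_LATENCY = 3   # assumed register-write latency, for illustration

def stalls(sequence):
    """Count total stall cycles caused by read-after-write dependencies."""
    ready, cycle, stalled = {}, 0, 0
    for prog, reads, writes in sequence:
        earliest = max([cycle] + [ready[r] for r in reads if r in ready])
        stalled += earliest - cycle
        cycle = earliest + 1
        for w in writes:
            ready[w] = earliest + WRITE_LATENCY
    return stalled

# Each program's command 2 reads the register its command 1 wrote
# (registers are private per program, so names are prefixed).
def program(name):
    return [(name, [], [name + ".r0"]), (name, [name + ".r0"], [name + ".r1"])]

serial = program("A") + program("B") + program("C") + program("D")
interleaved = [cmd for grp in zip(*[program(p) for p in "ABCD"]) for cmd in grp]

print(stalls(serial), stalls(interleaved))   # → 8 0
```

Run back to back, each program stalls on its own dependency; interleaved A1, B1, C1, D1, A2, B2, C2, D2, each program's second command issues after the latency has already elapsed, so the pipeline never stalls.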
Thus, a number of processors with simplified structures can be combined and crossbar-connected, with commands and data queued; by adjusting the programs, the efficiency of the processors can be increased and the memory bandwidth can be used in a distributed manner.