Reconfigurable computing architecture with large amount of processing array can meet demand of computation power while keeping the power and silicon area efficient. Unlike field-programmable gate array (FPGA), Coarse-Grained Reconfigurable Architecture (CGRA) utilizes larger processing elements like arithmetic logic units (ALU) as its building blocks. It provides features of reconfigurability using high level language to quickly program the processing element (PE) array. One typical design of CGRA is shown in FIG. 1. It consists of a PE array, a configuration memory, a memory unit as a frame buffer and interconnections between the PEs and PEs to the frame buffer.
In general, CGRA is an approach to explore loop level parallelism. It is not specifically targeted to handle thread level parallelism. With any data dependency from one iteration to the next, the parallelism is largely limited. Therefore, the 2D array size is intended to be limited to an 8×8 PE array in most of the designs.
Graphics processing unit (GPU) architecture has provided a way to execute parallel threads in a Same Instruction Multiple Thread (SIMT) fashion. It is especially suitable for massive parallel computing applications. In these applications, typically no dependency is assumed between threads. This type of parallelism is beyond loop level parallelism within a software task, which CGRA is designed for. The thread level parallelism can be easily scalable beyond single core execution to multicore execution. The thread parallelism provides optimization opportunities and makes the PE array more efficient and more capable and it is easily made larger than 8×8. GPU, however, is not reconfigurable. Therefore, there is a need in the art to develop a next generation processor that can harness the processing capability of both CGRA and GPU.