Coarse-grained reconfigurable architectures (CGRAs) are known in computer systems. A CGRA is an array of lightweight functional units called processing elements (PEs) that are interconnected via an interconnect network (mesh, hypercube, star, NoC, ...). The dataflow architecture of CGRAs makes them very well suited for accelerating (innermost) loops because they can very effectively exploit the spatial and temporal parallelism often found in such loops.
The coarse-grained datapath, rich point-to-point interconnects, and abundant, albeit distributed, register files make CGRAs very competitive when compared to other accelerator architectures (VLIW, FPGA, and GPU). For example, with respect to data-parallel versus non-data-parallel performance, GPUs can only accelerate data-parallel loops, exploiting data-level parallelism (DLP) and thread-level parallelism (TLP), while CGRAs can exploit DLP, TLP, and instruction-level parallelism (ILP) across loop iterations, and can therefore accelerate even loops that have no data parallelism.
With respect to the programming model, because GPUs can only exploit DLP and TLP, significant code rewriting may be needed to accelerate an application on a GPU, incurring significant software development and debug costs. For a CGRA, it is possible to simply annotate portions of the application; the compiler then maps the application without the code having to be rewritten.
Further, to obtain significant application performance gains on a GPU, the loop trip count needs to exceed roughly 10,000. On a CGRA, the loop trip count can be as low as 100 for the application to be accelerated.
Finally, GPUs can only accelerate loops whose trip count is known in advance of loop execution. CGRAs, however, can accelerate loops that have data-dependent exit conditions (while, break, continue).
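The loop categories contrasted above can be made concrete with a minimal sketch. The three Python functions below are hypothetical illustrations, not taken from any CGRA or GPU toolchain: one data-parallel loop, one loop with a loop-carried dependency (not data-parallel as written), and one loop with a data-dependent exit whose trip count is unknown before execution.

```python
def scale(xs, k):
    """Data-parallel loop: every iteration is independent of the others."""
    out = [0] * len(xs)
    for i in range(len(xs)):
        out[i] = xs[i] * k
    return out


def iterate_recurrence(x0, c, m, n):
    """Loop-carried dependency: iteration i needs the value from i-1.

    As written there is no data parallelism across iterations; any
    speedup has to come from overlapping instructions (ILP).
    """
    x = x0
    for _ in range(n):
        x = (x * x + c) % m
    return x


def find_first_negative(xs):
    """Data-dependent exit: the trip count is not known before the loop runs."""
    for i, x in enumerate(xs):
        if x < 0:
            return i  # early exit decided by the data
    return -1
```

For example, `find_first_negative([3, 1, -2, 5])` exits after three iterations, a count that could not have been determined without inspecting the data.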
However, there are limitations and challenges in CGRA implementations. CGRAs are often studied without a realistic load-store unit (LSU), which in reality has a significant effect on performance. This is especially important if CGRAs are to be considered in the context of a cache-coherent accelerator. Unlike GPUs, which employ warps, CGRAs have no mechanism for hiding memory latency. Hence, data cache misses can impose a severe performance penalty. Further, CGRA architectures do not provide a mechanism for checkpoint, recovery, and rollback.
Current CGRA architectures do not provide support for loop-related hardware features such as the ability to exit based on a data-dependent condition, breaks, etc. Moreover, in a CGRA, loop execution acceleration is achieved through compiler-assisted placement of loop instructions onto an array of processing elements (PEs/ALUs). This is referred to as “Static Placement”, distinguishing it from “Dynamic Placement”, which is performed at run time and is very common in out-of-order processors.
For CGRAs with a large number of PEs, it is desirable to have a very high degree of instruction-level parallelism (ILP) to keep the PEs occupied.
Traditionally, this ILP is achieved by compiler-mediated placement of instructions from other loop iterations, commonly referred to as Modulo Scheduling.
However, traditional Modulo Scheduling has several disadvantages. Although it helps keep the PEs occupied: 1) the latencies assumed by the compiler (for modulo scheduling) often differ from run-time latencies, due to the unpredictable nature of load and store instructions in CMPs (shared memory systems), which makes run-time performance suboptimal; 2) for loops that have loop-carried memory dependencies, it is possible to have store-hit-load (LSU) violations across loop iterations, which cause a later iteration to be flushed and subsequently re-executed; because instructions from these iterations are intermingled with each other, this imposes additional complexity on the predecoder/execution engine to selectively replay only the flushed iteration; and 3) the degree of modulo scheduling (that is, the number of loop iterations in flight) is decided at compile time, whereas at run time it may be optimal to choose fewer loop iterations in flight due to dependency or other constraints.
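The first and third disadvantages can be illustrated with a small sketch. The Python model below is a simplification under stated assumptions, not a description of any particular compiler or CGRA: the loop body is a single chain of dependent operations, resource constraints are ignored, a new iteration starts every `ii` cycles (the initiation interval), and any run-time latency beyond what the compiler assumed stalls the whole static schedule by that many cycles. All function and parameter names are hypothetical.

```python
import math


def iterations_in_flight(body_latency, ii):
    """Degree of modulo scheduling: overlapping iterations in steady state."""
    return math.ceil(body_latency / ii)


def static_schedule(op_latencies, ii, n_iters):
    """Map (iteration, op index) -> start cycle for a chain of dependent ops.

    Assumption: iteration `it` starts at cycle it * ii and each op starts
    as soon as its predecessor in the same iteration has finished.
    """
    schedule = {}
    for it in range(n_iters):
        start = it * ii
        for op, latency in enumerate(op_latencies):
            schedule[(it, op)] = start
            start += latency
    return schedule


def total_cycles(op_latencies, ii, n_iters, extra_stalls=()):
    """Cycles to finish all iterations, plus any run-time stall cycles.

    Assumption: each entry of extra_stalls is latency the compiler did not
    anticipate (e.g. a cache miss) and it delays the entire schedule.
    """
    return (n_iters - 1) * ii + sum(op_latencies) + sum(extra_stalls)


# Hypothetical loop body: load (2 cycles), add (1 cycle), store (1 cycle).
body = [2, 1, 1]
assumed = total_cycles(body, ii=1, n_iters=100)
with_misses = total_cycles(body, ii=1, n_iters=100, extra_stalls=[20] * 5)
```

In this model, with `ii=1` the schedule keeps `iterations_in_flight(4, 1) == 4` iterations overlapped and finishes 100 iterations in 103 cycles if the assumed latencies hold. Five loads that each take 20 cycles longer than assumed raise the total to 203 cycles, even though the compile-time schedule, including the number of iterations in flight, is unchanged.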