Clustered very long instruction word (VLIW) processors and runtime reconfigurable fabrics such as field programmable gate arrays (FPGAs) are both geared toward efficiently executing programs with high instruction level parallelism (ILP), typically loops in streaming applications (video, audio, modulation, channel coding, encryption, etc.). Embedded multimedia applications, for instance, exhibit high ILP (as much as 10 to 100 instructions per cycle) in a small fraction of the total program code, often quoted as 10-20%, and low ILP (about 2 instructions per cycle) in the remaining 80-90% of the program. However, on a conventional processor, the high ILP portion of the program code (e.g., the loops) consumes a high fraction of the total execution time, reportedly 80-90%.
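To make the execution-time figures above concrete, the following is a minimal sketch that applies Amdahl's law to them. The function name and the sample acceleration factor of 8 are illustrative assumptions, not values from the text; only the 80-90% loop-time fractions come from it.

```python
def overall_speedup(loop_time_fraction, loop_speedup):
    """Amdahl's law: whole-program speedup when only the loops are accelerated.

    loop_time_fraction: share of execution time spent in high ILP loops (0..1).
    loop_speedup: factor by which those loops are accelerated (assumed value).
    """
    serial_part = 1.0 - loop_time_fraction
    return 1.0 / (serial_part + loop_time_fraction / loop_speedup)


if __name__ == "__main__":
    for fraction in (0.8, 0.9):  # loop-time fractions quoted in the text
        bounded = overall_speedup(fraction, 8.0)           # assumed 8x on loops
        limit = overall_speedup(fraction, float("inf"))    # loops cost nothing
        print(f"loops = {fraction:.0%} of time: "
              f"8x on loops gives {bounded:.2f}x overall, "
              f"upper bound {limit:.1f}x")
```

Under this simple model, the remaining low ILP code caps the achievable overall gain at roughly 5x to 10x, which suggests why the low ILP portion still matters even though the loops dominate execution time.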
VLIW and superscalar architectures, which can issue several instructions per cycle for concurrent execution, have been proposed to accelerate code with high ILP. As the issue width increases, mounting design constraints on the register file, primarily related to timing, have led to clustered architectures in which the registers are split among several physically distinct register files, each with its own cluster of execution units. An inter-cluster bus is added to enable data transfer between register files.
In general purpose clustered architectures, however, significant silicon overhead and a significant performance penalty are incurred by the inter-cluster bus necessary for transferring data between register files. Additionally, for VLIW architectures, which are statically scheduled, program size overhead is also incurred because send/receive instructions must be added to transfer data between register files.
In practice, therefore, the inter-cluster bus overhead effectively limits the number of clusters to about four. Assuming each cluster has an issue width of four instructions, the maximum issue width of this architecture is sixteen, a typical upper limit for clustered VLIW machines. In addition, even more so than other programmable processors, clustered VLIW architectures incur a significant power dissipation penalty compared to a fixed hardware implementation because of the inter-cluster bus and because of the unused clusters when the ILP of the program code is too small. In fact, for the archetypical multimedia embedded program, a clustered architecture arguably has too little parallelism for the high ILP sections of the code and too much for the remainder of the code.
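The mismatch described above can be illustrated with a small back-of-the-envelope model. This is a sketch under stated assumptions: it treats achieved instructions per cycle as the minimum of the available ILP and the issue width, ignores inter-cluster bus stalls, and uses hypothetical function names. The cluster count, per-cluster issue width, and ILP values are the figures quoted in the text.

```python
CLUSTERS = 4                  # practical cluster limit quoted in the text
ISSUE_WIDTH_PER_CLUSTER = 4   # per-cluster issue width assumed in the text


def max_issue_width(clusters=CLUSTERS, per_cluster=ISSUE_WIDTH_PER_CLUSTER):
    """Total number of instructions the machine can issue per cycle."""
    return clusters * per_cluster


def slot_usage(ilp, issue_width):
    """Return (achieved IPC, share of issue slots used, share of ILP exploited).

    Simplifying assumption: achieved IPC is min(ILP, issue width).
    """
    achieved = min(ilp, issue_width)
    return achieved, achieved / issue_width, achieved / ilp


if __name__ == "__main__":
    width = max_issue_width()
    print(f"maximum issue width: {width}")
    for label, ilp in (("low ILP code", 2),
                       ("high ILP loop", 10),
                       ("high ILP loop", 100)):
        ipc, used, exploited = slot_usage(ilp, width)
        print(f"{label} (ILP {ilp}): IPC {ipc}, "
              f"{used:.1%} of issue slots used, "
              f"{exploited:.0%} of available ILP exploited")
```

In this model, code with an ILP of 2 leaves most of the sixteen issue slots idle, while a loop with an ILP of 100 saturates the machine yet exploits only a small share of its available parallelism.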
Another approach to accelerating ILP rich programs, mostly pursued in academia, is a coprocessor built from reconfigurable logic and a reconfigurable interconnect. However, this approach has its own problems. So-called Runtime Reconfigurable CoProcessors (RRCPs) based on an FPGA or similar fabric are limited by a poor programming abstraction, which makes them difficult to program, and by poor scalability. The poor scalability stems from the fact that these architectures typically have global routing resources. As the size of the coprocessor increases, the timing along the global routing limits the speed of the coprocessor, not unlike the inter-cluster bus in clustered architectures.
The poor programming model is a consequence of the absence of virtualization of the computing resources: the number of computing resources (arithmetic logic units, multipliers, etc.) is typically visible in the programming abstraction. As a result, a program segment requiring more resources than are available cannot be accelerated on the RRCP. In addition, reconfiguration phases generally show up in the programming model as well, adding to its complexity.
Additional drawbacks of RRCPs include the significant amount of time taken by reconfiguration, the amount of resources dedicated to reconfiguration (such as a reconfiguration bitstream cache), the use of Register Transfer Level (RTL) or dedicated languages to program reconfiguration resources, and the area required and performance penalty incurred by reconfigurable logic. Various projects in academia, and some commercial offerings, have proposed partial solutions to the above limitations, but none have addressed all of the problems together.
There is, therefore, a need in the art for a general purpose programmable architecture that can dynamically allocate an adequate amount of computing resources based upon the amount of ILP within the program segment being executed, to the benefit of both performance and power dissipation; that is, an architecture that executes high ILP kernels very fast while executing low ILP program segments very power efficiently. There is further a need for a coprocessor architecture that can offer the performance of hardwired or reconfigurable logic while providing the familiar abstraction and scalability of a general purpose processor.