1. Field of the Invention
This invention relates generally to microprocessors and more particularly to very long instruction word (VLIW) microprocessors having multiple clusters of functional units.
2. Description of the Related Art
High performance VLIW central processing units (CPUs) with multiple functional units are designed to obtain improved processing performance by executing code which has high instruction level parallelism (ILP). Clustered VLIW (CVLIW) machines further divide the CPU architecture into clusters, each of which contains one or more functional units and a separate register file. Instructions in the code are divided into sub-instructions which are input to each cluster and which may be executed in parallel.
FIG. 1 shows a conventional CVLIW machine 10. CVLIW 10 includes a main instruction cache 20 and a main data cache 80 which are shared by clusters 50A-D. Wide instructions within main instruction cache 20 are fed into instruction register 30 with each cluster 50A-D getting a fixed fraction, called a c-instruction, of the wide instruction. Each c-instruction is loaded into one of subregisters 30A-D for input to a corresponding cluster 50A-D. The source register operand specifiers in the instructions seen by each cluster 50A-D specify registers within one of register files 52A-D that is local to that cluster. There can be communication between clusters by movement of register contents, either with an explicit move instruction or by implicit destination cluster specifiers along with the register field.
Each cluster 50A-D contains a local register file 52A-D feeding operands to one or more local execution units 54A-D to perform an operation, e.g. an ALU operation, load/store, branch, etc. The results generated by the operations performed in each of the clusters 50A-D are then stored in main data cache 80.
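The distribution of a wide instruction into per-cluster c-instructions described above can be illustrated with a minimal sketch. The 32-bit c-instruction width, the low-order-first slice ordering, and the names `split_wide_instruction` and `C_INSTRUCTION_BITS` are assumptions made for illustration only; the description does not specify them.

```python
# Illustrative sketch only: widths, ordering and names are assumed,
# not taken from the described machine.
NUM_CLUSTERS = 4          # clusters 50A-D of FIG. 1
C_INSTRUCTION_BITS = 32   # assumed width of one c-instruction


def split_wide_instruction(wide: int) -> list:
    """Split a wide instruction into one fixed-width slice per cluster.

    Slice 0 is the lowest-order field (an assumed ordering).
    """
    mask = (1 << C_INSTRUCTION_BITS) - 1
    return [(wide >> (i * C_INSTRUCTION_BITS)) & mask
            for i in range(NUM_CLUSTERS)]


# A 128-bit wide instruction whose four fields hold the values 1..4.
wide = 0x00000004_00000003_00000002_00000001
c_instructions = split_wide_instruction(wide)

# Pair each slice with its subregister 30A-D / cluster 50A-D.
per_cluster = dict(zip("ABCD", c_instructions))
print(per_cluster)
```

Each slice corresponds to the contents of one of subregisters 30A-D, and each cluster decodes only its own slice against its local register file.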
CVLIW machines achieve high levels of performance when the ILP is high. However, these CPUs are often interrupted by tasks that are not performance critical or have low ILP. High ILP code can also make procedure or system calls to code of known low ILP. During a high ILP code sequence, all clusters 50A-D of the machine 10 are being utilized and there are many live registers in register files 52A-D. The main instruction cache 20 and data cache 80 have been primed with the necessary instructions and data operands and cache localities typically contribute to relatively fast instruction fetch and data load/store operations. However, when an interrupt or system call involving a low ILP routine occurs, then the entire width of CPU 10 is typically available to the low ILP process even though it is not useful to allocate the full width of the processor to the low ILP task.
To preserve the contents of the registers in register files 52A-D, the interrupt routine or system call will typically store the register contents at the beginning of the interrupt or system call and restore the registers upon completion. As a result, the CPU suffers the overhead of saving and restoring a large number of registers at each interruption or system call even though it is not useful to allocate the full width of CPU 10 to the low ILP process. The performance penalty is particularly important in the CVLIW class of microprocessors which have a large number of live registers distributed among multiple register files 52A-D.
Processor performance is further impacted because low ILP regions will pollute the main instruction cache 20 by overwriting instructions for the high ILP code which are resident in main instruction cache 20, so that performance is reduced when execution flow returns to the high ILP code region. Similarly, the low ILP regions can overwrite data in main data cache 80 which is needed by the high ILP regions and which must then be restored when execution flow returns to the high ILP code region.
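The pollution effect described above can be illustrated with a toy model. The sketch below assumes a direct-mapped cache of eight lines and line-granular addresses; the cache organization, sizes, and the names `DirectMappedCache` and `run` are hypothetical and are not drawn from the described machine.

```python
# Toy model of cache pollution; organization and sizes are assumed.
class DirectMappedCache:
    def __init__(self, num_lines: int):
        self.lines = [None] * num_lines  # one resident address per line
        self.misses = 0

    def access(self, address: int) -> None:
        index = address % len(self.lines)
        if self.lines[index] != address:
            # Miss: the previous occupant of this line is evicted.
            self.misses += 1
            self.lines[index] = address


def run(cache: DirectMappedCache, addresses) -> int:
    """Access every address and return the number of misses incurred."""
    before = cache.misses
    for address in addresses:
        cache.access(address)
    return cache.misses - before


cache = DirectMappedCache(num_lines=8)
high_ilp = range(0, 8)       # working set of the high ILP region
low_ilp = range(100, 104)    # working set of the low ILP routine

cold = run(cache, high_ilp)      # priming the cache
warm = run(cache, high_ilp)      # steady state, everything resident
intrusion = run(cache, low_ilp)  # low ILP routine evicts four lines
resumed = run(cache, high_ilp)   # high ILP code must refetch them
print(cold, warm, intrusion, resumed)
```

In this model the high ILP region runs without misses once primed, but after the low ILP routine intervenes it incurs one additional miss for each line the routine displaced.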
Conventional CPU architectures, such as the PA-RISC architecture from Hewlett-Packard (HP), have attacked the register save/restore problem by designating a subset of registers as special shadow registers with dual internal state. The low ILP code regions are limited by convention to using this subset of the total register space and register pollution is avoided. However, this technique requires the designation of privileged and non-privileged tasks and cannot be used in any non-privileged, and therefore non-trusted, tasks.
Another way that a register file save and restore can be avoided is by using a register stack as done in the PAWW processor from Hewlett-Packard (HP). However, the register stack requires the inclusion of register renaming, with an attendant increase in complexity, area and execution time in the front end of the CPU pipeline. Also, there is high overhead in supporting the virtual to physical translation required for the register stack and in supporting bounds checking and other housekeeping functions that must be included to accommodate the use of the register stack. Furthermore, register stack overflow and underflow must be provided for, which results in still more code overhead and complexity.
In addition, neither the save and restore solution nor the shadow register set solution employed in conventional CPU architectures prevents pollution of the main instruction cache 20 and main data cache 80.
Accordingly, the need remains for a CPU capable of executing both high ILP and low ILP code regions without loss of performance due to pollution of registers, instruction cache and data cache.