Larger and wider instruction windows, combined with out-of-order execution, have facilitated the exploitation of instruction level parallelism (ILP) over the past decade. Superscalar architectures have evolved towards higher issue widths and longer instruction windows in order to achieve higher instruction throughput by taking advantage of the ever-increasing availability of on-chip transistors. These trends are likely to continue with next generation multi-threaded processors, which allow for much better utilization of the resources in a wide-issue superscalar core. However, increasing the issue width (e.g., going from a four-way issue to an eight-way issue processor) is not trivial; it involves significant design and verification challenges.
It is well known that current superscalar organizations are approaching a point of diminishing returns. Moving from a four-way issue to an eight-way issue architecture is difficult because of the added hardware complexity and its implications for cycle time. Nevertheless, the ILP that an eight-way issue processor can exploit far exceeds that available to a four-way issue processor. Wire delays, the increasing complexity of processor components, and power dissipation constitute three important barriers to scaling up current superscalar micro-architectures. In particular, the increasing complexity of critical components such as the issue logic, bypass network, register file and renaming logic may directly affect clock cycle time.
One of the proposed solutions to these problems is a technique referred to as clustering. In a clustered micro-architecture, some of the critical components are partitioned into simpler structures so that the impact of wire delays is reduced for signals that are kept within a cluster. Clusters offer the advantages of partitioned schemes, in which one can exploit more ILP while sustaining a high clock rate. A partitioned architecture tends to make the hardware simpler and the control paths and datapaths faster. For instance, a partitioned architecture has fewer register file ports, fewer data bus sources/destinations and fewer alternatives for many control decisions.
Accordingly, clustering provides an alternative to designing wide and deep superscalar processors by replacing them with a collection of smaller windows and associated functional unit clusters. Each cluster issues and executes the instructions that are directed to it. Values produced within a cluster become available to a consumer within the same cluster faster than to a consumer in another cluster. For consumers in remote clusters, a delay, called the “inter-cluster bypass latency” (ICBL), is paid. This latency across clusters is due to the large wire delays that exist across current chips.
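The effect of the inter-cluster bypass latency on a dependent instruction can be illustrated with a minimal sketch. The names `Instr`, `result_ready_cycles` and `icbl`, the latencies, and the cluster assignments below are hypothetical and chosen only for illustration; the model assumes instructions are listed in program order, ignores issue width and resource contention, and is not a description of any particular processor.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str            # label for the value this instruction produces
    cluster: int         # cluster the instruction is steered to (assumed given)
    latency: int         # execution latency in cycles (illustrative)
    sources: tuple = ()  # names of producer instructions it depends on


def result_ready_cycles(instrs, icbl):
    """Return {name: cycle at which the result is available in its own cluster}.

    A source value produced in a different cluster is delayed by `icbl`
    cycles before the consumer can use it; a local value is not delayed.
    """
    ready = {}
    cluster_of = {}
    for ins in instrs:  # producers must appear before their consumers
        start = 0
        for src in ins.sources:
            avail = ready[src]
            if cluster_of[src] != ins.cluster:
                avail += icbl  # pay the inter-cluster bypass latency
            start = max(start, avail)
        ready[ins.name] = start + ins.latency
        cluster_of[ins.name] = ins.cluster
    return ready


# "a" produces a value in cluster 0; "b" consumes it locally,
# "c" consumes it from cluster 1.
program = [
    Instr("a", cluster=0, latency=1),
    Instr("b", cluster=0, latency=1, sources=("a",)),
    Instr("c", cluster=1, latency=1, sources=("a",)),
]
cycles = result_ready_cycles(program, icbl=2)
```

Under these assumptions the local consumer `b` completes two cycles earlier than the remote consumer `c`, and the gap equals the assumed ICBL of two cycles.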
The processor resources required for effective execution vary from one application to another, and they also vary across different sections of the same application. As a result, certain applications will not utilize all processor resources, yet the unused resources continue to consume power. Consequently, clustered micro-architectures may consume inordinate amounts of power, which renders such micro-architectures infeasible within energy-sensitive devices, such as portable or hand-held devices, which rely on an on-board power supply for operation. Therefore, there remains a need to overcome one or more of the limitations in the above-described, existing art.