Superscalar processors dispatch more than one instruction per cycle to improve performance. Unfortunately, such superscalar designs incur escalating hardware costs that dilute the benefits of building wider processors. The problem is aggravated in speculative (typically out-of-order) processors, which operate by dispatching more instructions per cycle than can be sustainably graduated. The problem is further exacerbated by Reduced Instruction Set Computer (RISC) instruction sets, which have very simple instructions and consequently require even wider machines to compete with corresponding Complex Instruction Set Computer (CISC) machines.
This is a significant problem in many microprocessors, but it is particularly acute in synthesized processors, where the frequency loss of building a wider machine can rival the throughput gain of doing so. Since frequency loss affects all programs and throughput gain affects only some, there is a greater likelihood of an overall performance loss (because performance = throughput * frequency). Any method that obtains the benefits of higher throughput without hurting frequency is therefore welcome.
A typical RISC processor usually has about 15% more dynamic instructions in the code stream to perform the same program as a comparable CISC processor. This instruction bloat does not hurt performance in the low-performance domain because the shorter pipelines and higher frequency benefits of RISC outweigh any instruction throughput disadvantages due to the code expansion. However, when striving for higher performance targets, a RISC processor must be designed to process more instructions per cycle. For example, the performance of a 3-wide CISC processor that can dispatch and graduate 3 instructions per cycle could not be equaled by a 3-wide RISC processor. Rather, a 4-wide RISC processor is required. This strategy works fine in a power-unconstrained industry, but the extra power of a 4-channel versus a 3-channel processor can be intolerable in power-sensitive markets.
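The arithmetic behind these two observations can be sketched as follows. This is an illustrative model only: the function names are hypothetical, the 10% gain and 10% loss figures are made-up inputs rather than measurements, and the width calculation assumes both machines run at the same frequency and keep their dispatch slots full.

```python
import math

# From the text: a RISC code stream carries about 15% more dynamic instructions.
RISC_INSTRUCTION_OVERHEAD = 0.15


def required_risc_width(cisc_width: int,
                        overhead: float = RISC_INSTRUCTION_OVERHEAD) -> int:
    """Smallest whole-number RISC width that matches a CISC width.

    Assumption (not from the source): equal frequency and fully used slots,
    so the RISC machine must move (1 + overhead) times as many instructions
    per cycle, rounded up to a whole channel.
    """
    return math.ceil(cisc_width * (1.0 + overhead))


def relative_performance(throughput_gain: float, frequency_loss: float) -> float:
    """performance = throughput * frequency, relative to a baseline of 1.0."""
    return (1.0 + throughput_gain) * (1.0 - frequency_loss)


if __name__ == "__main__":
    # 3 * 1.15 is about 3.45 instructions per cycle, which rounds up to 4.
    print(required_risc_width(3))
    # Hypothetical inputs: a 10% throughput gain paired with a 10% frequency loss.
    print(relative_performance(0.10, 0.10))
```

Under these assumptions, a frequency loss that merely equals the throughput gain already yields a net result slightly below the baseline, and that is before accounting for programs that see no throughput gain at all.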
It is well known that increasing dispatch width (i.e., the degree of superscalarity of a processor) causes a quadratic increase in register renamer complexity and area. Increasing dispatch width can also reduce the achievable frequency proportionally. Thus, any technique that can reduce the pressure to build a wider machine is welcome. In other words, it would be desirable to provide a technique to increase dispatch bandwidth in a RISC machine without the use of additional processing channels.
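One way to see where the quadratic growth comes from is to count the dependency comparators a renamer needs within a single dispatch group. The sketch below is a simplified model, not a description of any particular renamer: it assumes each instruction has one destination and two source registers, and that every source must be compared against the destinations of all earlier instructions in the same group. The function name is hypothetical.

```python
def rename_comparators(width: int, sources_per_instruction: int = 2) -> int:
    """Count intra-group source-versus-destination comparators.

    Simplified model (an assumption, not from the source): the instruction
    at position p in the dispatch group compares each of its sources against
    the destinations of the p instructions ahead of it.
    """
    total = 0
    for position in range(width):
        total += sources_per_instruction * position
    return total


if __name__ == "__main__":
    for width in (2, 3, 4, 8):
        # With two sources, the count is width * (width - 1).
        print(width, rename_comparators(width))
```

In this model, moving from a 3-wide to a 4-wide machine doubles the comparator count from 6 to 12, even though the dispatch width grows by only one third.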