Computer workloads, in some embodiments, run in a continuum from those having little inherent parallelism (being predominantly scalar) to those having significant amounts of parallelism (being predominantly parallel), and this nature may vary from segment to segment in the software. Typical scalar workloads include software development tools, office productivity suites, and operating system kernel routines. Typical parallel workloads include 3D graphics, media processing, and scientific applications. Scalar workloads may retire instructions at a rate in the range of 0.2 to 2.0 instructions per clock (IPC), whereas parallel workloads may achieve throughput in the range of 4 to several thousand IPC. Such high IPC values may be obtainable through the use of instruction-level parallelism and thread-level parallelism.
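By way of illustration only, the contrast between the two ends of this continuum may be sketched with a simplified cycle-count model. The model below is an assumption made for the example and does not describe any particular microprocessor: the function names, the issue width, and the latency values are hypothetical. A chain of dependent instructions is assumed to retire one instruction per result latency, whereas fully independent instructions are assumed to issue at the full width of a pipelined machine.

```python
import math


def cycles_dependent_chain(n_instructions: int, latency: int) -> int:
    """Cycles for a chain in which each instruction needs the previous result.

    Assumption: no overlap is possible, so every instruction waits the full
    result latency of its predecessor.
    """
    return n_instructions * latency


def cycles_independent(n_instructions: int, issue_width: int, latency: int) -> int:
    """Cycles for fully independent instructions on a pipelined machine.

    Assumption: up to issue_width instructions issue every cycle, and the
    last group needs (latency - 1) further cycles to complete.
    """
    issue_cycles = math.ceil(n_instructions / issue_width)
    return issue_cycles + latency - 1


def ipc(n_instructions: int, cycles: int) -> float:
    """Instructions per clock: instructions retired divided by cycles taken."""
    return n_instructions / cycles


if __name__ == "__main__":
    n = 1000  # hypothetical instruction count
    scalar_ipc = ipc(n, cycles_dependent_chain(n, latency=2))
    parallel_ipc = ipc(n, cycles_independent(n, issue_width=8, latency=2))
    print(f"dependent chain IPC: {scalar_ipc:.2f}")
    print(f"independent IPC:     {parallel_ipc:.2f}")
```

Under these assumed parameters, the dependent chain falls within the scalar IPC range given above and the independent stream falls within the parallel range; real workloads lie between these extremes and may shift from one software segment to the next.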
Prior art microprocessors have often been designed with either scalar or parallel performance as the primary objective. To achieve high scalar performance, it is often desirable to reduce execution latency as much as possible. Micro-architectural techniques to reduce effective latency include speculative execution, branch prediction, and caching. The pursuit of high scalar performance has resulted in large, out-of-order, highly speculative, deeply pipelined microprocessors. To achieve high parallel performance, it may be desirable to provide as much execution throughput (bandwidth) as possible. Micro-architectural techniques to increase throughput include wide superscalar processing, single-instruction-multiple-data (SIMD) instructions, chip-level multiprocessing, and multithreading.
Problems may arise when trying to build a microprocessor that performs well on both scalar and parallel tasks. One problem may arise from a perception that design techniques needed to achieve short latency are in some cases very different from the design techniques needed to achieve high throughput.