1. Field of the Invention
This application relates to complex microprocessor design and, specifically, to chip-multiprocessor design with two-level caching.
2. Background Art
High-end microprocessor designs are becoming increasingly complex, continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. Associated with such complexity are higher development costs and longer design times. Meanwhile, such designs are not well suited to important commercial applications, such as on-line transaction processing (OLTP), because these applications suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. The abundance of explicit thread-level parallelism in commercial workloads, along with advances in semiconductor integration density, identifies chip multiprocessing (CMP) as potentially the most promising approach for designing processors targeted at commercial servers.
Commercial workloads such as databases and world-wide web (Web) applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. A number of recent studies have underscored the radically different behavior of commercial workloads such as OLTP relative to technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from the large instruction and data footprints and high communication miss rates that are characteristic of such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads do not have any use for the high-performance floating-point and multimedia functionality that is implemented in current microprocessors. Therefore, it is not uncommon for a high-end microprocessor to be stalled most of the time while executing commercial workloads, leading to a severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, the above trends further call into question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
However, increasing chip densities and transistor counts provide architects with several alternatives for better tackling design complexities in general, and the needs of commercial workloads in particular. Higher transistor counts can also be used to exploit the inherent and explicit thread-level (or process-level) parallelism that is abundantly available in commercial workloads to better utilize on-chip resources. Such parallelism typically arises from relatively independent transactions or queries initiated by different clients, and has traditionally been used to hide I/O latency in such workloads. Previous studies have shown that techniques such as simultaneous multithreading (SMT) can provide a substantial performance boost for database workloads. While the SMT approach is superior in single-thread performance (important for workloads without explicit thread-level parallelism), it is best suited for very wide-issue processors which are more complex to design. In comparison, CMP advocates using simpler processor cores at a potential loss in single-thread performance, but compensates in overall throughput by integrating multiple such cores. Furthermore, CMP naturally lends itself to a hierarchically partitioned design with replicated modules, allowing chip designers to use short wires as opposed to costly and slow long wires that can adversely affect cycle time.
Accordingly, there is a need to build a system that achieves superior performance on commercial workloads (especially OLTP) with a smaller design team, a more modest investment, and a shorter design time. The present invention addresses these and related issues.
In accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method for scalable chip-multiprocessing and a chip-multiprocessor (CMP) system, an example of which is referred to as the PIRANHA(trademark) system (PIRANHA(trademark) is a trademark of Compaq Computer Corporation, Houston, Tex.). The design of this CMP system employs chip-multiprocessing architectures targeted preferably at parallel commercial workloads. As a functional and architectural strategy, the present invention exploits chip multiprocessing (CMP) by integrating a plurality of simpler processor cores, such as ALPHA(trademark) processor cores, along with a two-level cache hierarchy on a single chip (ALPHA(trademark) is a trademark of Compaq Computer Corporation, Houston, Tex.). The use of simpler processor cores combined with an industry-standard ASIC design methodology allows completion of a CMP prototype such as the PIRANHA(trademark) system within a shorter time and with a smaller team and investment than those required for current commercial microprocessors.
Preferably, the PIRANHA(trademark) architecture includes a highly-integrated processing node with eight simpler ALPHA(trademark) processor cores, separate instruction and data caches for each core, a shared second-level cache, eight memory controllers, two coherence protocol engines, and a network router, all on a single chip die. Multiple such processing nodes can be used to build a glueless multiprocessor in a modular and scalable fashion. Hence, the PIRANHA(trademark) architecture also includes functionality for allowing scalable multiprocessor configurations to be built in a glueless and modular fashion.
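By way of illustration only, the composition of one processing node as listed above can be written down as a small data structure. The following is a minimal sketch: the component counts are taken from the preceding paragraph, while the class name, field names, and helper functions are hypothetical and are not part of the described design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingNode:
    """Component counts for one processing node, as listed in the text.

    All names here are illustrative assumptions; only the counts come
    from the description.
    """
    cores: int = 8
    l1_instruction_caches_per_core: int = 1
    l1_data_caches_per_core: int = 1
    shared_l2_caches: int = 1
    memory_controllers: int = 8
    protocol_engines: int = 2
    routers: int = 1

    def l1_cache_count(self) -> int:
        # Each core has its own instruction cache and its own data cache.
        per_core = (self.l1_instruction_caches_per_core
                    + self.l1_data_caches_per_core)
        return self.cores * per_core


def system_core_count(nodes: int, node: ProcessingNode = ProcessingNode()) -> int:
    """Total cores in a multiprocessor built from identical nodes."""
    if nodes < 1:
        raise ValueError("a system needs at least one processing node")
    return nodes * node.cores
```

Under these assumptions, a single node holds sixteen first-level caches, and a hypothetical four-node configuration would provide thirty-two cores.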
As a further design strategy, the shared second-level cache has a relaxed or non-inclusion property. Moreover, the PIRANHA(trademark) system incorporates a highly-optimized cache coherence protocol and a novel input/output (I/O) architecture. The design of the shared second-level cache uses a sophisticated protocol that does not enforce inclusion of the first-level instruction and data caches, in order to maximize the utilization of on-chip caches. Additionally, the CMP system includes an I/O node, with a unique I/O architecture, that is a full-fledged member of the interconnect and of the global shared-memory coherence protocol.
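The benefit of not enforcing inclusion can be illustrated with simple capacity arithmetic. The sketch below is an assumption-laden toy model, not the protocol of the described system: it only computes an upper bound on the number of distinct cache lines that can reside on chip. The function name and the cache sizes in the usage note are hypothetical; the description gives no cache sizes.

```python
def max_distinct_lines(l1_caches: int, lines_per_l1: int,
                       l2_lines: int, inclusive: bool) -> int:
    """Upper bound on distinct cache lines held on chip (toy model)."""
    if min(l1_caches, lines_per_l1, l2_lines) < 0:
        raise ValueError("counts must be non-negative")
    l1_total = l1_caches * lines_per_l1
    if inclusive:
        # With inclusion, every first-level line also occupies a
        # second-level slot, so the second level bounds the total.
        return l2_lines
    # Without inclusion, first-level and second-level contents need not
    # overlap, so in the best case the capacities add up.
    return l1_total + l2_lines
```

For example, with sixteen first-level caches (eight cores, each with an instruction and a data cache) of a hypothetical 1024 lines each and a hypothetical 16384-line second-level cache, the bound is 16384 lines with inclusion and 32768 lines without it. Actual occupancy depends on the workload and on the coherence protocol, which this sketch does not model.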
Advantages of the invention will be understood by those skilled in the art, in part, from the description that follows. Advantages of the invention will be realized and attained from practice of the invention disclosed herein.