This application is related to, and hereby incorporates by reference, the following U.S. patent applications:
System and Method for Daisy Chaining Cache Invalidation Requests in a Shared-memory Multiprocessor System, filed Jun. 11, 2001, attorney docket number 9772-0329-999;
Multiprocessor Cache Coherence System and Method in Which Processor Nodes and Input/Output Nodes Are Equal Participants, filed Jun. 11, 2001, attorney docket number 9772-0324-999;
Cache Coherence Protocol Engine and Method for Processing Memory Transaction in Distinct Address Subsets During Interleaved Time Periods in a Multiprocessor System, filed Jun. 11, attorney docket number 9772-0327-999; and
System And Method For Generating Cache Coherence Directory Entries And Error Correction Codes in a Multiprocessor System, U.S. Ser. No. 09/972,477, filed Oct. 5, 2001, which claims priority on U.S. provisional patent application No. 60/238,330, filed Oct. 5, 2000, which is also hereby incorporated by reference in its entirety.
The present invention relates generally to a multiprocessor computer system, and particularly to a multiprocessor system designed to reduce clock cycles required to switch from one memory transaction to another.
High-end microprocessor designs have become increasingly more complex during the past decade, with designers continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. While this trend has led to significant performance gains on target applications such as the SPEC benchmark, continuing along this path is becoming less viable due to substantial increases in development team sizes and design times. Such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC.
Commercial workloads such as databases and Web applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. Commercial workloads, such as on-line transaction processing (OLTP), exhibit radically different computer resource usage and behavior than technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates that are characteristic for such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads do not have any use for the high-performance floating-point and multimedia functionality that is implemented in modern microprocessors. Therefore, it is not uncommon for a high-end microprocessor to stall most of the time while executing commercial workloads, which leads to a severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, the above trends further question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
The present invention relates generally to a protocol engine for use in a multiprocessor computer system. The protocol engine, which implements a cache coherence protocol, includes a clock signal generator for generating signals denoting interleaved even clock periods and odd clock periods, a memory transaction state array for storing entries, each denoting the state of a respective memory transaction, and processing logic. The memory transactions are divided into even and odd transactions whose states are stored in distinct sets of entries in the memory transaction state array. The processing logic has interleaving circuitry for processing during even clock periods the even memory transactions and for processing during odd clock periods the odd memory transactions. Moreover, the protocol engine is configured to transition from one memory transaction to another in a minimum number of clock cycles. This design improves efficiency for processing commercial workloads, such as on-line transaction processing (OLTP) by taking certain steps in parallel.