High-end microprocessor designs have become increasingly more complex during the past decade, with designers continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. While this trend has led to significant performance gains on target applications such as the SPEC benchmark, continuing along this path is becoming less viable due to substantial increases in development team sizes and design times. Such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC.
Commercial workloads such as databases and Web applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. Commercial workloads, such as on-line transaction processing (OLTP), exhibit radically different computer resource usage and behavior than technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates that are characteristic for such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads do not have any use for the high-performance floating-point and multimedia functionality that is implemented in modern microprocessors. Therefore, it is not uncommon for a high-end microprocessor to stall most of the time while executing commercial workloads, which leads to a severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, the above trends further question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
Fortunately, increasing chip densities and transistor counts provide architects with several alternatives for better tackling design complexities in general, and the needs of commercial workloads in particular. For example, the Alpha 21364 aggressively exploits semiconductor technology trends by including a scaled 1 GHz 21264 core, two levels of caches, memory controller, coherence hardware, and network router all on a single die. The tight coupling of these modules enables a more efficient and lower latency memory hierarchy that can substantially improve the performance of commercial workloads. Furthermore, the reuse of an existing high-performance processor core in designs such as the Alpha 21364 effectively addresses the design complexity issues and provides better time-to-market without sacrificing server performance. Higher transistor counts can also be used to exploit the inherent and explicit thread-level (or process-level) parallelism that is abundantly available in commercial workloads to better utilize on-chip resources. Such parallelism typically arises from relatively independent transactions or queries initiated by different clients, and has traditionally been used to hide I/O latency in such workloads. Previous studies have shown that techniques such as simultaneous multithreading (SMT) can provide a substantial performance boost for database workloads. In fact, the Alpha 21464 (the successor to the Alpha 21364) combines aggressive chip-level integration along with an eight-instruction-wide out-of-order processor with SMT support for four simultaneous threads.
Typical directory-based cache coherence protocols suffer from extra messages and protocol processing overheads for a number of protocol transactions. These problems are the result of various mechanisms used to deal with resolving races and deadlocks and the handling of xe2x80x9c3-hopxe2x80x9d transactions that involve a remote node in addition to the requester and the home node (where the directory resides). For example, negative-acknowledgment messages (NAKs) are common in several cache coherence protocols for dealing with races and resolving deadlocks, which occurs when two or more processors are unable to make progress because each requires a response from one or more of the others in order to do so. The use of NAKs also leads to non-elegant solutions for livelock, which occurs when two or more processors continuously change a state in response to changes in one or more of the others without making progress, and starvation, which occurs when a processor is unable to acquire resources.
Similarly, 3-hop transactions (e.g., requester sends a request, home forwards request to owner, owner replies to requester) typically involve two visits to the home node (along with the corresponding extra messages to the home) in order to complete the transaction. At least one cache coherence protocol avoids the use of NAKs and services most 3-hop transactions with only a single visit to the home node. However, this cache coherence protocol places strict ordering requirements on the underlying transaction-message interconnect/network, which goes even beyond requiring point-to-point ordering. These strict ordering requirements are a problem because they make the design of the network more complex. It is much easier to design the routing layer if each packet can be treated independent of any other packet. Also, strict ordering leads to less than optimal use of the available network bandwidth.
The present invention also avoids the use of NAKs and services most 3-hop transactions with only a single visit to the home node. Exceptions include read transactions that require two visits to the home node because of a sharing write-back that is sent back to the home node. However, the present invention does not place ordering requirements on the underlying transaction-message interconnect/network.
The present invention relates generally to a protocol engine for use in a multiprocessor computer system. The protocol engine, which implements a cache coherence protocol, includes a clock signal generator for generating signals denoting interleaved even clock periods and odd clock periods, a memory transaction state array for storing entries, each denoting the state of a respective memory transaction, and processing logic. The memory transactions are divided into even and odd transactions whose states are stored in distinct sets of entries in the memory transaction state array. The processing logic has interleaving circuitry for processing during even clock periods the even memory transactions and for processing during odd clock periods the odd memory transactions.