The present invention relates generally to multiprocessor computer systems, and particularly to a multiprocessor system designed to be highly scalable, using efficient cache coherence logic and methodologies.
High-end microprocessor designs have become increasingly complex during the past decade, with designers continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. While this trend has led to significant performance gains on target applications such as the SPEC benchmark, continuing along this path is becoming less viable due to substantial increases in development team sizes and design times. Such designs are especially ill-suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC.
Commercial workloads such as databases and Web applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. Commercial workloads, such as on-line transaction processing (OLTP), exhibit radically different computer resource usage and behavior than technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates that are characteristic of such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads do not have any use for the high-performance floating-point and multimedia functionality that is implemented in modern microprocessors. Therefore, it is not uncommon for a high-end microprocessor to stall most of the time while executing commercial workloads, which leads to a severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, the above trends further call into question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
Fortunately, increasing chip densities and transistor counts provide architects with several alternatives for better tackling design complexities in general, and the needs of commercial workloads in particular. For example, the Alpha 21364 aggressively exploits semiconductor technology trends by including a scaled 1 GHz 21264 core, two levels of caches, memory controller, coherence hardware, and network router all on a single die. The tight coupling of these modules enables a more efficient and lower latency memory hierarchy that can substantially improve the performance of commercial workloads. Furthermore, the reuse of an existing high-performance processor core in designs such as the Alpha 21364 effectively addresses the design complexity issues and provides better time-to-market without sacrificing server performance. Higher transistor counts can also be used to exploit the inherent and explicit thread-level (or process-level) parallelism that is abundantly available in commercial workloads to better utilize on-chip resources. Such parallelism typically arises from relatively independent transactions or queries initiated by different clients, and has traditionally been used to hide I/O latency in such workloads. Previous studies have shown that techniques such as simultaneous multithreading (SMT) can provide a substantial performance boost for database workloads. In fact, the Alpha 21464 (the successor to the Alpha 21364) combines aggressive chip-level integration along with an eight-instruction-wide out-of-order processor with SMT support for four simultaneous threads.
Typical directory-based cache coherence protocols suffer from extra messages and protocol processing overheads for a number of protocol transactions. These problems are the result of various mechanisms used to resolve races and deadlocks and to handle “3-hop” transactions that involve a remote node in addition to the requester and the home node (where the directory resides). For example, negative-acknowledgment messages (NAKs) are common in several cache coherence protocols for dealing with races and resolving deadlocks, which occur when two or more processors are unable to make progress because each requires a response from one or more of the others in order to do so. The use of NAKs also leads to inelegant solutions for livelock, which occurs when two or more processors continuously change state in response to changes in one or more of the others without making progress, and for starvation, which occurs when a processor is unable to acquire resources.
Similarly, 3-hop transactions (e.g., the requester sends a request, the home node forwards the request to the owner, and the owner replies to the requester) typically involve two visits to the home node (along with the corresponding extra messages to the home node) in order to complete the transaction. At least one cache coherence protocol avoids the use of NAKs and services most 3-hop transactions with only a single visit to the home node. However, this cache coherence protocol places strict ordering requirements on the underlying transaction-message interconnect/network, which go even beyond requiring point-to-point ordering. These strict ordering requirements are a problem because they make the design of the network more complex. It is much easier to design the routing layer if each packet can be treated independently of any other packet. Also, strict ordering leads to less than optimal use of the available network bandwidth.
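The difference between a 3-hop transaction that visits the home node twice and one that visits it once can be illustrated by counting messages. The following is only a minimal sketch: the message labels (`read_request`, `forwarded_request`, `data_reply`, `ownership_update`) and the exact form of the extra message to the home node are assumptions made for illustration and are not taken from any particular protocol.

```python
from typing import List, NamedTuple


class Msg(NamedTuple):
    src: str   # sending node
    dst: str   # receiving node
    kind: str  # hypothetical message label, for illustration only


# Assumed trace of a 3-hop transaction in which the home node is visited twice:
# the owner sends an extra message back to the home node.
TWO_VISIT_TRACE: List[Msg] = [
    Msg("requester", "home", "read_request"),
    Msg("home", "owner", "forwarded_request"),
    Msg("owner", "requester", "data_reply"),
    Msg("owner", "home", "ownership_update"),  # second visit to the home node
]

# Assumed trace of the same transaction serviced with a single home visit.
ONE_VISIT_TRACE: List[Msg] = TWO_VISIT_TRACE[:3]


def home_visits(trace: List[Msg]) -> int:
    """Count how many messages in the trace are delivered to the home node."""
    return sum(1 for m in trace if m.dst == "home")


if __name__ == "__main__":
    for name, trace in (("two-visit", TWO_VISIT_TRACE), ("one-visit", ONE_VISIT_TRACE)):
        print(name, "messages:", len(trace), "home visits:", home_visits(trace))
```

In this sketch the single-visit variant saves one message and one round of protocol processing at the home node per 3-hop transaction.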
The present invention also avoids the use of NAKs and services most 3-hop transactions with only a single visit to the home node. Exceptions include read transactions that require two visits to the home node because of a sharing write-back that is sent back to the home node. However, the present invention does not place ordering requirements on the underlying transaction-message interconnect/network.
In summary, the present invention is a system including a plurality of processor nodes configured to execute a cache coherence protocol that avoids the use of NAKs and ordering requirements on the underlying transaction-message interconnect/network and services most 3-hop transactions with only a single visit to the home node. Each node has access to a memory subsystem that stores a multiplicity of memory lines of information and a directory. Additionally, each node includes a memory cache for caching a multiplicity of memory lines of information stored in a memory subsystem accessible to other nodes. Further, a protocol engine is included in each node to implement the negative-acknowledgment-free cache coherence protocol. The protocol engine itself includes a memory transaction array for storing an entry related to a memory transaction, which includes a memory transaction state. A memory transaction concerns a memory line of information and includes a series of protocol messages, which are routed both within a given node and to other nodes. Also included in the protocol engine is logic for processing memory transactions. This processing includes advancing the memory transaction when predefined criteria are satisfied (e.g., receipt of a protocol message) and storing an updated state of the memory transaction in the memory transaction array.
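The protocol engine summarized above, with its memory transaction array and its logic for advancing a transaction on receipt of a protocol message, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: the state names, message names, transition table, and array size are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TransactionEntry:
    line_address: int  # memory line the transaction concerns
    state: str         # current memory transaction state


# Hypothetical transition table: (current state, received message) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("WAITING_FOR_REPLY", "data_reply"): "WAITING_FOR_INVAL_ACK",
    ("WAITING_FOR_INVAL_ACK", "inval_ack"): "DONE",
}


class ProtocolEngine:
    def __init__(self, num_entries: int = 4) -> None:
        # The memory transaction array: one slot per in-flight transaction.
        self.array: List[Optional[TransactionEntry]] = [None] * num_entries

    def start(self, line_address: int) -> int:
        """Allocate a free slot for a new transaction and return its index."""
        for index, entry in enumerate(self.array):
            if entry is None:
                self.array[index] = TransactionEntry(line_address, "WAITING_FOR_REPLY")
                return index
        raise RuntimeError("memory transaction array is full")

    def receive(self, index: int, message: str) -> str:
        """Advance the transaction if the message satisfies its current state."""
        entry = self.array[index]
        if entry is None:
            raise KeyError(f"no transaction in slot {index}")
        next_state = TRANSITIONS.get((entry.state, message))
        if next_state is not None:
            entry.state = next_state  # store the updated state in the array
        return entry.state
```

A short usage example: a message that does not match the current state leaves the entry unchanged, while a matching message advances it.

```python
engine = ProtocolEngine()
slot = engine.start(0x40)
engine.receive(slot, "inval_ack")   # criteria not satisfied; state is unchanged
engine.receive(slot, "data_reply")  # advances to WAITING_FOR_INVAL_ACK
engine.receive(slot, "inval_ack")   # advances to DONE
```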