The present invention relates generally to multiprocessor computer systems, and particularly to a multiprocessor system designed to be highly scalable, using efficient cache coherence logic and methodologies.
High-end microprocessor designs have become increasingly complex during the past decade, with designers continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. While this trend has led to significant performance gains on target applications such as the SPEC benchmark, continuing along this path is becoming less viable due to substantial increases in development team sizes and design times. Such designs are especially ill-suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC.
Commercial workloads such as databases and Web applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. Commercial workloads, such as on-line transaction processing (OLTP), exhibit radically different computer resource usage and behavior than technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates that are characteristic of such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads do not have any use for the high-performance floating-point and multimedia functionality that is implemented in modern microprocessors. Therefore, it is not uncommon for a high-end microprocessor to stall most of the time while executing commercial workloads, which leads to a severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, the above trends further question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
Fortunately, increasing chip densities and transistor counts provide architects with several alternatives for better tackling design complexities in general, and the needs of commercial workloads in particular. For example, the Alpha 21364 aggressively exploits semiconductor technology trends by including a scaled 1 GHz 21264 core, two levels of caches, a memory controller, coherence hardware, and a network router all on a single die. The tight coupling of these modules enables a more efficient and lower latency memory hierarchy that can substantially improve the performance of commercial workloads. Furthermore, the reuse of an existing high-performance processor core in designs such as the Alpha 21364 effectively addresses the design complexity issues and provides better time-to-market without sacrificing server performance. Higher transistor counts can also be used to exploit the inherent and explicit thread-level (or process-level) parallelism that is abundantly available in commercial workloads to better utilize on-chip resources. Such parallelism typically arises from relatively independent transactions or queries initiated by different clients, and has traditionally been used to hide I/O latency in such workloads. Previous studies have shown that techniques such as simultaneous multithreading (SMT) can provide a substantial performance boost for database workloads. In fact, the Alpha 21464 (the successor to the Alpha 21364) combines aggressive chip-level integration with an eight-instruction-wide out-of-order processor with SMT support for four simultaneous threads.
Typical invalidation and directory-based cache coherence protocols suffer from extra messages and protocol processing overheads for a number of protocol transactions. In particular, before a processor may write to a memory location, all the cached copies of that memory location must be invalidated to ensure that only up-to-date copies of the memory location are used. There may be a large number of cached copies of the memory location, so an equally large number of invalidation requests may have to be transmitted at virtually the same time. Such a large number of invalidation requests leads to delays or serialization bottlenecks, both when the invalidation requests are transmitted and when the invalidation acknowledgments are transmitted.
In summary, the present invention is a protocol engine for use in a multiprocessor computer system having a plurality of nodes. Each node includes an interface to a local memory subsystem, the local memory subsystem storing a multiplicity of memory lines of information and a directory, and a memory cache for caching a multiplicity of memory lines of information, including memory lines of information stored in a remote memory subsystem that is local to another node. The directory includes an entry associated with a memory line of information stored in the local memory subsystem. The directory entry includes an identification field for identifying sharer nodes that potentially cache the memory line of information.
The protocol engine is configured to format the identification field of a directory entry as a coarse vector, comprising a plurality of bits at associated positions within the identification field. The protocol engine associates with each respective bit of the identification field one or more nodes, including a respective first node. The nodes associated with each respective bit are determined by reference to the position of the respective bit within the identification field. The protocol engine furthermore sets each bit in the identification field for which the memory line is cached in at least one of the associated nodes.
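The coarse-vector format described above can be illustrated with a minimal sketch. The system size, the field width, and the modulo mapping from node to bit position are assumptions made only for this example; the text states only that the nodes associated with a bit are determined by the position of that bit, not how.

```python
NUM_NODES = 16    # assumed number of nodes in the system (hypothetical)
VECTOR_BITS = 4   # assumed width of the identification field (hypothetical)


def bit_for_node(node_id):
    """Bit position associated with a node (assumed mapping: modulo)."""
    return node_id % VECTOR_BITS


def nodes_for_bit(bit):
    """All nodes associated with a given bit position."""
    return [n for n in range(NUM_NODES) if bit_for_node(n) == bit]


def encode_coarse_vector(sharers):
    """Set each bit for which at least one associated node caches the line."""
    vector = 0
    for node_id in sharers:
        vector |= 1 << bit_for_node(node_id)
    return vector


def potential_sharers(vector):
    """Nodes that potentially cache the line according to the set bits."""
    return [n for n in range(NUM_NODES) if vector & (1 << bit_for_node(n))]
```

Under these assumptions, a line cached only by nodes 1 and 6 sets two bits, yet eight nodes are reported as potential sharers. This imprecision is why the field identifies nodes that "potentially" cache the line.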
In response to a request for exclusive ownership of a memory line, the protocol engine sends an initial invalidation request to no more than a first predefined number of the nodes associated with set bits in the identification field of the directory entry associated with the memory line.
In accordance with another aspect of the present invention, each of the nodes to which the initial invalidation request is sent forwards the invalidation request to another node, if any, that is a member of a sub-group of sharer nodes identified within the initial invalidation request. Those nodes, in turn, forward the invalidation request to yet other nodes, until the invalidation request is sent to all the sharer nodes identified in the initial invalidation request. The last nodes to receive the invalidation request send acknowledgments to the requesting node.
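The forwarding scheme described above might be simulated as follows. This is a sketch under stated assumptions, not the claimed implementation: the value chosen for the first predefined number, the round-robin partition of sharers into sub-groups, and the label "home" for the node whose protocol engine issues the initial requests are all hypothetical.

```python
MAX_INITIAL_REQUESTS = 2  # assumed value of the "first predefined number"


def split_into_subgroups(sharers, max_groups=MAX_INITIAL_REQUESTS):
    """Partition sharers into at most max_groups chains (assumed: round-robin)."""
    ordered = sorted(sharers)
    groups = [ordered[i::max_groups] for i in range(max_groups)]
    return [g for g in groups if g]


def invalidate(sharers, requester):
    """Return (invalidation messages, acknowledgments) as (sender, receiver) pairs."""
    messages = []
    acks = []
    for chain in split_into_subgroups(sharers):
        sender = "home"  # the home node sends only the first request of each chain
        for node in chain:
            messages.append((sender, node))
            sender = node  # each node forwards the request to the next in its sub-group
        acks.append((chain[-1], requester))  # the last node acknowledges to the requester
    return messages, acks
```

With five sharers and a limit of two, the home node in this sketch transmits two requests rather than five, and the requester receives two acknowledgments rather than five.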
In a preferred embodiment, the protocol engine is further configured to format the identification field of a directory entry in a limited pointer format when the number of nodes sharing the memory line corresponding to the directory entry is fewer than a second predefined number of nodes. When using the limited pointer format, the protocol engine stores in the identification field of the directory entry one or more node identifiers that identify nodes in which the memory line is cached. Furthermore, the protocol engine sends an invalidation request to no more than the first predefined number of the nodes whose node identifiers are stored in the identification field of the directory entry associated with the memory line.
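The choice between the two directory formats can be sketched as follows. The threshold value, the field width, and the modulo bit mapping are assumptions for illustration only; the text specifies only that the limited pointer format is used when the number of sharers is fewer than a second predefined number.

```python
POINTER_LIMIT = 4  # assumed value of the "second predefined number"
VECTOR_BITS = 4    # assumed width of the coarse vector (hypothetical)


def format_identification_field(sharers):
    """Return (format name, field contents) for a set of sharer node ids."""
    if len(sharers) < POINTER_LIMIT:
        # Limited pointer format: store the exact node identifiers.
        return ("limited_pointer", sorted(sharers))
    # Otherwise fall back to a coarse vector (assumed mapping: modulo).
    vector = 0
    for node_id in sharers:
        vector |= 1 << (node_id % VECTOR_BITS)
    return ("coarse_vector", vector)
```

In this sketch, a line shared by nodes 2 and 7 keeps exact identifiers, so invalidation requests go only to actual sharers. A line shared by five nodes is recorded as a bit pattern instead.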