The present invention relates generally to multiprocessor computer systems, and particularly to a multiprocessor system designed to be highly scalable, using efficient cache coherence logic and methodologies that implement store-conditional memory transactions when an associated directory entry is encoded as a coarse bit vector.
High-end microprocessor designs have become increasingly complex during the past decade, with designers continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. While this trend has led to significant performance gains on target applications such as the SPEC benchmark, continuing along this path is becoming less viable due to substantial increases in development team sizes and design times. Such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC.
Commercial workloads such as databases and Web applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. Commercial workloads, such as on-line transaction processing (OLTP), exhibit radically different computer resource usage and behavior than technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates that are characteristic of such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads do not have any use for the high-performance floating-point and multimedia functionality that is implemented in modern microprocessors. Therefore, it is not uncommon for a high-end microprocessor to stall most of the time while executing commercial workloads, which leads to a severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, the above trends further question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
Fortunately, increasing chip densities and transistor counts provide architects with several alternatives for better tackling design complexities in general, and the needs of commercial workloads in particular. For example, the Alpha 21364 aggressively exploits semiconductor technology trends by including a scaled 1 GHz 21264 core, two levels of caches, memory controller, coherence hardware, and network router all on a single die. The tight coupling of these modules enables a more efficient and lower latency memory hierarchy that can substantially improve the performance of commercial workloads. Furthermore, the reuse of an existing high-performance processor core in designs such as the Alpha 21364 effectively addresses the design complexity issues and provides better time-to-market without sacrificing server performance. Higher transistor counts can also be used to exploit the inherent and explicit thread-level (or process-level) parallelism that is abundantly available in commercial workloads to better utilize on-chip resources. Such parallelism typically arises from relatively independent transactions or queries initiated by different clients, and has traditionally been used to hide I/O latency in such workloads. Previous studies have shown that techniques such as simultaneous multithreading (SMT) can provide a substantial performance boost for database workloads. In fact, the Alpha 21464 (the successor to the Alpha 21364) combines aggressive chip-level integration along with an eight-instruction-wide out-of-order processor with SMT support for four simultaneous threads.
Typical directory-based cache coherence protocols suffer from extra messages and protocol processing overheads for a number of protocol transactions. These problems result from the mechanisms used to resolve races and deadlocks and to handle "3-hop" transactions that involve a remote node in addition to the requester and the home node (where the directory resides). For example, negative-acknowledgment messages (NAKs) are common in several cache coherence protocols for dealing with races and resolving deadlocks, which occur when two or more processors are unable to make progress because each requires a response from one or more of the others in order to do so. The use of NAKs also leads to non-elegant solutions for livelock, which occurs when two or more processors continuously change a state in response to changes in one or more of the others without making progress, and starvation, which occurs when a processor is unable to acquire resources.
Similarly, 3-hop transactions (e.g., requester sends a request, home forwards request to owner, owner replies to requester) typically involve two visits to the home node (along with the corresponding extra messages to the home) in order to complete the transaction. At least one cache coherence protocol avoids the use of NAKs and services most 3-hop transactions with only a single visit to the home node. However, this cache coherence protocol places strict ordering requirements on the underlying transaction-message interconnect/network, which go even beyond requiring point-to-point ordering. These strict ordering requirements are a problem because they make the design of the network more complex. It is much easier to design the routing layer if each packet can be treated independently of any other packet. Also, strict ordering leads to less than optimal use of the available network bandwidth.
The system and method disclosed in the parent application of the present application do not place ordering requirements on the underlying transaction-message interconnect/network, avoid the use of NAKs, and service most 3-hop transactions with only a single visit to the home node. Further, the parent application does not disclose a cache coherence protocol for handling store-conditional memory transactions.
However, many processor architectures, including the Alpha, support generalized atomic operations through store-conditional memory transactions. Most Alpha systems implement store-conditional memory transactions using a lock-flag and lock-address process. In such systems, if the lock-flag is still set when a store-conditional memory transaction is executed, then the locked line is in either an exclusive or a shared state. If it is in the exclusive state, then the store-conditional memory transaction immediately succeeds because the requesting node (i.e., the node initiating the store-conditional memory transaction) holds exclusive access to the line. If the line is in the shared state, then the requesting node must attempt to obtain exclusive access to the line and complete the store-conditional memory transaction by sending a store-conditional request to the home node (i.e., the node maintaining a directory entry for the line that is the subject of the store-conditional memory transaction). Since another node may get exclusive access to the line first, the store-conditional memory transaction may fail.
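The requester-side decision described above can be sketched as follows. This is a minimal illustration only, not the circuitry of any particular Alpha system: the names LineState, Outcome, and requester_store_conditional are hypothetical, and the handling of a cleared lock-flag (the transaction fails) and of an invalid line with the lock-flag still set (treated defensively as a failure) are assumptions added to make the sketch complete.

```python
from enum import Enum


class LineState(Enum):
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class Outcome(Enum):
    FAIL = "fail"
    SUCCEED_LOCALLY = "succeed locally"
    ASK_HOME = "send store-conditional request to home node"


def requester_store_conditional(lock_flag_set: bool, state: LineState) -> Outcome:
    """Decide what the requesting node does for a store-conditional."""
    # Assumption: a cleared lock-flag means the transaction fails outright.
    if not lock_flag_set:
        return Outcome.FAIL
    # Exclusive access is already held, so the store succeeds immediately.
    if state is LineState.EXCLUSIVE:
        return Outcome.SUCCEED_LOCALLY
    # Shared copy: exclusive access must be requested from the home node,
    # and that request may still fail if another node gets there first.
    if state is LineState.SHARED:
        return Outcome.ASK_HOME
    # Assumption: any other state is treated defensively as a failure.
    return Outcome.FAIL
```

Only the shared case involves the home node, which is why the directory representation used at the home node matters for store-conditional memory transactions.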
Some computer systems use a centralized directory scheme in which the directory entry for each memory line stores information indicating exactly which nodes share copies of the memory line. For instance, the directory entry may contain a bit vector, where each node of the computer system is uniquely represented by one of the bits of the bit vector (i.e., each bit of the bit vector represents no more than one of the nodes). When a bit of the bit vector is set, a node corresponding to the bit is a sharer (i.e., has a copy) of the memory line corresponding to the directory entry. The bit vector in these types of directory entries is sometimes called an "exact bit vector." In computer systems having a centralized directory scheme and directory entries with exact bit vectors, the success or the failure of the store-conditional memory transaction is decided at the home node. If a store-conditional request reaches the home node, and the bit in the directory entry corresponding to the requesting node is not set (i.e., not set to 1), then some other node has already modified the line. The home node must then send a reply to the requesting node indicating that the store-conditional request, and thus the store-conditional memory transaction, has failed. Conversely, if the bit corresponding to the requesting node is set (i.e., set to 1), the requesting node still has a copy of the line. In this case the home node modifies the state of the line to exclusive and sends a reply indicating that the store-conditional request, and thus the store-conditional memory transaction, is successful. The home node also sends out invalidation requests to all the other nodes that have a copy of the line (as indicated by the other bits of the directory entry for the memory line). It is important, in order to avoid livelock, that no invalidations are sent out when a store-conditional memory transaction fails.
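The home-node decision for an exact bit vector can be sketched as follows. The sketch is illustrative only; ExactEntry and home_store_conditional_exact are hypothetical names, and holding the bit vector in a Python integer is an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass
class ExactEntry:
    sharers: int = 0         # bit i set means node i holds a copy
    exclusive: bool = False  # True once a single node owns the line


def home_store_conditional_exact(entry: ExactEntry, requester: int):
    """Return (succeeded, nodes_to_invalidate) for a store-conditional request."""
    # Requester's bit clear: another node already modified the line.
    # The request fails and no invalidations are sent (avoids livelock).
    if not (entry.sharers >> requester) & 1:
        return False, []
    # Requester's bit set: every other sharer must be invalidated.
    others = [
        n for n in range(entry.sharers.bit_length())
        if n != requester and (entry.sharers >> n) & 1
    ]
    # The line becomes exclusive to the requester.
    entry.sharers = 1 << requester
    entry.exclusive = True
    return True, others
```

For example, with nodes 0, 1, and 3 recorded as sharers, a request from node 1 succeeds and nodes 0 and 3 receive invalidation requests; a later request from node 0 fails without generating any invalidations.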
Other computer systems that have a centralized directory scheme use a directory entry for each memory line that stores only "coarse" information indicating which nodes share copies of the memory line. For instance, the directory entry may contain a coarse bit vector, where at least one bit of the bit vector represents two or more nodes. When a bit of the bit vector is set, at least one node (from the set of nodes corresponding to the bit) is a sharer (i.e., has a copy) of the memory line corresponding to the directory entry. The bit vector in these types of directory entries is sometimes called a "coarse bit vector."
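The loss of precision in a coarse bit vector can be illustrated with a short sketch. The mapping of nodes to bits used here (consecutive groups of nodes_per_bit nodes share one bit) is an assumption chosen for illustration; the function names are hypothetical.

```python
def coarse_bit(node: int, nodes_per_bit: int) -> int:
    """Assumed mapping: consecutive groups of nodes share one bit."""
    return node // nodes_per_bit


def add_sharer(vector: int, node: int, nodes_per_bit: int) -> int:
    """Record a sharer by setting the bit for the node's group."""
    return vector | (1 << coarse_bit(node, nodes_per_bit))


def possible_sharers(vector: int, nodes_per_bit: int, num_nodes: int) -> list:
    """List every node that the vector cannot rule out as a sharer."""
    return [
        n for n in range(num_nodes)
        if (vector >> coarse_bit(n, nodes_per_bit)) & 1
    ]
```

With 16 nodes and four nodes per bit, recording node 5 as a sharer sets bit 1, after which nodes 4, 5, 6, and 7 are all indistinguishable candidates from the directory's point of view.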
In a multi-node computer system that uses directory entries having coarse bit vectors, the home node for a memory line is unable to determine from the directory entry for the memory line exactly which nodes have copies of the memory line. The home node cannot, therefore, determine if the requesting node of a store-conditional memory transaction (i.e., the sender of the store-conditional request) still has a copy of the line and therefore cannot determine based solely on the information stored in the directory entry whether the store-conditional request, and thus the store-conditional memory transaction, should succeed.
In the computer system disclosed in the parent application, the directory entry for a memory line transitions from an exact sharer node representation to a coarse bit vector representation when (A) the number of nodes in the computer system exceeds the number of bits in the directory entry's bit vector and (B) the number of nodes that are sharers of the memory line exceeds a threshold value (e.g., the threshold value is four nodes in one embodiment).
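Conditions (A) and (B) can be written as a single predicate. The function name and parameter names below are hypothetical; the default threshold of four follows the embodiment mentioned above.

```python
def uses_coarse_vector(num_nodes: int, vector_bits: int,
                       num_sharers: int, threshold: int = 4) -> bool:
    """True when the entry must switch to a coarse bit vector."""
    # (A) more nodes than bits, so one bit per node is impossible, and
    # (B) more sharers than the threshold for the exact representation.
    return num_nodes > vector_bits and num_sharers > threshold
```

Both conditions must hold: a system with fewer nodes than vector bits never needs the coarse form, and a line with few sharers can keep its exact representation even in a large system.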
The present invention is an extension of a system including a plurality of processor nodes configured to execute a cache coherence protocol that avoids the use of negative acknowledgments (NAKs) and ordering requirements on the underlying transaction-message interconnect/network, services most 3-hop transactions with only a single visit to the home node, and implements store-conditional memory transactions when a directory entry corresponding to the subject of a store-conditional memory transaction is in a coarse vector representation.
In one aspect of the invention, a store-conditional memory transaction succeeds if a directory tracking the state of a memory line of information unambiguously indicates that the requesting node is the exclusive owner of the memory line of information, if the directory ambiguously indicates that the requesting node is sharing the memory line of information and the requesting node is in fact sharing the memory line of information, or if the directory unambiguously indicates that the requesting node is sharing the memory line of information. But the store-conditional memory transaction fails if the directory unambiguously indicates that the requesting node is not sharing the memory line of information.
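The four outcomes listed above can be summarized in a small decision function. This is a sketch under stated assumptions: DirectoryView and store_conditional_succeeds are hypothetical names, and the result for the case where the directory is ambiguous but the requesting node is not in fact sharing the line (failure) is an inference from the passage rather than something it states outright.

```python
from enum import Enum


class DirectoryView(Enum):
    EXCLUSIVE_OWNER = "unambiguously exclusive owner"
    SHARING_EXACT = "unambiguously sharing"
    SHARING_COARSE = "ambiguously sharing (coarse bit vector)"
    NOT_SHARING = "unambiguously not sharing"


def store_conditional_succeeds(view: DirectoryView,
                               requester_has_copy: bool) -> bool:
    """Decide the outcome of a store-conditional memory transaction."""
    if view is DirectoryView.NOT_SHARING:
        return False
    if view is DirectoryView.SHARING_COARSE:
        # The directory alone cannot decide; the requesting node's actual
        # state settles it. Failure when no copy is held is an assumption.
        return requester_has_copy
    # Exclusive owner, or exact sharer: the directory alone suffices.
    return True
```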
In another aspect of the invention, a protocol engine included in a requesting node sends a store-conditional request concerning an identified memory line of information to a home node, which stores the identified memory line of information. A protocol engine included in the home node sends a may-succeed reply to the requesting node in response to the store-conditional request if a directory included in the home node ambiguously indicates that the requesting node is sharing the memory line of information. The protocol engine included in the requesting node sends a responsive protocol message to the home node in response to the may-succeed reply if the requesting node is sharing the memory line of information. The protocol engine included in the home node modifies an entry in the directory corresponding to the memory line of information to indicate that the requesting node is the exclusive owner of the memory line of information in response to the responsive protocol message. Finally, either the requesting node or the home node sends an invalidation request to other nodes sharing the memory line of information. These nodes respond by sending an invalidation acknowledgment to the requesting node and invalidating a local copy of the memory line of information.
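The message sequence of this aspect can be traced with a small simulation. It is a simplified sketch, not the protocol engines themselves: the Node class, the function name, and the message labels are hypothetical; the node-to-bit mapping (node_id // nodes_per_bit) is an assumption; invalidation requests are shown as sent by the home node, although the text allows either node to send them; and every node covered by a set bit is assumed to receive an invalidation request and to acknowledge it, whether or not it actually holds a copy.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    has_copy: bool = False
    acks_received: int = 0


def store_conditional_coarse(nodes, requester_id, coarse_vector, nodes_per_bit):
    """Return (succeeded, message log) for one store-conditional request."""
    log = [("store-conditional request", requester_id, "home")]
    requester = nodes[requester_id]
    # Requester's group bit clear: it is unambiguously not a sharer.
    if not (coarse_vector >> (requester_id // nodes_per_bit)) & 1:
        log.append(("fail reply", "home", requester_id))
        return False, log
    # Bit set: the directory is ambiguous, so the home node asks.
    log.append(("may-succeed reply", "home", requester_id))
    if not requester.has_copy:
        # Assumption: without a copy the requester sends no responsive
        # message and the transaction fails.
        return False, log
    log.append(("responsive message", requester_id, "home"))
    # The home node would now mark the requester as exclusive owner,
    # then invalidate every other node covered by a set bit.
    for node in nodes:
        if node.node_id == requester_id:
            continue
        if (coarse_vector >> (node.node_id // nodes_per_bit)) & 1:
            log.append(("invalidation request", "home", node.node_id))
            node.has_copy = False
            log.append(("invalidation ack", node.node_id, requester_id))
            requester.acks_received += 1
    return True, log
```

With eight nodes, two nodes per bit, and bits 0 and 2 set, a request from node 1 (which holds a copy) succeeds; nodes 0, 4, and 5 each receive an invalidation request and acknowledge it to node 1.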