The present invention relates generally to cache coherence logic for a multiprocessor computer system, and particularly to circuitry and methodologies for generating directory entries and error correction codes for the memory lines stored in the memories of a cache coherent multiprocessor system.
This application is related to the following U.S. patent applications, all of which are hereby incorporated by reference:
Cache Coherence Protocol Engine and Method for Processing Memory Transaction in Distinct Address Subsets During Interleaved Time Periods in a Multiprocessor System, Ser. No. 09/878,983, filed Jun. 11, 2001, attorney docket number 9772-0322-999;
Scalable Multiprocessor System And Cache Coherence Method, Ser. No. 09/878,982, filed Jun. 11, 2001, attorney docket number 9772-0326-999;
System and Method for Daisy Chaining Cache Invalidation Requests in a Shared-memory Multiprocessor System, Ser. No. 09/878,985, filed Jun. 11, 2001, attorney docket number 9772-0329-999; and
Multiprocessor Cache Coherence System and Method in Which Processor Nodes and Input/Output Nodes Are Equal Participants, Ser. No. 09/878,984, filed Jun. 11, 2001, attorney docket number 9772-0324-999.
High-end microprocessor designs have become increasingly complex during the past decade, with designers continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. While this trend has led to significant performance gains on target applications such as the SPEC benchmark, continuing along this path is becoming less viable due to substantial increases in development team sizes and design times. Such designs are especially ill-suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC.
Commercial workloads such as databases and Web applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. Commercial workloads, such as on-line transaction processing (OLTP), exhibit radically different computer resource usage and behavior than technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates that are characteristic of such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads do not have any use for the high-performance floating-point and multimedia functionality that is implemented in modern microprocessors. Therefore, it is not uncommon for a high-end microprocessor to stall most of the time while executing commercial workloads, which leads to a severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, the above trends further question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
Fortunately, increasing chip densities and transistor counts provide architects with several alternatives for better tackling design complexities in general, and the needs of commercial workloads in particular. For example, the Alpha 21364 aggressively exploits semiconductor technology trends by including a scaled 1 GHz 21264 core, two levels of caches, a memory controller, coherence hardware, and a network router all on a single die. The tight coupling of these modules enables a more efficient and lower-latency memory hierarchy that can substantially improve the performance of commercial workloads. Furthermore, integrated designs such as the Alpha 21364 provide glueless scalable multiprocessing, whereby a large server can be built in a modular fashion using only processor and memory chips.
The integration of the coherence hardware and memory controllers on the same chip leads to interesting design choices for how and where to store directory information. One extremely attractive option is to support directory data with virtually no memory space overhead by computing memory error correction codes (ECCs) at a coarser granularity and utilizing the unused bits for storing the directory information.
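The effect of coarser ECC granularity can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: it assumes a Hamming-style single-error-correcting, double-error-detecting (SEC-DED) code, a 512-bit memory line, and a memory that provides enough check-bit storage for SEC-DED over each 64-bit word. None of these figures is taken from the text above, and the function names are hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming SEC-DED code over data_bits bits (assumed code)."""
    r = 0
    # Smallest r with 2**r >= data_bits + r + 1 gives single-error correction.
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit for double-error detection


def spare_bits(line_bits: int = 512, granule_bits: int = 256,
               baseline_bits: int = 64) -> int:
    """Check-bit storage left unused when ECC is computed per granule_bits.

    Assumes the memory stores enough check bits for SEC-DED over every
    baseline_bits-wide word of the line (an illustrative assumption).
    """
    budget = (line_bits // baseline_bits) * secded_check_bits(baseline_bits)
    needed = (line_bits // granule_bits) * secded_check_bits(granule_bits)
    return budget - needed


for granule in (64, 128, 256):
    print(granule, spare_bits(granule_bits=granule))
```

Under these assumptions, widening the ECC granule leaves part of the existing check-bit storage unused, and that unused storage is what could hold directory information.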
Given the trend towards larger main memories, dedicated directory storage can become a significant cost factor. Similarly, providing a dedicated datapath for a separate external directory memory reduces the number of pins available for supporting memory and interconnect bandwidth. Therefore, it would be desirable to avoid or reduce the costs, interconnect and latency problems associated with large main memories and their directory storage.
In a multiprocessor computer system, each respective node includes a main memory, a cache memory system and logic. The main memory stores data in a plurality of memory lines with a directory entry for each memory line. The directory entry indicates whether a copy of the corresponding memory line is stored in the cache memory system in another node. The cache memory system stores copies of memory lines from the main memories of the various nodes, and furthermore stores cache state information indicating whether the cached copy of each memory line is an exclusive copy of the memory line.
The logic of each respective node is configured to respond to a transaction request for a particular memory line and its associated directory entry, where the respective node is the home node of the particular memory line. When the cache memory system of the home node stores a copy of the particular memory line and the cache state information indicates that the copy of the particular memory line is an exclusive copy, the logic responds to the request by sending the copy of the particular memory line retrieved from the cache memory system and a predefined null directory entry value, and thus does not retrieve the memory line and its directory entry from the main memory of the home node.
In another aspect of the present invention, the logic of each respective node is further configured to respond to the transaction request, when the cache memory system of the respective node does not store an exclusive copy of the particular memory line, by retrieving the particular memory line and the associated directory entry from the main memory of the respective node and sending the retrieved particular memory line and associated directory entry.
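The two response paths described above can be sketched in software as follows. This is a minimal model for illustration only, not the circuitry of the invention: the class names, the dictionary-based memory and cache, the `memory_reads` counter, and the value chosen for the null directory entry are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

NULL_DIRECTORY_ENTRY = 0  # hypothetical encoding of the predefined null value


@dataclass
class CachedLine:
    data: bytes
    exclusive: bool  # cache state: is this an exclusive copy?


@dataclass
class HomeNode:
    # address -> (memory line, directory entry); a stand-in for main memory
    main_memory: Dict[int, Tuple[bytes, int]] = field(default_factory=dict)
    # address -> cached copy held by this node's cache memory system
    cache: Dict[int, CachedLine] = field(default_factory=dict)
    memory_reads: int = 0  # counts main-memory accesses, for illustration

    def handle_request(self, address: int) -> Tuple[bytes, int]:
        """Return (memory line, directory entry) for a line homed at this node."""
        cached = self.cache.get(address)
        if cached is not None and cached.exclusive:
            # Exclusive copy in the home node's cache: reply from the cache
            # with the null directory entry and skip the main-memory read.
            return cached.data, NULL_DIRECTORY_ENTRY
        # Otherwise read both the line and its directory entry from memory.
        self.memory_reads += 1
        return self.main_memory[address]


node = HomeNode(main_memory={0x40: (b"stale", 5)},
                cache={0x40: CachedLine(b"fresh", exclusive=True)})
line, directory_entry = node.handle_request(0x40)
```

In this model the exclusive-copy path never touches `main_memory`, which mirrors the statement that the memory line and its directory entry are not retrieved from the main memory of the home node in that case.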
Yet another aspect of the present invention relates to an efficient ECC-based directory implementation for use in a scalable multiprocessor computer system. A combined ECC is used to detect and correct errors in both the data and directory entry of each memory line. The use of a combined data and directory ECC reduces the number of ECC bits needed for each memory line, which in turn permits a larger number of bits to be used for directory storage than if separate ECCs were maintained for the data and the directory entry of each memory line. New logic and methods are used to eliminate or reduce potential problems associated with using a combined ECC for the data and directory.
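The saving from a combined ECC can be illustrated by comparing how many directory bits fit in a fixed-size storage block under each scheme. The sketch below is an illustration under assumptions that are not drawn from the text above: a Hamming-style SEC-DED code, and a hypothetical 288-bit storage block holding 256 data bits plus the directory bits and all check bits. The function names are likewise hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming SEC-DED code over data_bits bits (assumed code)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # extra overall parity bit for double-error detection


def max_directory_bits(block_bits: int, data_bits: int, combined: bool) -> int:
    """Largest directory field that fits in block_bits alongside data and ECC."""
    best = 0
    for dir_bits in range(block_bits - data_bits + 1):
        if combined:
            # One code word covers the data and the directory bits together.
            ecc = secded_check_bits(data_bits + dir_bits)
        else:
            # Separate code words for the data and for the directory bits.
            dir_ecc = secded_check_bits(dir_bits) if dir_bits else 0
            ecc = secded_check_bits(data_bits) + dir_ecc
        if data_bits + dir_bits + ecc <= block_bits:
            best = dir_bits
    return best


combined_bits = max_directory_bits(288, 256, combined=True)
separate_bits = max_directory_bits(288, 256, combined=False)
print(combined_bits, separate_bits)
```

With these assumed sizes the combined scheme leaves more room for directory bits, because a second code word carries its own fixed overhead of check bits while extending an existing code word by a few bits typically adds none.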