1. Technical Field
This invention relates generally to processing local memory-related transactions within a node of a cache coherent non-uniform memory access (NUMA) system, and more particularly to processing such transactions in which information regarding the access of the local memory by other nodes is needed.
2. Description of the Prior Art
There are many different types of multi-processor computer systems. A Symmetric Multi-Processor (SMP) system includes a number of processors that share a common memory. SMP systems provide scalability for multithreaded applications and allow multiple threads to run simultaneously. As needs dictate, additional processors, memory, or I/O can be added. SMP systems usually range from two to 128 or more processors. One processor generally boots the system and loads the SMP operating system, which brings the other processors online. Without partitioning, there is only one instance of the operating system in memory. Since all processors access the same memory, sharing of data can be accomplished simply by placing the data in memory. The operating system uses the processors as a pool of processing resources, all executing simultaneously, where each processor either processes data or is in an idle loop waiting to perform a task. SMP system throughput increases as more processes are overlapped, until all processors are fully utilized.
A Massively Parallel Processor (MPP) system can use thousands of processors or more. MPP systems use a different programming paradigm from that of the more common SMP systems. In an MPP system, each processor has its own memory and its own copy of the operating system and application. Each subsystem communicates with the others through a high-speed interconnect. To use an MPP system effectively, an information-processing problem should be breakable into pieces that can be solved simultaneously. The problem must be broken down so that nodes explicitly communicate shared information via a message passing interface over the interconnect. For example, in scientific environments, certain simulations and mathematical problems can be split apart and each part processed at the same time.
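The message-passing style described above can be sketched as follows. This is only an illustration under stated assumptions: threads and a queue stand in for MPP nodes and the interconnect, and the names `worker` and `parallel_sum` are hypothetical, not taken from any particular MPP system.

```python
import queue
import threading


def worker(rank, chunk, outbox):
    # Each worker sees only its own chunk, standing in for a node's private memory.
    # The partial result is shared only by sending an explicit message.
    outbox.put((rank, sum(chunk)))


def parallel_sum(data, num_workers):
    # The queue stands in for the high-speed interconnect (an assumption).
    outbox = queue.Queue()
    # Break the problem into pieces that can be solved simultaneously.
    chunks = [data[i::num_workers] for i in range(num_workers)]
    threads = [
        threading.Thread(target=worker, args=(rank, chunk, outbox))
        for rank, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Collect one message per worker and combine the partial results.
    partials = [outbox.get() for _ in range(num_workers)]
    return sum(value for _, value in partials)
```

In contrast with the SMP case, no worker reads another worker's data directly; everything shared travels as a message.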
A non-uniform memory access (NUMA) system is a multi-processing system in which memory is separated into distinct banks. A NUMA system is a type of SMP system. In uniform memory access (UMA)-SMP systems, all processors access a common memory at the same speed. NUMA systems are usually broken up into nodes, or building blocks, each containing one to eight, or more, processors. The nodes typically also contain a portion of the global memory. The memory local to a node typically is closer, in both physical and logical proximity, than memory in more distant parts of the system, and thus is accessed faster. That is, local memory is accessed faster than distant shared memory. NUMA systems generally scale better to higher numbers of processors than UMA-SMP systems, because the distribution of memory causes less contention in the memory controller.
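The difference between local and remote access can be illustrated with a minimal sketch. The latency figures, the number of lines per node, and the contiguous address-to-node mapping are all assumptions chosen for illustration; actual values depend on the hardware.

```python
# Hypothetical figures for illustration only; real latencies are hardware-specific.
LOCAL_ACCESS_NS = 100
REMOTE_ACCESS_NS = 300
LINES_PER_NODE = 1024  # assumed size of each node's portion of global memory


def home_node(line: int) -> int:
    """Return the node whose local memory holds the given memory line."""
    return line // LINES_PER_NODE


def access_latency_ns(requesting_node: int, line: int) -> int:
    """Local memory is accessed faster than distant shared memory."""
    if home_node(line) == requesting_node:
        return LOCAL_ACCESS_NS
    return REMOTE_ACCESS_NS
```

For example, under these assumptions line 1500 lives on node 1, so node 1 reads it at the local latency while node 0 pays the remote latency.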
Each building block, or node, typically caches the distant shared, or remote, memory, either in cache memory internal to the processor or in node-level cache memories, to improve memory access performance. The node where the memory resides is referred to as the home node. A coherency controller within the home node is used to track which copy of the line of memory is valid (the copy in memory or the copy in a remote cache memory) and where copies of the memory line are cached. A line of memory, or a memory line, is generally considered one or more memory locations within the memory that are capable of storing data. A line of memory may, for instance, correspond to one or more bytes of memory, or one or more words of memory.
The coherency controller ensures that the correct copy of a line is accessed and that cached copies are kept up to date. The coherency controller may issue operations for a cache memory line to effect a transaction request. The coherency controller transmits operations to remote coherency controllers to read or invalidate copies of the line of memory that is being cached, and reads data from local memory when needed. So that information regarding the remote caching of the local memory need not be constantly sent among the nodes, such information is stored at the home node for the local memory, in what is referred to as a directory. That is, without a directory, the home node would have to poll every other node in the system to determine whether the home node's local memory is being remotely cached by these other nodes, which can cause significant traffic on the interconnect connecting the nodes to one another. Having a directory within the home node that stores information regarding whether the other nodes are remotely caching the home node's local memory means that the home node does not have to constantly poll the other nodes of the system to determine the caching status of the home node's local memory.
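The traffic saving that a directory provides can be sketched as follows. This is a simplified illustration, not the structure of any actual coherency controller; the class `HomeDirectory` and its method names are hypothetical.

```python
from collections import defaultdict


class HomeDirectory:
    """Hypothetical directory kept at a home node for its local memory."""

    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.num_nodes = num_nodes
        # Maps a local memory line to the set of remote nodes caching it.
        self.sharers = defaultdict(set)

    def record_remote_read(self, line, remote_node):
        # Note that a remote node now holds a cached copy of the line.
        self.sharers[line].add(remote_node)

    def nodes_to_contact_with_directory(self, line):
        # Only nodes known to cache the line need to be contacted.
        return sorted(self.sharers.get(line, set()))

    def nodes_to_contact_without_directory(self):
        # Without a directory, every other node would have to be polled.
        return [n for n in range(self.num_nodes) if n != self.node_id]
```

In an eight-node system where only node 3 caches a given line, the directory narrows an invalidation to one message, whereas polling would require seven.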
The directory can be either a full directory, in which each line in memory has a directory entry, or a sparse directory, in which each directory entry can store caching information regarding one of a number of different memory lines, such that the directory is considered a cache of directory entries. In a sparse directory there is a tag entry within the directory cache memory for each cache memory location within the directory cache memory. The tag entry may indicate, for instance, what memory location is being cached at its corresponding cache memory location, what other nodes are caching the memory location in their cache memories, and the status of the cache memory location. For performance reasons, directories are usually constructed from fast memory, so that memory accesses throughout the system are not unduly slowed.
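A sparse directory of the kind described above can be sketched as a small direct-mapped cache of tag entries. The direct-mapped indexing, the `state` values, and the policy of returning a displaced entry to the caller are all assumptions made for illustration; the names `TagEntry` and `SparseDirectory` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TagEntry:
    line: int                                       # memory location being tracked
    sharers: Set[int] = field(default_factory=set)  # nodes caching that location
    state: str = "shared"                           # assumed status encoding


class SparseDirectory:
    """Direct-mapped cache of directory entries (an illustrative assumption)."""

    def __init__(self, num_entries: int):
        self.entries: List[Optional[TagEntry]] = [None] * num_entries

    def _index(self, line: int) -> int:
        # Many memory lines map to the same directory entry.
        return line % len(self.entries)

    def lookup(self, line: int) -> Optional[TagEntry]:
        entry = self.entries[self._index(line)]
        if entry is not None and entry.line == line:
            return entry
        return None

    def allocate(self, line: int, remote_node: int) -> Optional[TagEntry]:
        """Track remote_node as a sharer of line; return any displaced entry."""
        idx = self._index(line)
        current = self.entries[idx]
        if current is not None and current.line == line:
            current.sharers.add(remote_node)
            return None
        self.entries[idx] = TagEntry(line, {remote_node})
        # The caller would have to deal with the displaced line's remote copies.
        return current
```

A full directory would instead hold one entry per memory line, so no entry would ever be displaced, at the cost of far more directory storage.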
However, the utilization of even fast tag memory can slow down processing of memory-related transactions within a node. Processing of such transactions usually occurs within a coherency controller of the node. The coherency controller of the node has to access the directory, which may be located outside of the controller, or implemented within embedded memory of the coherency controller, in order to process a given memory-related transaction. Even where the tag memory is fast and embedded within the coherency controller, transaction processing time is lengthened because the controller cannot complete the processing until the directory access is completed. For these and other reasons, therefore, there is a need for the present invention.
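The latency effect described above can be made concrete with a toy model. It assumes the directory access and the remaining transaction work are strictly serialized, and the nanosecond figures are hypothetical; the function name `transaction_latency_ns` is likewise an illustration only.

```python
def transaction_latency_ns(directory_access_ns: int, remaining_work_ns: int) -> int:
    """Latency when the directory access must finish before the remaining work.

    Assumes strict serialization: the controller cannot complete processing
    until the directory access has completed.
    """
    return directory_access_ns + remaining_work_ns


# Hypothetical figures: an external directory versus faster embedded tag memory.
EXTERNAL_DIRECTORY_NS = 40
EMBEDDED_DIRECTORY_NS = 10
REMAINING_WORK_NS = 50

external_total = transaction_latency_ns(EXTERNAL_DIRECTORY_NS, REMAINING_WORK_NS)
embedded_total = transaction_latency_ns(EMBEDDED_DIRECTORY_NS, REMAINING_WORK_NS)
```

Under this model, even the embedded case carries the full directory access time on every transaction, which is the cost the paragraph above identifies.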