1. Field of the Invention
The present invention generally relates to computer systems, specifically computer cache memory, and more particularly to a method of managing a distributed cache structure of a multiprocessor computer system.
2. Description of the Related Art
The basic structure of a conventional computer system 10 is shown in FIG. 1. Computer system 10 may have one or more processing units, two of which 12a and 12b are depicted, which are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (such as random access memory or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent memory device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices and memory by various means, including a generalized interconnect or bus 20. Computer system 10 may have many additional components which are not shown, such as serial, parallel and universal bus ports for connection to, e.g., modems, printers or network interface cards. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access memory 16, etc. Also, instead of connecting I/O devices 14 directly to bus 20, they may be connected to one or more secondary (I/O) buses via I/O bridges connected to bus 20. The computer can have more than two processing units.
In a symmetric multiprocessor (SMP) computer, all of the processing units are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. A typical architecture is shown in FIG. 1. A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. The processing unit can also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from the more remote memory 16. These caches are referred to as “on-board” when they are integrally packaged with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory.
A processing unit 12 can include additional caches, such as cache 30, which is referred to as a level 2 (L2) cache since it supports the on-board (level 1) caches 24 and 26. In other words, cache 30 acts as an intermediary between memory 16 and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. For example, cache 30 may be a chip having a storage capacity of 512 kilobytes, while the processor may have on-board caches with 64 kilobytes of total storage. Cache 30 is connected to bus 20, and all loading of information from memory 16 into processor core 22 usually comes through cache 30. Although FIG. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided where there are many levels of interconnected caches. Furthermore, cache 30 may also be an on-board cache. Caches are said to be horizontally oriented when they are on the same level of the memory hierarchy (e.g., caches 24 and 26), and are said to be vertically oriented when they are on different levels of the memory hierarchy (e.g., caches 24 and 30).
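By way of illustration only, the intermediary role of the L2 cache described above can be sketched as a lookup that tries each level in turn. The function name `load`, the use of unbounded dictionaries for each level, and the fill-on-miss behavior are assumptions made for this sketch and are not taken from FIG. 1.

```python
def load(address, l1, l2, memory):
    """Return (value, source) for address, filling the upper levels on a miss.

    l1, l2 and memory are plain dicts keyed by address (an assumption of
    this sketch; capacity limits and eviction are deliberately omitted).
    """
    if address in l1:                 # fastest case: on-board cache hit
        return l1[address], "L1"
    if address in l2:                 # L2 supplies the value and L1 is filled
        l1[address] = l2[address]
        return l1[address], "L2"
    value = memory[address]           # slowest case: go to main memory
    l2[address] = value               # the value passes through L2 ...
    l1[address] = value               # ... on its way to L1
    return value, "memory"
```

The first load of an address is served by memory, and a repeated load of the same address is then served by the L1 cache.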
A cache has many blocks which individually store the various instructions and data values. The blocks in any cache are divided into groups of blocks called sets or congruence classes. A set is the collection of cache blocks that a given memory block can reside in. For any given memory block, there is a unique set in the cache that the block can be mapped into, according to preset (variable) mapping functions. The number of blocks in a set is referred to as the associativity of the cache, e.g., 2-way set associative means that for any given memory block there are two blocks in the cache that the memory block can be mapped into; however, several different blocks in main memory can be mapped to any given set. A 1-way set associative cache is direct mapped, that is, there is only one cache block that can contain a particular memory block. A cache is said to be fully associative if a memory block can occupy any cache block, i.e., there is one congruence class, and the address tag is the full address of the memory block.
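One simple mapping function of the kind described above might be sketched as follows. The function name `map_address`, the 64-byte block size, and the 128 sets are illustrative assumptions; actual mapping functions vary from design to design.

```python
def map_address(address, block_size=64, num_sets=128):
    """Split a byte address into (tag, set_index, offset).

    block_size and num_sets are assumed values chosen for illustration.
    """
    block_number = address // block_size   # which memory block holds the byte
    offset = address % block_size          # byte position inside that block
    set_index = block_number % num_sets    # the one congruence class it maps to
    tag = block_number // num_sets         # remaining bits identify the block
    return tag, set_index, offset
```

With these assumed parameters, `map_address(0x12345)` returns `(9, 13, 5)`, and the address 8192 bytes higher maps to the same set 13 with tag 10, showing how several memory blocks share one set. Passing `num_sets=1` models the fully associative case, in which the tag is the entire block number.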
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit field and inclusivity bit fields are used to maintain cache coherency in a multiprocessor computer system (to indicate the validity of the value stored in the cache). The address tag is usually a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache “hit.” The collection of all of the address tags in a cache (and sometimes the state bit and inclusivity bit fields) is referred to as a directory, and the collection of all of the value fields is the cache entry array.
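For illustration, the fields of such a cache line and the tag comparison that signals a hit might be modeled as below. The class name `CacheLine`, the string encoding of the state field, and the function `lookup` are hypothetical; a real directory performs this comparison in hardware.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    tag: int = 0
    state: str = "invalid"    # hypothetical encoding of the state bit field
    inclusive: bool = False   # inclusivity bit field
    value: bytes = b""        # the stored instruction or data


def lookup(cache_set, tag):
    """Return the valid line whose tag matches (a hit), or None (a miss)."""
    for line in cache_set:
        if line.state != "invalid" and line.tag == tag:
            return line
    return None
```

In this sketch, a line whose state is invalid never produces a hit even if its tag matches, which reflects the role of the state field in indicating the validity of the stored value.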
When all of the blocks in a congruence class for a given cache are full and that cache receives a request, whether a “read” or “write,” to a memory location that maps into the full congruence class, the cache must evict one of the blocks currently in the class. The cache chooses a block by one of a number of means, such as a least recently used (LRU) algorithm, random selection, pseudo-LRU, etc. If the data in the chosen block is modified, that data is written to the next lower level in the memory hierarchy, which may be another cache (in the case of an L1 cache) or main memory (in the case of an L2 cache), as depicted in the two-level architecture of FIG. 1. By the principle of inclusion, the lower level of the hierarchy will already have a block available to hold the written modified data. However, if the data in the chosen block is not modified, the block can be simply abandoned and not written to the next lower level in the hierarchy. At the end of this process, the cache no longer holds a copy of the evicted block. When a device such as the CPU or system bus needs to know if a particular cache line is located in a given cache, it can perform a “snoop” request to see if the address is in the directory for that cache. Various techniques have been devised to optimize cache usage, such as special cache instructions and coherency states.
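The eviction behavior described above can be sketched for a single congruence class using LRU replacement. The class name `CacheSet`, the dictionary standing in for the next lower level, and the choice to allocate a block on a write miss are assumptions of this sketch only.

```python
from collections import OrderedDict


class CacheSet:
    """One congruence class with LRU replacement (illustrative sketch)."""

    def __init__(self, ways, lower_level):
        self.ways = ways
        self.lower_level = lower_level   # dict standing in for the next lower level
        self.blocks = OrderedDict()      # tag -> (value, modified), oldest first

    def access(self, tag, value=None):
        """Read the block (value is None) or write value into it."""
        if tag in self.blocks:
            old_value, modified = self.blocks.pop(tag)
            if value is None:
                value = old_value        # read hit keeps the modified flag
            else:
                modified = True          # write hit marks the block modified
        else:
            if len(self.blocks) >= self.ways:
                self._evict()            # class is full: make room first
            if value is None:
                value, modified = self.lower_level.get(tag), False
            else:
                modified = True          # assumed write-allocate behavior
        self.blocks[tag] = (value, modified)   # re-insert as most recently used
        return value

    def _evict(self):
        victim_tag, (victim_value, modified) = self.blocks.popitem(last=False)
        if modified:
            self.lower_level[victim_tag] = victim_value   # write back modified data
        # an unmodified victim is simply abandoned
```

In a 2-way instance, writing one block, touching another, and then requesting a third causes the written block to be evicted and its modified data to be copied to the lower level, whereas an unmodified victim leaves the lower level untouched.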
As multiprocessor systems have grown in size and complexity, there has been an evolution in the memory hierarchy toward the computer system topology known as non-uniform memory access (NUMA), which addresses many of the limitations of SMP computer systems at the expense of some additional complexity. A typical NUMA computer system includes a number of interconnected nodes that each have one or more processors and a local “system” memory. Such computer systems are said to have a non-uniform memory access because each processor has lower access latency with respect to data stored in the system memory at its local node than with respect to data stored in the system memory at a remote node.
In addition to non-uniform main (system) memory, multiprocessor systems can also employ a non-uniform cache architecture (NUCA). NUCA systems are becoming more prevalent as improvements in silicon technology allow increasingly larger amounts of cache and multiple processors to be incorporated into a single integrated circuit (IC) chip. In a NUCA scheme, the overall cache structure is distributed among many smaller cache banks or ways scattered on the IC chip. A cache block mapping function can spread a cache set across multiple banks. This arrangement will result in two processors on the chip having different latencies to different ways of the same set, and the latency of accessing a cache line from a remote cache way can be significantly higher than the latency of accessing it from a way that is closer to the processor. Thus, an L1 or L2 cache access may have considerably different latencies depending on the location of the bank holding the way where the requested value resides.
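The dependence of latency on bank location can be illustrated with a toy model. The grid coordinates, the Manhattan-distance hop count, and the cycle counts below are invented for illustration and do not describe any particular chip.

```python
def access_latency(cpu_pos, bank_pos, bank_cycles=4, hop_cycles=2):
    """Cycles for a processor at cpu_pos to reach a bank at bank_pos.

    Positions are (row, column) pairs on an assumed on-chip grid.
    """
    hops = abs(cpu_pos[0] - bank_pos[0]) + abs(cpu_pos[1] - bank_pos[1])
    return bank_cycles + hop_cycles * hops


# Four ways of one set, each placed in a different bank (hypothetical layout).
way_banks = {0: (0, 0), 1: (0, 3), 2: (3, 0), 3: (3, 3)}
cpu_a, cpu_b = (0, 0), (3, 3)

latencies_a = {way: access_latency(cpu_a, pos) for way, pos in way_banks.items()}
latencies_b = {way: access_latency(cpu_b, pos) for way, pos in way_banks.items()}
```

Under these assumptions, way 0 is the fastest way for the first processor (4 cycles) and the slowest for the second (16 cycles), while way 3 is the reverse, so the same set presents a different latency profile to each processor.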
At any given moment, if a memory block is stored in a NUCA cache, it can only be located in one of the cache ways in a set. Throughout program execution, the cached value (program instruction or operand data) may move closer to the processor that accesses it more often due to natural cache usage and eviction. There is, however, a problem in the design of such multiprocessor systems wherein multiple processors share a NUCA cache. The value may move back and forth between horizontal cache banks of the two (or more) processors. This situation can result in a thrashing effect when there is a high rate of usage of that memory block by both processors, leading to inefficiencies and bottlenecks in overall processing throughput. It would, therefore, be desirable to devise an improved method of managing a distributed cache structure which mitigates or removes unwanted horizontal thrashing while retaining the benefits of a NUCA cache.
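The thrashing effect can be made concrete with a toy model in which a block is pulled into the bank nearest to whichever processor accesses it. This migrate-on-every-access policy, the function name `count_migrations`, and the bank names are assumptions made for illustration; the text above does not specify a particular migration mechanism.

```python
def count_migrations(accesses, home_bank, start_bank):
    """Count how often a block changes banks under a migrate-on-access policy.

    accesses   -- sequence of processor ids, in access order
    home_bank  -- maps each processor id to the bank closest to it
    start_bank -- bank that initially holds the block
    """
    bank = start_bank
    migrations = 0
    for cpu in accesses:
        target = home_bank[cpu]
        if target != bank:       # block sits in a remote bank: pull it closer
            bank = target
            migrations += 1
    return migrations


home_bank = {"P0": "bank0", "P1": "bank1"}
alternating = ["P0", "P1"] * 4         # both processors use the block heavily
clustered = ["P0"] * 4 + ["P1"] * 4    # same access count, grouped by processor
```

Starting with the block in `bank0`, the alternating pattern moves the block 7 times in 8 accesses, whereas the clustered pattern moves it only once. The first case is the horizontal thrashing that an improved management method would seek to mitigate.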