1. Field of the Invention
The present invention generally relates to computing systems. More specifically, the present invention relates to flushing (purging) a cache memory in a processor node. The present invention can be used advantageously to flush a cache memory in a processor node before the node is removed from a computing system.
2. Description of the Related Art
Early computing systems were single processor systems in which the processor used a single level of memory. In the 1960s, for example, the memory was typically built from magnetic cores and was very expensive. There was no “memory hierarchy” beyond that single level of memory and, of course, punch cards, disks, tapes, and the like.
As memory evolved, a range of price/performance memory options became available, leading to use of cache memories in the 1960s. The processor often had to wait for data from the memory system, and use of cache memories reduced the average wait for data by keeping recently used data in faster but more expensive memory. For example, fetching data from a cache memory might take ten processor cycles whereas fetching data from a higher level memory might take 100 processor cycles.
Modern computing systems can have many levels of cache in a memory hierarchy. For example, a computing system may have four levels of cache (Levels 1-4), with Level 5 (L5) being what is often called “main memory”, that is, the largest capacity memory (and, typically, the slowest) in the computing system that is directly addressable by the processor. In the example, Level 1 (L1) is the lowest level in the memory hierarchy, Level 2 (L2) is higher than L1, and so on. Some modern processor chips contain three levels of cache (L1, L2, and L3) on the processor chip itself. Furthermore, large computing systems are made up of multiple nodes, each node having one or more processors.
A particular level of cache can hold a relatively small amount of data compared to higher level caches or main memory. When a processor requests data (a first cache line) that is not in the particular level of cache, the first cache line must be retrieved from a higher level in the memory hierarchy and written into the particular level of cache, replacing a second cache line already there. If the second cache line has been modified, so that the particular level of cache holds the most recently updated version of its contents, the second cache line must be written to a higher level in the memory hierarchy before the first cache line can be written into the particular level of cache. It will be understood that modern computer systems often implement a castout buffer to hold evicted cache lines. That is, an evicted cache line can be written to the castout buffer rather than physically to the higher level in the memory hierarchy. In such a computer system, the first cache line can be written into the particular level of cache after the second cache line has been written to the castout buffer. If the second cache line contains unmodified data, the first cache line can simply be written, replacing the second cache line. The cache management system must therefore keep track of which cache lines have been modified.
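The replacement sequence described above can be sketched in a few lines of Python. This is a minimal illustration only, not the cache design of any particular processor: the class name `CacheLevel`, its methods, the use of line-granular integer addresses, and the simplification that each address maps to exactly one slot are all assumptions made for the example.

```python
from collections import deque


class CacheLevel:
    """Toy cache level: one slot per index, plus a castout buffer (hypothetical model)."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # each slot: dict(addr, data, modified) or None
        self.castout = deque()           # evicted modified lines awaiting write-back

    def fill(self, addr, data):
        """Install the line for addr, evicting the current occupant of its slot."""
        index = addr % len(self.slots)
        victim = self.slots[index]
        if victim is not None and victim["modified"]:
            # A modified victim holds the newest copy, so it must be preserved.
            self.castout.append((victim["addr"], victim["data"]))
        # An unmodified victim is simply overwritten.
        self.slots[index] = {"addr": addr, "data": data, "modified": False}

    def write(self, addr, data):
        """Modify a line that is already cached and record that it is modified."""
        line = self.slots[addr % len(self.slots)]
        if line is None or line["addr"] != addr:
            raise KeyError(f"line {addr} is not cached")
        line["data"] = data
        line["modified"] = True

    def drain_castouts(self, higher_level):
        """Write buffered castouts to the higher level (modeled here as a dict)."""
        while self.castout:
            addr, data = self.castout.popleft()
            higher_level[addr] = data
```

In this sketch the new line is installed as soon as the victim has been placed in the castout buffer; the write to the higher level happens later, when `drain_castouts` is called.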
Modern cache designs incorporate associativity into caches. An associative cache divides a cache array into a number of sets (e.g., sets zero to N), each set having a number of classes (e.g., classes zero to three). In the example of a four-way associative cache, a particular cache line fetched by a particular address will be cached in only one set, but can be cached in any one of the four classes within the set. The cache controller must determine which one of the four (in the example) classes is to be replaced when a new cache line is to be written into the cache. Several schemes have been used for this purpose. A simple scheme is round robin replacement. That is, for a given set in a four-way associative cache, classes zero, one, two, and three are replaced in that order, with class zero again being replaced after class three has been replaced. A second scheme is an LRU (Least Recently Used) algorithm. LRU replaces the class in an addressed set that was referenced longest ago, on the expectation that classes that have been recently used are more likely to be used again. A third scheme is random replacement, where classes are picked “at random” (actually pseudo random, since a random pattern generator implemented on the chip will eventually repeat the patterns used in picking the class). Random replacement is commonly used on large caches in modern computing systems.
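The three class selection schemes can be compared with a short Python sketch that holds the replacement state for a single four-class set. The class names `RoundRobin`, `LRU`, and `PseudoRandom` and their methods are hypothetical, and Python's software generator merely stands in for an on-chip random pattern generator, whose width and period are not specified here.

```python
import random

NUM_CLASSES = 4  # four-way associative, as in the example above


class RoundRobin:
    """Replace classes 0, 1, 2, 3, then 0 again."""

    def __init__(self):
        self.next_class = 0

    def victim(self):
        chosen = self.next_class
        self.next_class = (self.next_class + 1) % NUM_CLASSES
        return chosen


class LRU:
    """Replace the class that was referenced longest ago."""

    def __init__(self):
        self.order = list(range(NUM_CLASSES))  # front = least recently used

    def touch(self, cls):
        """Record a reference to cls, making it the most recently used."""
        self.order.remove(cls)
        self.order.append(cls)

    def victim(self):
        chosen = self.order[0]
        self.touch(chosen)  # the newly written line is now the most recently used
        return chosen


class PseudoRandom:
    """Pick a class pseudo randomly (assumed stand-in for a hardware generator)."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def victim(self):
        return self.rng.randrange(NUM_CLASSES)
```

A real cache would keep one such piece of state per set; only the round robin and LRU sequences are predictable from outside the cache.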
Modern large computing systems are constructed with more than one node, each node having one or more processors. Each node also contains a portion of the total main memory of the computing system. For example, if a computing system has four nodes and a total main memory of 256 GB (gigabytes), each node may have a 64 GB addressing range assigned to it, and have 64 GB of main memory built on the node. It will be understood that various computer manufacturers may use the term “cell” rather than “node”. The present invention is not limited to NUMA (Non Uniform Memory Access) architectures and is equally suitable for SMP (Symmetric Multiprocessor) architectures. A cache on a first node is likely to contain one or more cache lines that are in an address space assigned to a second node. For example, a processor on a first node that owns the first 64 GB of the total computing system address space fetches and modifies data that is in the 64 GB address range of a second node. One or more cache lines are sent from the second node to the first node and cached there, and perhaps modified by the first node in the course of processing. A desirable feature of large computing systems is that a particular node can be removed while other nodes remain operational. Prior to removing the particular node, modified cache lines must be flushed (purged) from the particular node and returned to the node owning the address of the modified cache lines.
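The four-node, 256 GB example can be made concrete with a small Python sketch that maps an address to the node owning it and groups the modified lines held by a departing node by owner. The function names, the zero-based node numbering, the assumption that the four 64 GB ranges are contiguous and assigned in node order, and the reading of 1 GB as 2^30 bytes are all choices made for illustration.

```python
GB = 1 << 30            # assumption: 1 GB taken as 2**30 bytes
NODE_RANGE = 64 * GB    # addressing range assigned to each node
NUM_NODES = 4           # 4 nodes x 64 GB = 256 GB total


def home_node(address):
    """Return the (zero-based) node that owns the given address."""
    if not 0 <= address < NUM_NODES * NODE_RANGE:
        raise ValueError("address outside the 256 GB system address space")
    return address // NODE_RANGE


def modified_lines_by_owner(cached_lines):
    """Group modified cache lines by owning node.

    cached_lines is an iterable of (address, modified) pairs describing the
    cache contents of the node about to be removed.
    """
    by_owner = {}
    for address, modified in cached_lines:
        if modified:  # unmodified lines need not be returned
            by_owner.setdefault(home_node(address), []).append(address)
    return by_owner
```

For instance, a modified line at address 70 GB held in a cache on node 0 would be returned to node 1 before node 0 is removed.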
Although flushing a cache is easy in principle, actual implementation can be difficult. For example, an addressing mechanism in the cache controller associated with the cache could simply “walk through” the addresses of the cache, including the class selection. However, one or more logic blocks would have to be added to the addressing mechanism, which is typically not acceptable, since cache access is often a critical path that determines the clock frequency at which the processor can be run. With a simple “round robin” replacement scheme, a set can instead be flushed by fetching a sufficient number of addresses known to map to the set (and not currently in the set). For example, if a cache is four-way associative, having four classes in each set, and the replacement algorithm is “round robin”, four addresses mapped to the set will suffice to flush the set. Cache lines addressed by these four addresses must not already be in the cache. Similarly, with an LRU replacement algorithm in the example, a set can be flushed with four fetches using suitable addresses that map to the set, because, upon replacement, a particular class in the set becomes the “most” recently used. A second such fetch will replace the class that is then least recently used, and so on, until the fourth (in the example) fetch completes the flushing of the set. Cache lines addressed by these suitable addresses must not already be in the cache. A problem arises, however, when the commonly used random replacement scheme is used. Depending on how many bits are used in the random pattern generator, which determines how often patterns repeat, a very large number of fetches might have to be performed to guarantee that all classes in a set have been flushed.
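The difference between the schemes can be seen in a small Python simulation that counts how many fetches to one set are needed before every class has been replaced at least once. This is only a sketch under stated assumptions: the helper names are hypothetical, every fetch is assumed to miss in the cache, and the random case draws classes uniformly from Python's software generator rather than modeling a hardware pattern generator of a particular width.

```python
import random


def fetches_to_flush(pick_victim, num_classes=4, limit=10_000):
    """Count fetches until every class in one set has been replaced at least once."""
    replaced = set()
    fetches = 0
    while len(replaced) < num_classes:
        if fetches >= limit:
            raise RuntimeError("set not fully flushed within the fetch limit")
        replaced.add(pick_victim())  # each fetch misses and replaces one class
        fetches += 1
    return fetches


def round_robin_picker(num_classes=4):
    """Victim picker that cycles through the classes in order."""
    state = {"next": 0}

    def pick():
        chosen = state["next"]
        state["next"] = (chosen + 1) % num_classes
        return chosen

    return pick


def random_picker(seed, num_classes=4):
    """Victim picker that chooses a class pseudo randomly (assumed uniform)."""
    rng = random.Random(seed)
    return lambda: rng.randrange(num_classes)


if __name__ == "__main__":
    print("round robin:", fetches_to_flush(round_robin_picker()))
    counts = [fetches_to_flush(random_picker(seed)) for seed in range(200)]
    print("random: min", min(counts), "max", max(counts))
```

Round robin always needs exactly four fetches in this model. Under the uniform assumption, random replacement needs about 8.3 fetches on average for four classes (4 × (1 + 1/2 + 1/3 + 1/4)), and no fixed number of fetches guarantees that every class has been replaced, which is the difficulty described above.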
Therefore, there is a need for a method and apparatus that allow flushing of a cache in a node in a multinode computing system where the class replacement scheme is a random replacement scheme. In particular, the flushing capability should add no delay penalty to the normal addressing mechanism.