1. Field of the Invention
The present invention generally relates to computer systems and, more particularly, to a method of controlling evictions from a cache used by a computer processor.
2. Description of the Related Art
The basic structure of a conventional multi-processor computer system 10 is shown in FIG. 1. Computer system 10 has several processing units, two of which, 12a and 12b, are depicted, which are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (such as random-access memory or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent storage device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices by various means, including a generalized interconnect or bus 20. Computer system 10 may have many additional components which are not shown, such as serial and parallel ports for connection to modems or printers. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter might be used to control a video-display monitor, a memory controller can be used to access memory 16, etc. The computer can also have more than two processing units.
In a symmetric multi-processor (SMP) computer, all of the processing units are generally identical; that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. A typical architecture is shown in FIG. 1. A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. An exemplary processing unit includes the PowerPC™ 604-series processor marketed by International Business Machines Corporation. The processing unit can also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high-speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from memory 16. These caches are referred to as "on-board" when they are integrally packaged with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory.
A processing unit 12a can include additional caches, such as cache 30, which is referred to as a level 2 (L2) cache since it supports the on-board (level 1) caches 24 and 26. In other words, cache 30 acts as an intermediary between memory 16 and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but with a longer access penalty. For example, cache 30 may be a chip having a storage capacity of 256 or 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 30 is connected to bus 20, and all loading of information from memory 16 into processor core 22 must come through cache 30. Although FIG. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided where there are many levels of serially connected caches.
A cache has many "blocks" which individually store the various instructions and data values. The blocks in any cache are divided into groups of blocks called "sets." A set is the collection of blocks that a given memory block can reside in. For any given memory block, there is a unique set in the cache that the block can be mapped into, according to preset mapping functions. The number of blocks in a set is referred to as the associativity of the cache (e.g. 2-way set associative means that, for any given memory block, there are two blocks in the cache that the memory block can be mapped into). However, several different blocks in main memory can be mapped to any given set.
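By way of illustration only, the mapping of memory blocks into sets can be sketched as follows. The text above refers only to "preset mapping functions"; the modulo mapping, the block size, and the number of sets used here are assumptions chosen for the example, and the names set_index and tag are hypothetical.

```python
# Hypothetical parameters, chosen only for illustration.
BLOCK_SIZE = 64      # bytes per cache block (assumed)
NUM_SETS = 128       # number of sets in the cache (assumed)
ASSOCIATIVITY = 2    # blocks per set, i.e. 2-way set associative


def set_index(address: int) -> int:
    """Return the single set that a memory address maps into."""
    block_number = address // BLOCK_SIZE
    return block_number % NUM_SETS


def tag(address: int) -> int:
    """Return the tag that tells apart memory blocks sharing one set."""
    return address // (BLOCK_SIZE * NUM_SETS)


# Two different memory blocks that compete for the same set.
a = 0x0000
b = a + BLOCK_SIZE * NUM_SETS
same_set = set_index(a) == set_index(b)
```

Under these assumptions, addresses a and b select the same set but carry different tags, so both can reside in a 2-way set at once, while a third block mapping to that set would force an eviction.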
When all of the blocks in a set for a given cache are full and that cache receives a request, whether a "read" or "write," to a memory location that maps into the full set, the cache must "evict" one of the blocks currently in the set. The cache chooses the block to be evicted by one of a number of means known to those skilled in the art (least recently used (LRU), random, pseudo-LRU, etc.). If the data in the chosen block is modified, that data is written to the next lower level in the memory hierarchy, which may be another cache (in the case of the L1 or on-board cache) or main memory (in the case of an L2 cache, as depicted in the two-level architecture of FIG. 1). However, if the data in the chosen block is not modified, the block is simply abandoned and not written to the next lower level in the hierarchy. This process of removing a block from one level of the hierarchy is known as an "eviction." At the end of this process, the cache no longer holds a copy of the evicted block.
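The eviction behavior described above can be sketched, under stated assumptions, as a software model of a single set. The model uses true LRU replacement, which is only one of the selection means mentioned; the class name CacheSet and its fields are hypothetical and do not correspond to any element of FIG. 1.

```python
from collections import OrderedDict


class CacheSet:
    """One set of an N-way cache with LRU replacement (illustrative model)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> modified flag, oldest first
        self.written_back = []       # tags written to the next lower level

    def access(self, tag, write=False):
        """Read or write `tag`; return the evicted tag, or None."""
        evicted = None
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # mark as most recently used
        else:
            if len(self.blocks) == self.ways:  # set is full: evict the LRU block
                evicted, modified = self.blocks.popitem(last=False)
                if modified:
                    # Modified data must be written to the next lower level.
                    self.written_back.append(evicted)
                # An unmodified block is simply abandoned.
            self.blocks[tag] = False
        if write:
            self.blocks[tag] = True
        return evicted


s = CacheSet(ways=2)
s.access("A", write=True)  # A is loaded and modified
s.access("B")              # B is loaded and stays unmodified
first = s.access("C")      # set is full: A is evicted and written back
second = s.access("D")     # B is evicted and abandoned
```

In this run only block A appears in written_back, since B was never modified.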
Another aspect of symmetric multiprocessors which is relevant to the invention relates to the necessity of providing a means of synchronizing the actions of the various processors in a system to allow cooperation among processors working on a task. To achieve this, most modern processors include in their instruction sets explicit instructions to handle synchronization. In particular, the PowerPC™ instruction set provides two instructions known as "larx" and "stcx." These instructions come in two forms: "lwarx" and "stwcx" for 32-bit implementations and "ldarx" and "stdcx" for 64-bit implementations. Henceforth, the terms "lwarx" and "stwcx" are used to denote instructions for either implementation (the ldarx and stdcx instructions have essentially the same semantics, with the exception that ldarx and stdcx operate on 8-byte quantities and lwarx and stwcx operate on 4-byte quantities). These instructions serve to build synchronization primitives.
The lwarx instruction loads an aligned 4-byte word of memory into a register in the processor. In addition, lwarx places a "reservation" on the block of memory that contains the word of memory accessed. A reservation contains the address of the block and a flag. This flag is made active, and the address of the block is loaded when a lwarx instruction successfully reads the word of memory referenced. If a reservation is valid (the flag is active) the processor and the memory hierarchy are obligated to cooperatively monitor the entire system for any operation that may write to the block for which the reservation exists. If such a write occurs, the flag in the reservation is reset. The reservation flag is used to control the behavior of the stwcx instruction.
The stwcx instruction is the counterpart to lwarx. The stwcx instruction first determines if the reservation flag is valid. If so, the stwcx instruction performs a store to the 4-byte word of memory specified, sets a condition code register to indicate that the store succeeded, and resets the reservation flag. If, on the other hand, the reservation flag in the reservation is not valid, the stwcx instruction does not perform a store to memory and sets a condition code register indicating that the store failed. The stwcx instruction is often referred to as a "conditional store" due to the fact that the store is conditional on the status of the reservation flag.
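For purposes of illustration, the reservation semantics of lwarx and stwcx described above can be modeled in software as follows. This is a toy model of a single processor's view, not a description of any hardware; the 64-byte reservation granule and the names ReservationModel and snoop_write are assumptions introduced for the example.

```python
class ReservationModel:
    """Toy single-processor view of lwarx/stwcx (not real hardware)."""

    GRANULE = 64  # assumed size in bytes of the reserved block

    def __init__(self):
        self.memory = {}       # word address -> value
        self.res_addr = None   # block address held by the reservation
        self.res_flag = False  # True while the reservation is valid

    def _block(self, addr):
        return addr // self.GRANULE

    def lwarx(self, addr):
        """Load a word and place a reservation on the block containing it."""
        self.res_addr = self._block(addr)
        self.res_flag = True
        return self.memory.get(addr, 0)

    def snoop_write(self, addr, value):
        """A write by another processor, as seen by the snooping logic."""
        self.memory[addr] = value
        if self.res_flag and self._block(addr) == self.res_addr:
            self.res_flag = False  # the reservation is reset

    def stwcx(self, addr, value):
        """Conditional store: return True on success, False on failure."""
        if not self.res_flag:
            return False           # no store to memory is performed
        self.memory[addr] = value
        self.res_flag = False      # the reservation flag is reset
        return True


m = ReservationModel()
m.lwarx(0x100)
ok_first = m.stwcx(0x100, 1)   # succeeds: the reservation is still valid
ok_second = m.stwcx(0x100, 2)  # fails: the flag was reset by the first store
m.lwarx(0x100)
m.snoop_write(0x104, 9)        # another processor writes within the same block
ok_third = m.stwcx(0x100, 3)   # fails: the reservation was reset by the write
```

The boolean returned by stwcx stands in for the condition code register mentioned above. Note that the write to 0x104 resets the reservation even though it targets a different word, because the reservation covers the whole block.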
The general concept underlying the lwarx/stwcx instruction sequence is to allow a processor to read a memory location, modify the location in some way, and to store the new value to memory while ensuring that no other processor has altered the memory location from the point in time when the lwarx was executed until the stwcx completes. Such a sequence is usually referred to as an "atomic read-modify-write" sequence because the processor was able to read the location, modify it, and then write the new value without interruption by another processor writing to the location. The lwarx/stwcx sequence of operations does not occur as one uninterruptible sequence, but rather, the fact that the processor is able to execute a lwarx and then later successfully complete the stwcx assures the programmer that the read/modify/write sequence did, in fact, occur as if it were atomic. This atomic property of a lwarx/stwcx sequence can be used to implement a number of synchronization primitives well-known to those skilled in the art.
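One such primitive, an atomic add, can be sketched as a retry loop around the lwarx/stwcx pair. The sketch below is a software model under stated assumptions: the class Word, the function atomic_add, and the interfere callback (which stands in for a write by another processor between the load and the conditional store) are all hypothetical names introduced for illustration.

```python
class Word:
    """Toy memory word with one reservation flag (hypothetical model)."""

    def __init__(self, value=0):
        self.value = value
        self.reserved = False

    def lwarx(self):
        self.reserved = True
        return self.value

    def stwcx(self, new_value):
        if not self.reserved:
            return False
        self.value = new_value
        self.reserved = False
        return True

    def remote_write(self, new_value):
        """Write by another processor; resets the reservation."""
        self.value = new_value
        self.reserved = False


def atomic_add(word, delta, interfere=None):
    """Retry lwarx/stwcx until the conditional store succeeds.

    `interfere`, if given, is called with the attempt number between the
    load and the store. Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        old = word.lwarx()
        if interfere is not None:
            interfere(attempts)
        if word.stwcx(old + delta):
            return attempts


counter = Word(10)


def other_processor(attempt):
    if attempt == 1:               # disturb only the first attempt
        counter.remote_write(100)


attempts = atomic_add(counter, 1, other_processor)
```

The first attempt fails because the intervening write resets the reservation; the second attempt reloads the new value and stores 101, so the increment is applied to the value actually in memory rather than to the stale one.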
FIG. 1 depicts two reservation units 32 and 34 which are associated, respectively, with caches 26 and 30. These units contain the reservation, both the address and the flag, and they each "snoop" (monitor) their respective buses 36 and 38 for any write operation within the reservation granule address, and invalidate the associated reservation flag when such an operation is detected (if a reservation-killing operation is detected by a lower-level cache, it is sent up to the higher-level caches). As such, they monitor the buses and respond to bus transactions in a manner similar to the caches themselves. The reservation unit addresses and flags are usually set in one of two general ways. If a processor attempts to issue a lwarx to a memory location whose block is not present in any cache of its memory hierarchy, a read operation is propagated from the processor at the top of the hierarchy through each of the caches in the hierarchy and finally out on the generalized interconnect 20 to be serviced. These read operations are tagged with a special indicator to inform the reservation units in the caches that the read is for a lwarx and that the reservation units should set the address and flag. Alternatively, a processor can issue a lwarx to a memory location in a block already present in the L1 cache 26. This situation is known as an "lwarx hit." In this case, the processor's reservation unit 32 will set its reservation address and flag and will issue a special bus operation known as a lwarx reserve (hereafter RESERVE) on the connection 36 between the L1 and L2 caches. The L2 cache will receive the RESERVE message, which includes the address of the reservation, and will set its reservation address and flag in its reservation unit 34. If other cache levels are present (not shown in FIG. 1), the L2 cache will forward the RESERVE message on to any lower caches in the memory hierarchy, which will repeat the actions taken by the L2 cache, at which point, all the reservation units will be properly set. The process of propagating the RESERVE messages down through all cache levels can take an arbitrary amount of time, in general, depending on availability of the inter-cache connections (e.g., 36 and 38) and the specific details of the particular implementation.
There is one other way that the reservation units can be set. This situation occurs when a block has been partially, but not completely, evicted from a cache hierarchy. For example, assume that the processor core 22 executes a lwarx instruction to an address that is in a block not present in the L1 cache, but is present in the L2 cache. In this case, processor core 22 will issue a read that is marked as a read for a lwarx to the L2 cache. The L2 cache will determine that it has a copy of the block and return this block to the processor core directly. Once the block is returned to the processor core, the processor core updates its reservation address and flag in reservation unit 32. The L2 cache will also set its reservation and send a RESERVE bus operation to any lower level caches to inform them of the reservation. This scenario is merely a combination of the two cases described earlier. In general, a read from the processor core with the lwarx indication propagates down the hierarchy, setting reservation units until it encounters a cache (potentially the L1) that has a copy of the block which satisfies the read. That cache then propagates a RESERVE bus operation down the remainder of the hierarchy to set the remaining reservation units. In this manner, all of the reservation units in the hierarchy are loaded as a result of a lwarx instruction with the proper reservation information and can begin snooping for transactions that write to the reservation granule. This allows the reservation units to reset the reservation flags and prevent a stwcx instruction from completing when the memory location for the reservation could, potentially, have been modified.
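The general rule stated above can be sketched in software as follows. This is a simplified model offered only as an illustration: it tracks reservation units and the operations each level receives, and it does not model the filling of caches, timing, or snooping. The names CacheLevel, ReservationUnit, READ_LWARX, and the three-level hierarchy are assumptions introduced for the example.

```python
class ReservationUnit:
    def __init__(self):
        self.address = None
        self.flag = False

    def set(self, block):
        self.address = block
        self.flag = True


class CacheLevel:
    def __init__(self, name, blocks=()):
        self.name = name
        self.blocks = set(blocks)  # block addresses currently held
        self.reservation = ReservationUnit()


def lwarx(hierarchy, block):
    """Set every reservation unit for `block`.

    `hierarchy` is ordered from the L1 cache downward. Returns a log of
    (operation, receiver) pairs showing what each level received.
    """
    log = []
    hit = False
    for level in hierarchy:
        if not hit:
            # The read tagged as a lwarx read travels down until it hits.
            log.append(("READ_LWARX", level.name))
            hit = block in level.blocks
        else:
            # Below the hit, only a RESERVE operation is forwarded.
            log.append(("RESERVE", level.name))
        level.reservation.set(block)
    if not hit:
        # No cache holds the block: the read goes out on the interconnect.
        log.append(("READ_LWARX", "interconnect"))
    return log


l1 = CacheLevel("L1")
l2 = CacheLevel("L2", blocks={0x40})
l3 = CacheLevel("L3", blocks={0x40})
log = lwarx([l1, l2, l3], 0x40)
miss_log = lwarx([CacheLevel("L1"), CacheLevel("L2")], 0x80)
```

In the first call the read misses in L1 and hits in L2, so L3 receives only a RESERVE operation; in the second call no cache holds the block and the tagged read is propagated to the interconnect. In both cases every reservation unit in the hierarchy ends up set.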
One problem with prior-art SMP systems relates to the eviction of a block having a data value which is the subject of a lwarx reservation. Nearly every lwarx instruction is eventually followed by a stwcx instruction (there is no need to place a reservation on a block of memory unless the conditional store operation is to be used later for an atomic read-modify-write sequence). However, a relatively large amount of time can pass between execution of a lwarx instruction and an associated stwcx instruction, for various reasons. During the interim, it is possible that a memory block which has been loaded into a given cache will be evicted as a result of other instructions executed by the processor. This outcome would be undesirable since the memory block would eventually need to be loaded into the cache(s) again for execution of the stwcx instruction, creating an unnecessary delay. It is even possible for a reserved block to be evicted, loaded again, and evicted again (several times) before execution of the stwcx. This inefficiency imposes a severe performance degradation and is a limitation of the prior-art systems. It would, therefore, be desirable to devise a more efficient method of implementing lwarx/stwcx semantics, so as to speed up processing of those instructions. It would be particularly advantageous if the method were able to prevent unnecessary evictions of a reserved memory block.