1. Technical Field
The present invention generally relates to computer systems, and more specifically to an improved method of deallocating cache entries from an upper level cache used by a processor core of a computer system. In particular, the present invention makes more efficient use of a cache hierarchy by managing cache entries that are written to lower levels of cache after they have been modified.
2. Description of the Related Art
The basic structure of a conventional computer system includes one or more processing units connected to various input/output devices for the user interface (such as a display monitor, keyboard and graphical pointing device), a permanent memory device (such as a hard disk, or a floppy diskette) for storing the computer's operating system and user programs, and a temporary memory device (such as random access memory or RAM) that is used by the processor(s) in carrying out program instructions. The evolution of computer processor architectures has transitioned from the now widely-accepted reduced instruction set computing (RISC) configurations, to so-called superscalar computer architectures, wherein multiple and concurrently operable execution units within the processor are integrated through a plurality of registers and control mechanisms.
The objective of superscalar architecture is to employ parallelism to maximize or substantially increase the number of program instructions (or “micro-operations”) simultaneously processed by the multiple execution units during each interval of time (processor cycle), while ensuring that the order of instruction execution as defined by the programmer is reflected in the output. For example, the control mechanism must manage dependencies among the data being concurrently processed by the multiple execution units, and the control mechanism must ensure the integrity of data that may be operated on by multiple processes on multiple processors and potentially contained in multiple cache units. It is desirable to satisfy these objectives consistent with the further commercial objectives of increasing processing throughput, minimizing electronic device area and reducing complexity.
Both multiprocessor and uniprocessor systems usually use multi-level cache memories in which each higher level is typically smaller and has a shorter access time. The cache accessed by the processor, which in present systems is usually contained within the processor component, is typically the smallest cache. As such, the cache entries available at the highest level cache are frequently reallocated. This is due to requests for new data and a need for space to store that data within the higher levels of cache memory. As new reads are performed up the levels of cache, locations must be deallocated or “freed” in order to make room for the new data. This is known as cache location “victimization” and the selected target to deallocate is known as the “victim”.
If a cache entry is being deallocated and values within the entry have been modified, the cache entry is considered “dirty” and must not only be deallocated, but must also be written or “flushed” to the lower levels in the memory hierarchy in order to maintain coherency. In a multiprocessor system, the cache must be coherent not only with the lower levels but also with the caches of other processors, since cache entries overlapping the same address may be loaded into more than one processor's high level cache. This raises the complexity of a system that maintains the coherence of the entire memory hierarchy and impacts the design and operation of every level of the memory hierarchy.
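By way of illustration only, the distinction between deallocating a clean entry and flushing a dirty one can be sketched as follows. This is a minimal model under stated assumptions, not a description of any particular hardware: the `Line` class, the `deallocate` function and the use of dictionaries for the two cache levels are hypothetical names introduced for this example.

```python
class Line:
    """A cache entry: its data value and a flag recording whether it was modified."""
    def __init__(self, value, dirty=False):
        self.value = value
        self.dirty = dirty


def deallocate(upper, lower, address):
    """Remove `address` from `upper`; if the entry is dirty, flush it to `lower` first.

    Returns True when a flush (write-back) was required, False otherwise.
    Both levels are modeled as plain dicts keyed by address (an assumption).
    """
    line = upper.pop(address)
    if line.dirty:
        lower[address] = line.value  # write-back keeps the lower level coherent
        return True
    return False


# One clean entry and one modified entry in the upper level.
upper = {0x100: Line("old", dirty=False), 0x140: Line("new", dirty=True)}
lower = {0x100: "old", 0x140: "stale"}

flushed_clean = deallocate(upper, lower, 0x100)  # clean: simply dropped
flushed_dirty = deallocate(upper, lower, 0x140)  # dirty: written to the lower level
```

In this sketch only the dirty entry causes a write to the lower level; the clean entry is discarded because the lower level already holds an identical copy.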
A typical cache memory hierarchy contains a least-recently-used array (LRU). This array allows the cache controller to determine which entry to deallocate when a request for a new allocation is made. It has been found in the past that an efficient method for selecting the deallocated target is to deallocate the least recently used entry in the cache, based on the assumption that it has the lowest probability of being required again before another entry that has been more recently used. The deallocated entry must be written to a lower level of the memory hierarchy if it is “dirty”. A typical cache hierarchy will then place this entry in the next lower level of cache memory and flag it as the most recently used entry (since it has just been accessed). If the new entry for which space is being allocated in the higher level has already been read from the next lower level cache (a desirable sequence, since the read allocation is memory that is needed immediately or at least predictably soon by the core), then the deallocated entry, after it has been flushed, will have a position that is more recently used than the new read allocation. Under certain circumstances this may not be a desirable condition. Infrequently used memory locations may end up being preserved in intermediate levels of cache at the expense of deallocating and reallocating entries that are more frequently used.
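The ordering problem described above can be illustrated with a small model. The sketch below assumes a fully associative lower level cache with a strict LRU ordering; the `LruCache` class, its `touch` method and the single-letter addresses are hypothetical and serve only to make the sequence of events concrete.

```python
from collections import OrderedDict


class LruCache:
    """Fully associative cache model; the OrderedDict runs from LRU to MRU."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, address):
        """Access `address` and mark it most recently used.

        Returns the address evicted to make room, or None if nothing was evicted.
        """
        victim = None
        if address in self.lines:
            self.lines.move_to_end(address)
        else:
            if len(self.lines) >= self.capacity:
                victim, _ = self.lines.popitem(last=False)  # evict the LRU entry
            self.lines[address] = True
        return victim


l2 = LruCache(capacity=3)
for address in ("A", "B"):
    l2.touch(address)

l2.touch("N")            # the core's new read passes through the lower level first
evicted = l2.touch("V")  # the dirty victim "V" is then written and flagged as MRU
order = list(l2.lines)   # LRU to MRU: ["B", "N", "V"]
```

In this model the flushed victim `V` ends up more recently used than the line `N` that the core has just requested, so `N` would be deallocated from the lower level before `V`, which is the condition the passage identifies as potentially undesirable.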
In light of the foregoing, it would be desirable to provide a method of speeding up core processing by improving cache deallocation mechanisms, particularly with respect to the interaction of the mechanism with the cache hierarchy. It would be further advantageous if the method allowed a programmer to optimize various features of the deallocation mechanism.
It is therefore one object of the present invention to provide an improved processor for a computer system, having one or more caches in a memory hierarchy.
It is another object of the present invention to provide a computer system using such a processor, which also has one or more caches in the memory hierarchy.
It is yet another object of the present invention to provide a computer system and processor that make more efficient use of a cache hierarchy.
The foregoing objects are achieved in a method and apparatus for managing a multi-level memory hierarchy of a computer system, implementing the steps of receiving an allocation request in a lower level cache, determining that the allocation is for a castout from a higher level cache and, based on that determination, selecting a target cache line in the lower level. The method and apparatus may further select a target cache line for castout writes only from a subset of congruence classes in the lower level cache. The LRU of the lower level may be set to a fixed value for castout writes, and the value can be provided by programming a register. The method of deallocation from the higher level cache comprises receiving an allocation request, selecting a target cache line to cast out, storing the values from the castout target in a temporary location, reading the requested line from the lower level, updating an LRU in the lower level so that the write of the castout is not marked as most recently used, and finally writing the castout value to the lower level.
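The deallocation sequence summarized above may be sketched in software as follows. This is a simplified model and not the claimed apparatus: it represents a single congruence class of the lower level cache as a list ordered from least to most recently used, treats every upper level victim as a dirty castout, and uses a `castout_position` attribute to stand in for the programmable register. All class, function and variable names are assumptions made for this example.

```python
class CongruenceClass:
    """One set of the lower level cache; `order` runs from LRU (index 0) to MRU."""
    def __init__(self, ways, castout_position=0):
        self.ways = ways
        self.castout_position = castout_position  # models the programmable register
        self.order = []

    def _make_room(self):
        """Evict the LRU entry if the set is full; return it, or None."""
        if len(self.order) >= self.ways:
            return self.order.pop(0)
        return None

    def read(self, address):
        """Ordinary read allocation: the line becomes most recently used."""
        victim = None
        if address in self.order:
            self.order.remove(address)
        else:
            victim = self._make_room()
        self.order.append(address)
        return victim

    def castout_write(self, address):
        """Castout allocation: the line takes the fixed LRU position, not MRU."""
        victim = None
        if address in self.order:
            self.order.remove(address)
        else:
            victim = self._make_room()
        position = min(self.castout_position, len(self.order))
        self.order.insert(position, address)
        return victim


def allocate_in_upper(upper, upper_capacity, lower, requested):
    """Allocate `requested` in the upper level, casting out a victim if needed."""
    castout_buffer = None
    if len(upper) >= upper_capacity:
        castout_buffer = upper.pop(0)        # select the victim, hold it temporarily
    lower.read(requested)                    # read the requested line from below
    upper.append(requested)
    if castout_buffer is not None:
        lower.castout_write(castout_buffer)  # write the castout without marking it MRU
    return castout_buffer


lower = CongruenceClass(ways=4, castout_position=0)
lower.read("A")
lower.read("B")
upper = ["V"]
cast_out = allocate_in_upper(upper, 1, lower, "N")
```

After this sequence the lower level ordering is `["V", "A", "B", "N"]`, so the newly read line `N` remains the most recently used entry and the castout `V` occupies the position given by `castout_position`. The value 0 is chosen here only for illustration; a different register value would place castouts elsewhere in the ordering.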
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.