1. Field of the Invention
The present invention generally relates to computer systems, and more specifically to an agent and method for managing a cache memory in a computer system. In particular, the present invention makes more efficient use of a cache by allocating specific class sets to specific types of program data and instructions.
2. Description of Related Art
The basic structure of a conventional computer system includes one or more processing units connected to various input/output devices for the user interface (such as a display monitor, keyboard and graphical pointing device), a permanent memory device (such as a hard disk or a floppy diskette) for storing the computer's operating system and user programs, and a temporary memory device (such as random access memory or RAM) that is used by the processor(s) in carrying out program instructions. The evolution of computer processor architectures has transitioned from the now widely-accepted reduced instruction set computing (RISC) configurations, to so-called superscalar computer architectures, wherein multiple and concurrently operable execution units within the processor are integrated through a plurality of registers and control mechanisms.
The objective of superscalar architecture is to employ parallelism to maximize or substantially increase the number of program instructions (or "micro-operations") simultaneously processed by the multiple execution units during each interval of time (processor cycle), while ensuring that the order of instruction execution as defined by the programmer is reflected in the output. For example, the control mechanism must manage dependencies among the data being concurrently processed by the multiple execution units, and the control mechanism must ensure the integrity of data that may be operated on by multiple processes on multiple processors and potentially contained in multiple cache units. It is desirable to satisfy these objectives consistent with the further commercial objectives of increasing processing throughput, minimizing electronic device area and reducing complexity.
Both multiprocessor and uniprocessor systems usually use multi-level cache memories in which each higher level is typically smaller and has a shorter access time. The cache accessed by the processor, which in present systems is usually contained within the processor component, is typically the smallest cache.
Both operand data and instructions are cached, and data and instruction cache entries are typically loaded before they are needed by operation of prefetch units and branch prediction units. Groups of instructions associated with predicted execution paths, called "streams", can be detected and loaded into cache memory before their actual execution. Likewise, data patterns can be predicted by stride detection circuitry and loaded before operations requiring the data are executed.
Cache memories are typically organized in a matrix arrangement. One direction in the matrix corresponds to congruence classes and the other to equivalent sets within each congruence class. The congruence class partitioning divides the use of the cache with respect to information type; typically, a portion of the address field of a memory location is used to partition the distribution of values from memory across the congruence class sets. In this way, it is only necessary to examine the entries within a given class in order to determine memory conflicts or whether a particular value at a given address is present or absent from the cache.
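As an illustration of the conventional partitioning described above, the following minimal sketch splits a memory address into a tag, a congruence class index, and a byte offset. The line size, class count, and the name split_address are hypothetical values chosen for the example and are not taken from this specification.

```python
# Hypothetical cache geometry, assumed only for this illustration.
LINE_SIZE = 64      # bytes per cache line
NUM_CLASSES = 128   # number of congruence classes (must be a power of two)

OFFSET_BITS = LINE_SIZE.bit_length() - 1    # 6 bits select a byte in the line
INDEX_BITS = NUM_CLASSES.bit_length() - 1   # 7 bits select the congruence class


def split_address(addr):
    """Split an address into (tag, congruence class, byte offset)."""
    offset = addr & (LINE_SIZE - 1)
    cls = (addr >> OFFSET_BITS) & (NUM_CLASSES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, cls, offset


tag, cls, offset = split_address(0x12345)
print(tag, cls, offset)
```

Under these assumed sizes, any two addresses that differ by a multiple of LINE_SIZE * NUM_CLASSES bytes fall into the same congruence class and are distinguished only by their tags.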
By dividing the cache into congruence classes, the efficiency of access to the cache is improved, but memory organization within a computer system may not provide for an efficient distribution of reuse patterns in the cache if the typical method of using a portion of a memory address as a class selector is followed. Such a method tends to randomize the association of memory with congruence classes rather than to optimize them. Increasing associativity is one solution, but a fully associative cache, which would be the ideal extreme solution, is not efficient in that a separate tag comparison is required for every entry in the cache when an entry is identified. To do this simultaneously requires a hardware comparator for every entry in the cache. The advantage of a cache with reduced associativity is that a particular class selector can be used to reduce the number of locations for which a tag must be compared, and thus the number of comparators or comparison cycles required.
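The trade-off between associativity and tag comparisons can be sketched as follows. This is a simplified software model, assuming a hypothetical 512-entry cache and a hypothetical lookup function; real hardware performs the comparisons within a class in parallel, so the count here stands in for the number of comparators or comparison cycles.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for illustration)


def lookup(cache, addr):
    """Search one congruence class for addr.

    cache is a list of congruence classes, each a list of stored tags
    (None marks an empty entry). Returns (hit, tag_comparisons).
    """
    num_classes = len(cache)
    line = addr // LINE_SIZE
    cls = line % num_classes      # class selector from the address
    tag = line // num_classes     # remaining address bits form the tag
    comparisons = 0
    for stored in cache[cls]:
        comparisons += 1
        if stored == tag:
            return True, comparisons
    return False, comparisons


# The same 512 entries arranged two ways (both start empty).
four_way = [[None] * 4 for _ in range(128)]   # 128 classes x 4 sets
fully_associative = [[None] * 512]            # 1 class x 512 sets

print(lookup(four_way, 0x12345))
print(lookup(fully_associative, 0x12345))
```

In this model a miss costs 4 tag comparisons in the four-way arrangement and 512 in the fully associative one, which is the cost the class selector is meant to reduce.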
As semiconductor processes have improved, processor clock frequencies have generally increased more significantly than memory access times have decreased. As a result, the latencies incurred when processors retrieve operand data and instructions from memory, as measured in processor clock ticks, have increased significantly.
Processor architects have addressed this problem by incorporating several levels of cache to retain frequently used data and instructions closer to the processor (from a latency perspective), as well as using techniques to hide or overlap the incurred latencies, e.g., prefetching, speculative execution, out-of-order execution, and multi-threading.
These approaches have achieved varying success in alleviating the negative effects of waiting for retrieval of operand data and instructions from memory (or, in the case of multi-level cache hierarchies, from a more distant level in the hierarchy). Since processor core performance has improved, and a portion of the latency problem has been alleviated, the remaining cache misses have a greatly amplified negative impact on performance. Therefore, current trends strongly indicate an opportunity to improve performance by reducing the number of remaining cache misses.
Also, as semiconductor processes have improved, cache sizes have generally increased more than processor clock frequency. Caches typically retain more of the operand data and instructions used by the processor within a window determined by a number of processor clock ticks. In general, processor architects have implemented cache management methods that rely upon data and instruction usage patterns from the recent past as a predictor of data and instruction usage patterns in the near future. These techniques are based upon the principles of temporal and spatial locality, which state that within a given span of time, there is an increased probability that spatially localized regions of data and instructions will be reused.
While these methods generally improve performance for a wide variety of applications, they produce diminishing returns once a cache is large enough to hold a significant portion of the working set (or footprint) of the instructions and operand data for a given application. In general purpose, time sharing, multi-programmed computer systems, caches contain the working sets of multiple applications, middleware, and operating system code. As caches become larger, more of the operand data and code in these working sets can be retained within the cache. Current cache management methods that favor recently used instructions and operand data may allow a less valuable "fringe" part of a currently executing working set to displace a highly valuable "core" of a currently dormant working set that will resume execution in the near future.
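The displacement effect described above can be reproduced with a small model. The sketch below assumes a least-recently-used (LRU) replacement policy as one example of a method that favors recently used data; the LRUCache class, the capacity of 8 lines, and the access patterns are hypothetical and chosen only to make the effect visible.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache model with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line):
        """Touch a line; return True on a hit, False on a miss."""
        if line in self.lines:
            self.lines.move_to_end(line)       # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)     # evict least recently used
        self.lines[line] = True
        return False


cache = LRUCache(capacity=8)
core_a = [("A", i) for i in range(4)]     # core lines of application A
fringe_b = [("B", i) for i in range(8)]   # lines B touches only once

for line in core_a:       # A runs and warms its core, then goes dormant
    cache.access(line)
for line in fringe_b:     # B runs; its fringe pushes out A's core
    cache.access(line)

# A resumes: count how many of its core lines must be fetched again.
misses_on_resume = sum(not cache.access(line) for line in core_a)
print(misses_on_resume)
```

In this model every core line of the dormant application misses on resume, even though the lines that displaced them were each used only once.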
In light of the foregoing, given the increasing positive impact of cache miss reduction on computer system performance, and given the decreasing returns yielded by presently used cache management methods, it would be desirable to implement a cache method and a cache architecture that will advantageously manage the retention of operand data and instructions within a cache to improve availability and latency.
It is therefore one object of the present invention to provide an improved cache memory for a computer system.
It is another object of the present invention to provide a computer system using such a cache memory.
It is yet another object of the present invention to provide a computer system and processor that provide more efficient caching of instruction and data values by selectively assigning cache class sets to particular types of instruction and data values.
The foregoing objects are achieved in a method and apparatus for operating a cache memory in a computer system wherein the congruence class selector is formed by a portion of a memory address combined with a group selector. The group selector may be provided by a special purpose register in a processor in the computer system, may be provided by an immediate operand field, or may be provided by a general purpose register associated with a base register referenced by a data access instruction. A collection of heaps may be created for memory allocation and one or more group selectors associated with the heaps. The group selector may be stored in a structure describing an allocated memory block for retrieval by the processor when accessing that block. The congruence class selector may be a combination of a partial address bit mask and the group selector.
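One possible way to form such a congruence class selector is sketched below. The summary above does not fix how the address portion and the group selector are combined, so the layout here (group selector in the upper bits of the class index, masked address bits in the lower bits) is an assumption, as are the cache geometry, the two-bit group width, and the name class_selector.

```python
# Hypothetical geometry, assumed only for this illustration.
OFFSET_BITS = 6     # 64-byte cache lines
INDEX_BITS = 7      # 128 congruence classes in total
GROUP_BITS = 2      # assumed: 4 groups of 32 congruence classes each

ADDR_INDEX_BITS = INDEX_BITS - GROUP_BITS   # 5 class-index bits from the address
ADDR_MASK = (1 << ADDR_INDEX_BITS) - 1      # partial address bit mask


def class_selector(addr, group):
    """Combine masked address bits with a group selector (assumed layout)."""
    if not 0 <= group < (1 << GROUP_BITS):
        raise ValueError("group selector out of range")
    addr_part = (addr >> OFFSET_BITS) & ADDR_MASK
    return (group << ADDR_INDEX_BITS) | addr_part


# The same address lands in a different region of the cache per group.
print(class_selector(0x12345, 0))
print(class_selector(0x12345, 2))
```

With this layout, each group selector value confines its accesses to its own block of 32 congruence classes, so data tagged with one group cannot displace data tagged with another. In the arrangement summarized above, the group value could come from a special purpose register, an immediate operand field, or a register associated with the base register of the access.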
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.