1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to an improved method of accessing a cache memory of a processing unit, wherein the cache has a multi-level architecture, to reduce memory access latency.
2. Description of Related Art
A typical structure for a conventional computer system includes one or more processing units connected to a system memory device (random access memory or RAM) and to various peripheral, or input/output (I/O), devices such as a display monitor, a keyboard, a graphical pointer (mouse), and a permanent storage device (hard disk). The system memory device is used by a processing unit in carrying out program instructions, and stores those instructions as well as data values fed to or generated by the programs. A processing unit communicates with the peripheral devices by various means, including a generalized interconnect or bus, or direct memory-access channels. A computer system may have many additional components, such as serial and parallel ports for connection to, e.g., modems, printers, and network adapters. Other components might further be used in conjunction with the foregoing; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access the system memory, etc.
A conventional processing unit includes a processor core having various execution units and registers, as well as branch and dispatch units which forward instructions to the appropriate execution units. Caches are commonly provided for both instructions and data, to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory (RAM). These caches are referred to as "on-board" when they are integrally packaged with the processor core on a single integrated chip. Each cache is associated with a cache controller or bus interface unit that manages the transfer of values between the processor core and the cache memory.
A processing unit can include additional caches, such as a level 2 (L2) cache which supports the on-board (level 1) caches. In other words, the L2 cache acts as an intermediary between system memory and the on-board caches, and can store a much larger amount of information (both instructions and data) than the on-board caches can, but at a longer access penalty. Multi-level cache hierarchies can be provided where there are many levels of interconnected caches.
A typical system architecture is shown in FIG. 1, and is exemplary of the PowerPC™ processor marketed by International Business Machines Corporation. Computer system 10 includes a processing unit 12a, various I/O devices 14, RAM 16, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals whenever the computer is first turned on. Processing unit 12a communicates with the peripheral devices using a system bus 20 (a local peripheral bus (e.g., PCI) can be used in conjunction with the system bus). Processing unit 12a includes a processor core 22, and an instruction cache 24 and a data cache 26, which are implemented using high speed memory devices, and are integrally packaged with the processor core on a single integrated chip 28. Cache 30 (L2) supports caches 24 and 26 via a processor bus 32. For example, cache 30 may be a chip having a storage capacity of 256 or 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 30 is connected to bus 20, and all loading of information from memory 16 into processor core 22 must come through cache 30. More than one processor may be provided, as indicated by processing unit 12b.
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit and inclusivity bit fields are used to maintain cache coherency in a multi-processor computer system (indicating the validity of the value stored in the cache). The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache "hit." The collection of all of the address tags in a cache is referred to as a directory (and sometimes includes the state bit and inclusivity bit fields), and the collection of all of the value fields is the cache entry array.
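By way of illustration only, the cache line fields and the tag comparison described above can be sketched in Python. The direct-mapped organization, the line size, the number of sets, the convention that a state value of 0 means "invalid," and all class and function names are assumptions made for this sketch; none of them is taken from the system of FIG. 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical geometry, chosen only for illustration.
LINE_BYTES = 64   # bytes per cache line (block)
NUM_SETS = 128    # number of lines in this direct-mapped sketch


@dataclass
class CacheLine:
    tag: int = 0             # address tag: upper bits of the block address
    state: int = 0           # state bits; 0 is assumed to mean "invalid"
    inclusive: bool = False  # inclusivity bit
    value: bytes = b""       # the stored instruction or data


class DirectMappedCache:
    def __init__(self) -> None:
        # The tags (with state and inclusivity bits) form the directory;
        # the value fields form the cache entry array.
        self.lines: List[CacheLine] = [CacheLine() for _ in range(NUM_SETS)]

    @staticmethod
    def split(address: int) -> Tuple[int, int]:
        """Split a byte address into (tag, index)."""
        block = address // LINE_BYTES
        return block // NUM_SETS, block % NUM_SETS

    def fill(self, address: int, value: bytes) -> None:
        """Install a value for the block containing the address."""
        tag, index = self.split(address)
        self.lines[index] = CacheLine(tag=tag, state=1, value=value)

    def lookup(self, address: int) -> Optional[bytes]:
        """Return the stored value on a hit, or None on a miss."""
        tag, index = self.split(address)
        line = self.lines[index]
        if line.state != 0 and line.tag == tag:
            return line.value  # compare match on a valid line: cache "hit"
        return None            # invalid line or tag mismatch: cache "miss"
```

In this sketch, two addresses that select the same line but carry different tags cannot both be resident, so a lookup of the second address misses until the line is refilled.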
One can think of computer system performance as having several components. The first is the performance of the processor as if it had a perfect cache memory, that is, as if the processor core were always able to satisfy memory requests out of its first cache level, with no memory access latency. This mode of operation gives the highest performance, of course, but it is not realistic. In a multi-level cache architecture, the next contribution to the system performance is the mode of operation wherein an access request "misses" at the first level of the cache memory, but retrieves the requested value from the second level of cache memory. The performance penalty of this component depends on the number of additional cycles required to access the second level of cache memory, and grows in proportion to the frequency with which first level cache misses occur. In computers today, an access to the second level of cache memory is initiated only after it is determined that a request missed at the first level. As a result, the full access time of the second level cache appears as a performance penalty.
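The two components discussed above can be expressed as a simple estimate of average cycles per memory access. The following Python sketch is illustrative only; the function name, the parameter names, and the simplifying assumption that every first level miss is satisfied by the second level cache are introduced here and are not part of the described system.

```python
def cycles_per_access(l1_cycles: float,
                      l1_miss_rate: float,
                      l2_extra_cycles: float) -> float:
    """Estimate average cycles per memory access for a two-level cache.

    l1_cycles:       cycles for an access satisfied by the first level
    l1_miss_rate:    fraction of accesses that miss at the first level (0..1)
    l2_extra_cycles: additional cycles needed to reach the second level,
                     which is accessed only after the first level miss
                     has been determined

    Assumes, for simplicity, that every first level miss hits in the
    second level cache.
    """
    if not 0.0 <= l1_miss_rate <= 1.0:
        raise ValueError("l1_miss_rate must be between 0 and 1")
    # Perfect-cache component plus the first level miss penalty.
    return l1_cycles + l1_miss_rate * l2_extra_cycles
```

With a miss rate of zero, the estimate reduces to the perfect-cache case; reducing either the miss rate or the additional second level cycles lowers the penalty term.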
Terms similar to the one representing the performance degradation due to misses at the first cache level can be included in the model of system performance for misses at all levels of the memory hierarchy, to refine the estimate of system performance. It would, therefore, clearly be desirable to reduce the number of additional cycles required to fetch data from higher levels of memory, in order to improve overall system performance. It would be further advantageous if the improvement could be achieved with relatively little hardware expense, and without excessive power requirements.
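As an illustration of how similar terms for every level of the memory hierarchy might be combined, the following Python sketch sums one penalty term per level. It assumes that each level is accessed only after a miss at the previous level has been determined, that miss rates are local to each level, and that system memory always satisfies a request; the function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def average_access_cycles(levels: Sequence[Tuple[float, float]],
                          memory_cycles: float) -> float:
    """Estimate average cycles per access across a cache hierarchy.

    levels:        (access_cycles, local_miss_rate) for each cache level,
                   ordered from the first level outward
    memory_cycles: cycles to access system memory after the last cache
                   level has missed
    """
    total = 0.0
    reach = 1.0  # fraction of requests that get as far as this level
    for access_cycles, miss_rate in levels:
        total += reach * access_cycles  # cost paid by requests reaching here
        reach *= miss_rate              # only misses continue outward
    return total + reach * memory_cycles
```

For example, under these assumptions a first level cache with a 1-cycle access and a 10% miss rate, a second level cache with a 10-cycle access and a 50% local miss rate, and a 100-cycle system memory give an estimate of 1 + 0.1 × 10 + 0.05 × 100 = 7 cycles per access.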