1. Field of the Invention
The present invention generally relates to computer systems, and more specifically to an improved method of prefetching values (instructions or operand data) used by a processor core of a computer system. In particular, the present invention makes more efficient use of a cache hierarchy working in conjunction with prefetching (speculative requests).
2. Description of Related Art
The basic structure of a conventional computer system includes one or more processing units connected to various input/output devices for the user interface (such as a display monitor, keyboard and graphical pointing device), a permanent memory device (such as a hard disk, or a floppy diskette) for storing the computer's operating system and user programs, and a temporary memory device (such as random access memory or RAM) that is used by the processor(s) in carrying out program instructions. The evolution of computer processor architectures has transitioned from the now widely-accepted reduced instruction set computing (RISC) configurations, to so-called superscalar computer architectures, wherein multiple and concurrently operable execution units within the processor are integrated through a plurality of registers and control mechanisms.
The objective of superscalar architecture is to employ parallelism to maximize or substantially increase the number of program instructions (or "micro-operations") simultaneously processed by the multiple execution units during each interval of time (processor cycle), while ensuring that the order of instruction execution as defined by the programmer is reflected in the output. For example, the control mechanism must manage dependencies among the data being concurrently processed by the multiple execution units, and the control mechanism must ensure that integrity of sequentiality is maintained in the presence of precise interrupts and restarts. The control mechanism preferably provides instruction deletion capability such as is needed with instruction-defined branching operations, yet retains the overall order of the program execution. It is desirable to satisfy these objectives consistent with the further commercial objectives of minimizing electronic device count and complexity.
An illustrative embodiment of a conventional processing unit for processing information is shown in FIG. 1, which depicts the architecture for a PowerPC™ microprocessor 12 manufactured by International Business Machines Corp. (IBM, assignee of the present invention). Processor 12 operates according to reduced instruction set computing (RISC) techniques, and is a single integrated circuit superscalar microprocessor. As discussed further below, processor 12 includes various execution units, registers, buffers, memories, and other functional units, which are all formed by integrated circuitry.
Processor 12 is coupled to a system bus 20 via a bus interface unit (BIU) 30 within processor 12. BIU 30 controls the transfer of information between processor 12 and other devices coupled to system bus 20 such as a main memory 18. Processor 12, system bus 20, and the other devices coupled to system bus 20 together form a host data processing system. Bus 20, as well as various other connections described, includes more than one line or wire, e.g., the bus could be a 32-bit bus. BIU 30 is connected to a high speed instruction cache 32 and a high speed data cache 34. A lower level (L2) cache (not shown) may be provided as an intermediary between processor 12 and system bus 20. An L2 cache can store a much larger amount of information (instructions and operand data) than the on-board caches can, but at a longer access penalty. For example, the L2 cache may be a chip having a storage capacity of 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. A given cache line usually has several memory words, e.g., a 64-byte line contains eight 8-byte words.
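The line and word arithmetic in the example above can be illustrated with a short sketch. Only the 64-byte line and 8-byte word sizes come from the text; the function name and the sample address are hypothetical and chosen for illustration.

```python
LINE_BYTES = 64  # cache line size used in the example in the text
WORD_BYTES = 8   # word size used in the example in the text

WORDS_PER_LINE = LINE_BYTES // WORD_BYTES


def split_address(addr):
    """Split a byte address into (line number, word within line, byte within word).

    Hypothetical helper: the text gives only the sizes, not this decomposition.
    """
    line_number = addr // LINE_BYTES
    offset_in_line = addr % LINE_BYTES
    return line_number, offset_in_line // WORD_BYTES, offset_in_line % WORD_BYTES


if __name__ == "__main__":
    print(WORDS_PER_LINE)         # number of 8-byte words in a 64-byte line
    print(split_address(0x1234))  # decomposition of an arbitrary sample address
```

All eight words of a line travel together between levels of the hierarchy, which is why a single request can satisfy several nearby accesses.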
The output of instruction cache 32 is connected to a sequencer unit 36 (instruction dispatch unit, also referred to as an instruction sequence unit or ISU). In response to the particular instructions received from instruction cache 32, sequencer unit 36 outputs instructions to other execution circuitry of processor 12, including six execution units, namely, a branch unit 38, a fixed-point unit A (FXUA) 40, a fixed-point unit B (FXUB) 42, a complex fixed-point unit (CFXU) 44, a load/store unit (LSU) 46, and a floating-point unit (FPU) 48.
The inputs of FXUA 40, FXUB 42, CFXU 44 and LSU 46 also receive source operand information from general-purpose registers (GPRs) 50 and fixed-point rename buffers 52. The outputs of FXUA 40, FXUB 42, CFXU 44 and LSU 46 send destination operand information for storage at selected entries in fixed-point rename buffers 52. CFXU 44 further has an input and an output connected to special-purpose registers (SPRs) 54 for receiving and sending source operand information and destination operand information, respectively. An input of FPU 48 receives source operand information from floating-point registers (FPRs) 56 and floating-point rename buffers 58. The output of FPU 48 sends destination operand information to selected entries in floating-point rename buffers 58.
As is well known by those skilled in the art, each of execution units 38-48 executes one or more instructions within a particular class of sequential instructions during each processor cycle. For example, FXUA 40 performs fixed-point mathematical operations such as addition, subtraction, ANDing, ORing, and XORing utilizing source operands received from specified GPRs 50. Conversely, FPU 48 performs floating-point operations, such as floating-point multiplication and division, on source operands received from FPRs 56. As its name implies, LSU 46 executes floating-point and fixed-point instructions which either load operand data from memory (i.e., from data cache 34) into selected GPRs 50 or FPRs 56, or which store data from selected GPRs 50 or FPRs 56 to memory 18. Processor 12 may include other registers, such as configuration registers, memory management registers, exception handling registers, and miscellaneous registers, which are not shown.
Processor 12 carries out program instructions from a user application or the operating system, by routing the instructions and operand data to the appropriate execution units, buffers and registers, and by sending the resulting output to the system memory device (RAM), or to some output device such as a display console or printer. A computer program can be broken down into a collection of processes which are executed by the processor(s). The smallest unit of operation to be performed within a process is referred to as a thread. The use of threads in modern operating systems is well known. Threads allow multiple execution paths within a single address space (the process context) to run concurrently on a processor. This "multithreading" increases throughput in a multi-processor system, and provides modularity in a uni-processor system.
One problem with conventional processing is that operations are often delayed as they must wait on an instruction or item of data before processing of a thread may continue. One way to mitigate this effect is with multithreading, which allows the processor to switch its context and run another thread that is not dependent upon the requested value. Another approach to reducing overall memory latency is the use of caches, as discussed above. A related approach involves the prefetching of values. "Prefetching" refers to the speculative retrieval of values (operand data or instructions) from the memory hierarchy, and the temporary storage of the values in registers or buffers near the processor core, before they are actually needed. Then, when the value is needed, it can quickly be supplied to the sequencer unit, after which it can be executed (if it is an instruction) or acted upon (if it is data). Prefetch buffers differ from a cache in that a cache may contain values that were loaded in response to the actual execution of an operation (a load or i-fetch operation), while prefetching retrieves values prior to the execution of any such operation.
An instruction prefetch queue may hold, e.g., eight instructions to provide look-ahead capability. Branch unit 38 searches the instruction queue in sequencer unit 36 (typically only the bottom half of the queue) for a branch instruction and uses static branch prediction on unresolved conditional branches to allow the instruction fetch unit (IFU) to speculatively request instructions from a predicted target instruction stream while a conditional branch is evaluated (branch unit 38 also folds out branch instructions for unconditional branches). Static branch prediction is a mechanism by which software (for example, a compiler program) can give a hint to the computer hardware about the direction that the branch is likely to take. In this manner, when a correctly predicted branch is resolved, instruction execution continues without interruption along the predicted path. If branch prediction is incorrect, the IFU flushes all instructions from the instruction queue. Instruction issue then resumes with the instruction from the correct path.
A prefetch mechanism for operand data may also be provided within bus interface unit 30. This prefetch mechanism monitors the cache operations (i.e., cache misses) and detects data streams (requests to sequential memory lines). Based on the detected streams and using known patterns, BIU 30 speculatively issues requests for operand data which have not yet been requested. BIU 30 can typically have up to four outstanding (detected) streams. Reload buffers are used to store the data until requested by data cache 34.
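The stream detection described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the mechanism of BIU 30: it treats two requests to consecutive lines as a detected stream, prefetches one line ahead, and replaces the oldest stream when all four slots are in use. The class name `StreamPrefetcher`, the method `observe`, and the replacement policy are hypothetical.

```python
from collections import OrderedDict

MAX_STREAMS = 4  # the text mentions up to four outstanding streams


class StreamPrefetcher:
    """Toy sequential-stream detector (illustrative assumptions throughout)."""

    def __init__(self, max_streams=MAX_STREAMS):
        self.max_streams = max_streams
        # Keys are the next line each stream expects; oldest stream first.
        self.streams = OrderedDict()
        self.last_line = None

    def observe(self, line):
        """Record a request for a cache line; return the lines to prefetch."""
        prefetches = []
        if line in self.streams:
            # The request continues a known stream: advance it by one line.
            del self.streams[line]
            self.streams[line + 1] = True
            prefetches.append(line + 1)
        elif self.last_line is not None and line == self.last_line + 1:
            # Two consecutive lines were requested: treat this as a new stream.
            if len(self.streams) >= self.max_streams:
                self.streams.popitem(last=False)  # drop the oldest stream
            self.streams[line + 1] = True
            prefetches.append(line + 1)
        self.last_line = line
        return prefetches


if __name__ == "__main__":
    p = StreamPrefetcher()
    for line in (10, 11, 12, 50):
        print(line, p.observe(line))
```

A request that does not extend any stream (line 50 in the usage above) produces no speculative request, which corresponds to the case where the needed data is not waiting in the reload buffers.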
In spite of such approaches to reducing the effects of memory latencies, there are still significant delays associated with operations requiring memory access. As alluded to above, one cause of such delays is the incorrect prediction of a branch (for instructions) or a stream (for operand data). In the former case, the unused, speculatively requested instructions must be flushed, directly stalling the core. In the latter case, missed data is not available in the prefetch reload queues, and a considerable delay is incurred while the data is retrieved from elsewhere in the memory hierarchy. Much improvement is needed in the prefetching mechanism.
Another cause of significant delay is related to the effects that prefetching has on the cache hierarchy. For example, in multi-level cache hierarchies, it might be efficient under certain conditions to load prefetch values into lower cache levels, but not into upper cache levels. Also, when a speculative prefetch request misses a cache, the request may have to be retried an excessive number of times (when the lower level storage subsystem is busy), which unnecessarily wastes bus bandwidth, and the requested value might not ever be used. Furthermore, a cache can easily become "polluted" with speculative request data, i.e., the cache contains so much prefetch data that demand requests (those requests arising from actual load or i-fetch operations) frequently miss the cache. In this case the prefetch mechanism has overburdened the capacity of the cache, which can lead to thrashing. The cache replacement/victimization algorithm (such as a least-recently used, or LRU, algorithm) cannot account for the nature of the prefetch request. Moreover, after prefetched data has been used by the core (and is no longer required), it may stay in the cache for a relatively long time due to the LRU algorithm and might thus indirectly contribute to further cache misses (which is again particularly troublesome with misses of demand requests, rather than speculative requests). Finally, in multi-processor systems wherein one or more caches are shared by a plurality of processors, prefetching can result in uneven (and inefficient) use of the cache with respect to the sharing processors.
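The pollution effect can be reproduced with a toy model. The sketch below assumes a tiny fully associative cache with plain LRU replacement that makes no distinction between demand and prefetch lines; the capacity, addresses, and names (`LRUCache`, `pollution_demo`) are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache with LRU replacement (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> "demand" or "prefetch", LRU first

    def access(self, addr, kind="demand"):
        """Touch a line; return True on a hit, False on a miss (line is loaded)."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict LRU line, whatever its kind
        self.lines[addr] = kind
        return False


def pollution_demo():
    """Return demand hits on a two-line working set before and after a prefetch burst."""
    cache = LRUCache(capacity=4)
    working_set = (1, 2)
    for addr in working_set:
        cache.access(addr)  # initial demand loads
    hits_before = sum(cache.access(addr) for addr in working_set)
    for addr in range(100, 104):  # speculative lines that are never used
        cache.access(addr, kind="prefetch")
    hits_after = sum(cache.access(addr) for addr in working_set)
    return hits_before, hits_after


if __name__ == "__main__":
    print(pollution_demo())
```

Because the replacement policy sees only recency, four unused speculative lines are enough to push the entire demand working set out of this four-line cache.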
Another cause of delay related to multi-level cache hierarchies is the need to access a directory for each level, typically contained within that particular storage level. Directories provide means for indexing values in the data portion of the cache, and also maintain information about whether a cache entry is valid or whether it is "dirty," which means that the data is conditionally invalid due to access by another cache user in a multiprocessor system. Entries in a directory are matched with addresses of values to determine whether the value is present in the level, or must be loaded. The presence of a value is determined by comparing the tag associated with the address of that value with entries in the directory. This is a time-consuming process, which can stall the access to the cache waiting for the match to be found.
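The tag comparison described above can be sketched as follows. The geometry (64-byte lines, 256 rows, four entries per row), the field layout, and the names `Directory` and `DirEntry` are assumptions made for illustration; the text does not specify them.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumed line size
NUM_ROWS = 256   # assumed number of directory rows selected by the index bits
WAYS = 4         # assumed number of entries compared per row


@dataclass
class DirEntry:
    tag: int = 0
    valid: bool = False
    dirty: bool = False


class Directory:
    """Toy cache directory: one row of tagged entries per index value."""

    def __init__(self):
        self.rows = [[DirEntry() for _ in range(WAYS)] for _ in range(NUM_ROWS)]

    @staticmethod
    def split(addr):
        """Return (tag, index) for a byte address."""
        line = addr // LINE_BYTES
        return line // NUM_ROWS, line % NUM_ROWS

    def lookup(self, addr):
        """Compare the address tag with every entry in the row; return the way or None."""
        tag, index = self.split(addr)
        for way, entry in enumerate(self.rows[index]):
            if entry.valid and entry.tag == tag:
                return way
        return None

    def install(self, addr, way):
        """Record that the line holding addr now lives in the given way."""
        tag, index = self.split(addr)
        self.rows[index][way] = DirEntry(tag=tag, valid=True)


if __name__ == "__main__":
    d = Directory()
    print(d.lookup(0x12345))  # not present yet
    d.install(0x12345, way=2)
    print(d.lookup(0x12345))  # found after install
```

In hardware the comparisons within a row are typically made in parallel, but the lookup must still complete before the data portion can be used, which is the stall the passage refers to.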
In light of the foregoing, it would be desirable to provide a method of speeding up core processing by improving the prefetching and cache mechanisms, particularly with respect to the interaction of the prefetching mechanism with the cache hierarchy. It would be further advantageous if the method allowed a programmer to optimize various features of the prefetching mechanism.
It is therefore one object of the present invention to provide an improved cache for a computer system, having a mechanism for improving access to instructions and/or operand data.
It is yet another object of the present invention to provide a computer system that makes more efficient use of a cache hierarchy by improving access to directories in the cache hierarchy.
The foregoing objects are achieved in methods and apparatus for operating a multi-level cache memory in a computer system, comprising the steps of creating a directory describing the contents of a lower-level cache; assigning at least one set from a higher-level cache to contain the directory; and holding the directory in the set(s). The set or sets can further be reassigned to general use if a lower-level cache is detected as absent. This method and apparatus may be nested in that more than one level in a multi-level cache may contain the directory of the next lower level in one or more of its sets. An address comparator may be attached directly to the set, allowing for rapid comparison of the directory entries with address values. A cache may be of a variable latency type, and the set for use with the directory may be specifically chosen based on the latency of the set.
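The arrangement summarized above can be modeled loosely in software. The sketch below is an illustration under assumptions, not the claimed apparatus: it represents a higher-level cache as a numbered collection of sets, reserves some of them to hold the lower-level directory, and returns them to general use when no lower-level cache is present. The class name, the dictionary used as a directory, and the set counts are all hypothetical.

```python
class HigherLevelCache:
    """Toy model of a cache that holds the next lower level's directory in its sets."""

    def __init__(self, num_sets=8, directory_sets=1, lower_level_present=True):
        self.num_sets = num_sets
        # Sets reserved for the lower-level directory (none if that level is absent).
        self.directory_sets = set(range(directory_sets)) if lower_level_present else set()
        # Hypothetical directory contents: address tag -> state string.
        self.lower_directory = {} if lower_level_present else None

    @property
    def data_sets(self):
        """Sets available for ordinary instruction and operand data."""
        return [s for s in range(self.num_sets) if s not in self.directory_sets]

    def record_lower_line(self, tag, state="valid"):
        """Note in the held directory that the lower level contains this line."""
        if self.lower_directory is None:
            raise RuntimeError("no lower-level directory is being held")
        self.lower_directory[tag] = state

    def lower_level_has(self, tag):
        """Check the held directory without accessing the lower level itself."""
        return self.lower_directory is not None and tag in self.lower_directory

    def reassign_to_general_use(self):
        """Release the directory sets, e.g. when the lower level is detected as absent."""
        self.directory_sets.clear()
        self.lower_directory = None


if __name__ == "__main__":
    cache = HigherLevelCache()
    cache.record_lower_line(0x40)
    print(len(cache.data_sets), cache.lower_level_has(0x40))
    cache.reassign_to_general_use()
    print(len(cache.data_sets), cache.lower_level_has(0x40))
```

The trade-off visible in this model is that each set given to the directory reduces the capacity left for data, which is why releasing the sets matters when there is no lower level to describe.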
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.