1. Field of the Invention
The present invention generally relates to computer systems, and more specifically to an improved method of accessing memory values (operand data or instructions) used by a processor of a computer system. In particular, the present invention makes more efficient use of a multi-level cache hierarchy, and ports values directly to, e.g., a rename register, instruction buffer, or translation table of the processor without the need for load queues or reload buffers in high level caches.
2. Description of Related Art
The basic structure of a conventional computer system includes one or more processing units connected to various input/output devices for the user interface (such as a display monitor, keyboard and graphical pointing device), a permanent memory device (such as a hard disk or a floppy diskette) for storing the computer's operating system and user programs, and a temporary memory device (such as random access memory or RAM) that is used by the processor(s) in carrying out program instructions. The evolution of computer processor architectures has transitioned from the now widely-accepted reduced instruction set computing (RISC) configurations, to so-called superscalar computer architectures, wherein multiple and concurrently operable execution units within the processor are integrated through a plurality of registers and control mechanisms.
The objective of superscalar architecture is to employ parallelism to maximize or substantially increase the number of program instructions (or "micro-operations") simultaneously processed by the multiple execution units during each interval of time (processor cycle), while ensuring that the order of instruction execution as defined by the programmer is reflected in the output. For example, the control mechanism must manage dependencies among the data being concurrently processed by the multiple execution units, and the control mechanism must ensure that integrity of sequentiality is maintained in the presence of precise interrupts and restarts. The control mechanism preferably provides instruction deletion capability such as is needed with instruction-defined branching operations, yet retains the overall order of the program execution. It is desirable to satisfy these objectives consistent with the further commercial objectives of minimizing electronic device count and complexity.
An illustrative embodiment of a conventional processing unit for processing information is shown in FIG. 1, which depicts the architecture for a PowerPC™ microprocessor 12 manufactured by International Business Machines Corp. (IBM, assignee of the present invention). Processor 12 operates according to reduced instruction set computing (RISC) techniques, and is a single integrated circuit superscalar microprocessor. As discussed further below, processor 12 includes various execution units, registers, buffers, memories, and other functional units, which are all formed by integrated circuitry.
Processor 12 is coupled to a system bus 20 via a bus interface unit BIU 30 within processor 12. BIU 30 controls the transfer of information between processor 12 and other devices coupled to system bus 20 such as a main memory 18. Processor 12, system bus 20, and the other devices coupled to system bus 20 together form a host data processing system. Bus 20, as well as various other connections described, includes more than one line or wire, e.g., the bus could be a 32-bit bus. BIU 30 is connected to a high speed instruction cache 32 and a high speed data cache 34. A lower level (L2) cache (not shown) may be provided as an intermediary between processor 12 and system bus 20. An L2 cache can store a much larger amount of information (instructions and operand data) than the on-board caches can, but at a longer access penalty. For example, the L2 cache may be a chip having a storage capacity of 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. A given cache line usually has several memory words, e.g., a 64-byte line contains eight 8-byte words.
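The line and word sizes in the example above can be illustrated with a short sketch. It assumes a flat byte address and uses only the 64-byte line and 8-byte word figures from the text; the function name split_address is hypothetical and is chosen only for illustration.

```python
LINE_BYTES = 64   # cache line size from the example in the text
WORD_BYTES = 8    # word size from the example in the text

def split_address(addr):
    """Return (line_base, word_index) for a byte address (illustrative only)."""
    line_base = addr & ~(LINE_BYTES - 1)            # clear the low 6 address bits
    word_index = (addr % LINE_BYTES) // WORD_BYTES  # position 0..7 within the line
    return line_base, word_index

# Number of 8-byte words held by one 64-byte line
words_per_line = LINE_BYTES // WORD_BYTES
```

Under these assumptions, a processor request for one word identifies both the line that a cache would hold and the position of the requested word within that line.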
The output of instruction cache 32 is connected to a sequencer unit 36 (instruction dispatch unit). In response to the particular instructions received from instruction cache 32, sequencer unit 36 outputs instructions to other execution circuitry of processor 12, including six execution units, namely, a branch unit 38, a fixed-point unit A (FXUA) 40, a fixed-point unit B (FXUB) 42, a complex fixed-point unit (CFXU) 44, a load/store unit (LSU) 46, and a floating-point unit (FPU) 48.
The inputs of FXUA 40, FXUB 42, CFXU 44 and LSU 46 also receive source operand information from general-purpose registers (GPRs) 50 and fixed-point rename buffers 52. The outputs of FXUA 40, FXUB 42, CFXU 44 and LSU 46 send destination operand information for storage at selected entries in fixed-point rename buffers 52. CFXU 44 further has an input and an output connected to special-purpose registers (SPRs) 54 for receiving and sending source operand information and destination operand information, respectively. An input of FPU 48 receives source operand information from floating-point registers (FPRs) 56 and floating-point rename buffers 58. The output of FPU 48 sends destination operand information to selected entries in floating-point rename buffers 58.
As is well known by those skilled in the art, each of execution units 38-48 executes one or more instructions within a particular class of sequential instructions during each processor cycle. For example, FXUA 40 performs fixed-point mathematical operations such as addition, subtraction, ANDing, ORing, and XORing utilizing source operands received from specified GPRs 50. Conversely, FPU 48 performs floating-point operations, such as floating-point multiplication and division, on source operands received from FPRs 56. As its name implies, LSU 46 executes floating-point and fixed-point instructions which either load operand data from memory (i.e., from data cache 34) into selected GPRs 50 or FPRs 56, or which store data from selected GPRs 50 or FPRs 56 to memory 18.
Processor 12 may include other registers, such as configuration registers, memory management registers, exception handling registers, and miscellaneous registers, which are not shown. Processor 12 carries out program instructions from a user application or the operating system, by routing the instructions and operand data to the appropriate execution units, buffers and registers, and by sending the resulting output to the system memory device (RAM), or to some output device such as a display console.
Register sets such as those described above limit superscalar processing, simply due to the number of registers that are available to a particular execution unit at the beginning of instruction execution (i.e., the registers must be shared among the different execution units). Moreover, superscalar operations are typically "pipelined," that is, a plurality of processing stages are provided for a given execution unit, with each stage able to operate on one instruction at the same time that a different stage is operating on another instruction, so the registers must be further shared. The problem is exacerbated when a long sequence of instructions requires access to the same register set. Furthermore, programmers often use the same registers as temporary storage registers rather than moving data to and from system memory (since the latter process takes a large amount of time relative to processor speed), so a small register set can cause a "bottleneck" in the performance stream. Techniques have been devised for expanding the effective number of available registers, such as by providing register renaming (using rename buffers 52 and 58). Register renaming provides a larger set of registers by assigning a new physical register every time a register (architected) is written. A physical register is released for re-use when an instruction that overwrites the architected state maintained in that register completes.
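The renaming scheme just described can be sketched as follows. This is a simplified model rather than the actual logic of rename buffers 52 and 58; the class name RenameMap, its method names, and the free-list policy are all assumptions made for illustration.

```python
class RenameMap:
    """Toy register renamer: a map from architected to physical registers."""

    def __init__(self, num_arch, num_phys):
        # Architected register i initially maps to physical register i
        self.map = list(range(num_arch))
        # Remaining physical registers are free for allocation
        self.free = list(range(num_arch, num_phys))

    def write(self, arch_reg):
        """Assign a new physical register for a write; return (new, old)."""
        if not self.free:
            raise RuntimeError("no free physical register: dispatch must stall")
        old = self.map[arch_reg]
        new = self.free.pop(0)
        self.map[arch_reg] = new
        return new, old

    def complete(self, old_phys):
        """The overwriting instruction completed, so release the old register."""
        self.free.append(old_phys)
```

In this sketch the exhaustion of the free list corresponds to the bottleneck discussed above: a write cannot be dispatched until an earlier overwriting instruction completes and releases a physical register.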
One problem with conventional processing is that operations are often delayed as they must be issued or completed using queues or buffers. For example, when the processor executes a load instruction (via load/store unit 46), the data (L1) cache 34 is first examined to see if the requested memory block is already in the cache. If not (a "cache miss"), the load operation will be entered into a load queue (not shown) of the cache. The load queue severely limits the number of outstanding loads that can be pending in the system. Typically, there are only two or three entries in the load queue, as most systems rely on the assumption that the majority of accesses will be for operand data that is already in the L1 cache (cache "hits"). If the load queue is already full and another cache miss occurs, the processor core stalls until an entry in the queue becomes available.
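The stall behavior of a small load queue can be modeled with a minimal sketch. The class LoadQueue and its methods are hypothetical, and the default of two entries simply follows the "two or three entries" figure given above.

```python
class LoadQueue:
    """Toy model of an L1 load queue with a fixed number of entries."""

    def __init__(self, entries=2):
        self.entries = entries
        self.pending = []          # addresses of outstanding misses

    def miss(self, addr):
        """Record an L1 miss; return False if the core would have to stall."""
        if len(self.pending) >= self.entries:
            return False           # queue full: no entry for this miss
        self.pending.append(addr)
        return True

    def reload(self, addr):
        """Lower-level memory supplied the data, freeing the entry."""
        self.pending.remove(addr)
```

With two entries, a third outstanding miss is refused until one of the earlier loads is satisfied, which is the stall condition described in the preceding paragraph.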
Several other processing delays are associated with the operation of, or interaction with, the caches, particularly the L1 cache. For example, on a cache miss with a set associative cache, it is necessary to select a cache line in a particular set of the cache for use with the newly requested data (a process referred to as eviction or victimization). The request cannot be passed down to the lower storage subsystem until a victim is chosen. If the chosen victim has been previously modified (the object of a store operation), then the modified value must be aged out (cast out). The logic unit used to select the victim, such as a least-recently (or less recently) used (LRU) algorithm, must also be updated in the L1 cache. These steps are located in the critical path of processor core execution.
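The victim selection, cast-out, and LRU update steps can be sketched for a single set of a set associative cache. This is an illustrative model only: the class CacheSet is hypothetical, a true LRU policy is assumed, and tags stand in for full cache lines.

```python
from collections import OrderedDict

class CacheSet:
    """One set of a set associative write-back cache with true LRU."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> modified flag, least recent first

    def access(self, tag, is_store=False):
        """Return (hit, castout_tag); castout_tag is set only for a modified victim."""
        if tag in self.lines:
            self.lines[tag] = self.lines[tag] or is_store
            self.lines.move_to_end(tag)          # update the LRU state
            return True, None
        castout = None
        if len(self.lines) >= self.ways:
            # Choose the least recently used line as the victim
            victim, modified = self.lines.popitem(last=False)
            if modified:
                castout = victim                 # modified data must be cast out
        self.lines[tag] = is_store
        return False, castout
```

Every step inside access runs before the miss can be forwarded to the lower level, which is why the text places these operations in the critical path.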
Similarly, a reload buffer (not shown) is used to temporarily hold values before they are written to the L1 cache to handle cache read/write collisions. When the lower level memory hierarchy supplies the value requested by a load operation, the response (operand data and address) first enters the reload buffer.
Delays may likewise occur for store (write) operations, which use a store queue. These types of delays can also arise with operations whose targets are other than rename registers, such as instruction fetch units, or translation tables requesting addresses. Translation tables commonly used in processors include translation lookaside buffers, which convert virtual addresses to physical addresses (for either instructions or operand data, i.e., ITLBs and DTLBs), and effective-to-real address tables (ERATs).
An additional delay is presented by the requirement that the entire cache line be received by the L1 cache prior to passing the critical value on to the appropriate element within the processor (e.g., to a register rename buffer, translation lookaside buffer, or instruction dispatch unit). In fact, the entire cache line of, say, 64 bytes must be loaded into the L1 cache even though the processor only requested an 8-byte word (the L1 cache controller provides the smaller granularity on the processor output side).
As noted above, a cache line victim representing modified data must be written to the lower levels of the memory hierarchy; this is true for a "write-back" cache, where data values are not immediately passed on to the remainder of the memory hierarchy after a store operation. Caches can also be "write-through," but this leads to increased demands on bus bandwidth. Write-back caches use state information bits to maintain consistency within the overall memory hierarchy (coherency), combined with the monitoring (snooping) of memory operations. One example of the state information is that supplied by the "MESI" cache coherency protocol, wherein a cache line can be in one of four coherency states: Modified, Exclusive, Shared or Invalid. Cache coherency protocols introduce further complexities and requirements into the interaction of the caches.
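The four MESI states can be illustrated with a simplified transition function for a single cache line. The sketch follows the commonly described textbook form of the protocol and omits bus transactions, write-backs, and implementation-specific variants; the event names and the others_have_copy parameter are hypothetical.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def next_state(state, event, others_have_copy=False):
    """Return the new state of one cache line after an event.

    event is one of the hypothetical names "local_read", "local_write",
    "snoop_read" or "snoop_write".
    """
    if event == "local_read":
        if state is State.INVALID:
            # A read miss loads the line as Shared or Exclusive
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state
    if event == "local_write":
        return State.MODIFIED      # other copies, if any, are invalidated
    if event == "snoop_read":
        if state in (State.MODIFIED, State.EXCLUSIVE):
            return State.SHARED    # a Modified line is written back first
        return state
    if event == "snoop_write":
        return State.INVALID       # another device intends to modify the line
    raise ValueError(event)
```

Even in this reduced form, each memory operation requires a state lookup and possibly a state change, which is one source of the additional complexity noted above.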
In light of the foregoing, it would be desirable to provide a method of speeding up core processing by improving the operation of the caches, particularly the L1 cache. It would be particularly advantageous if the method could provide values (instructions or operand data) more directly to processor components, i.e., without requiring the use of so many intervening queues and buffers, and allow more flexibility in the interaction between a cache and a processor or between vertically adjacent caches (e.g., L1 and L2) in a multi-cache hierarchy.
It is therefore one object of the present invention to provide an improved data processing system having one or more local caches in the memory hierarchy.
It is another object of the present invention to provide such an improved data processing system having a multi-level cache structure, and at least one layered cache wherein one or more cache functions are handled by a lower level cache.
It is yet another object of the present invention to provide a memory structure for a computer system which speeds up memory accesses by removing or distancing cache functions from the critical path of core execution.
The foregoing objects are achieved in a method of accessing values stored in a memory array of a computer system, comprising the steps of issuing a request from a device of the computer system to load a value from the memory array, the device having a first granularity for receiving memory lines from said memory array, and a second granularity for receiving a specific subset of the first granularity, and sending a pair of flags along with the request which specify which granularities are requested from the memory subsystem. If both granularities of data are to be returned to the requesting device, then the two granularities are returned via two separate data bus transactions. The invention may support heterogeneous devices on the system bus. The requesting device could be an I/O device which may only be able to use the first granularity, in which case it sets the outbound flags to request only the first granularity. More particularly, the device may be a processing unit which includes at least one cache with cache lines having the first granularity, and a requested value having the second granularity is register data. When the cache issues a system bus address transaction due to a processor load request which missed in the cache, the cache may set the outbound flags to request only the second granularity, or the first granularity, or both granularities. The advantage of requesting only the second granularity (register data) is that it does not require that the cache controller allocate a full cache line reload buffer to receive the data. This approach enables the implementation of a larger number of queues in the cache controller, not all of which require a data reload buffer large enough to hold a full cache line of data.
Also, the advantage of requesting both the first granularity and the second granularity is that even if the full cache line of data is desired by the cache controller, the second granularity (the register data requested by the processor core) can typically be returned by the memory subsystem with a lower latency than that for a full cache line. Therefore, the register data can be forwarded to the requesting core before the full cache line which contains the requested data is received from memory. When the memory subsystem returns the requested data, the granularity of the data bus transaction is determined by a pair of inbound flags. The first flag identifies the data as being of the first granularity or the second granularity. If both granularities were requested, the second (smaller) granularity is always returned with the first of two separate bus transactions. When the second granularity is returned (in the first bus transaction), the second flag indicates whether the first granularity (the second bus transaction) will occur or not. This approach allows the memory subsystem to return the first granularity imprecisely, i.e., to decline to return it even though both granularities were requested. Consequently, even if the device requested both granularities, the device must still be able to accept only the second (smaller) granularity. The advantage of returning only the second granularity (register data) is that it does not require that the memory controller allocate a full cache line data buffer to return the data. This enables the implementation of a larger number of queues in the memory controller, not all of which require a data buffer large enough to hold a full cache line of data.
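The outbound and inbound flag pairs described above can be sketched as a small software model. It is an illustration under stated assumptions rather than the claimed implementation: the names Request, DataBeat, respond and can_send_line are hypothetical, the line size is taken as 64 bytes and the register data as an 8-byte aligned word, and the memory array is modeled as a dictionary keyed by line address.

```python
from dataclasses import dataclass

LINE_BYTES = 64   # first granularity (assumed line size)
WORD_BYTES = 8    # second granularity (assumed register width)

@dataclass
class Request:
    addr: int
    want_line: bool   # outbound flag: first granularity requested
    want_word: bool   # outbound flag: second granularity requested

@dataclass
class DataBeat:
    is_line: bool        # inbound flag 1: True = first granularity
    line_follows: bool   # inbound flag 2: used only when is_line is False
    payload: bytes

def respond(req, memory, can_send_line=True):
    """Return the data bus transactions for a request, in order."""
    line_base = req.addr & ~(LINE_BYTES - 1)
    offset = req.addr - line_base        # assumes an 8-byte aligned address
    line = memory[line_base]
    # Assumption: the line may be dropped only if the word is also returned
    send_line = req.want_line and (can_send_line or not req.want_word)
    beats = []
    if req.want_word:
        # The smaller granularity always goes in the first transaction
        word = line[offset:offset + WORD_BYTES]
        beats.append(DataBeat(False, send_line, word))
    if send_line:
        beats.append(DataBeat(True, False, line))
    return beats
```

In this model, a requester that sets both outbound flags examines the second inbound flag of the first transaction to learn whether a full line will follow, so it can forward the register data at once and allocate a line buffer only when one is needed.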
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
FIG. 1 is a block diagram of a conventional superscalar computer processor, depicting execution units, buffers, registers, and the on-board (L1) data and instruction caches;
FIG. 2 is an illustration of one embodiment of a data processing system in which the present invention can be practiced;
FIG. 3 is a block diagram illustrating selected components that can be included in the data processing system of FIG. 2 according to the teachings of the present invention;
FIG. 4 is a block diagram of a processing unit constructed in accordance with one embodiment of the present invention, depicting operation of a cache structure which includes an L1 operand data cache;
FIG. 5 is a block diagram of a processing unit constructed in accordance with another embodiment of the present invention, depicting operation of a cache structure which includes an L1 instruction cache; and
FIG. 6 is a block diagram of a memory management unit constructed in accordance with another embodiment of the present invention, depicting operation of a translation lookaside buffer for storing page table entries.