1. Field of the Invention
The present invention relates to the field of microprocessor load/store execution units. Specifically, the invention relates to dynamically controlling the priority between a load buffer and a store buffer which compete for memory resources in a load/store unit of a microprocessor.
2. Discussion of the Prior Art
RISC stands for Reduced Instruction Set Computing. RISC processors normally include fewer instructions than their CISC (Complex Instruction Set Computing) predecessors. At least to some extent, the use of RISC technology simplifies the task of writing compilers for the RISC processors, since there are fewer instructions to use. Moreover, the processor design can focus on implementing and optimizing the important, frequently-used instructions, rather than having some complex, seldom-used instruction constrain the maximum operating frequency. For these reasons and others, RISC processors are gaining popularity amongst workstation and even lower-end computer makers.
In a RISC processor, it is common that very few instructions actually access memory. In fact, some implementations may have only two instructions, LOAD and STORE, that access memory. Typically, a few specialized "atomic" operations may also be supported by the processor for synchronization and memory updates by concurrent processes. See The SPARC Architecture Manual, Version 9, SPARC International, Inc., D. L. Weaver and T. Germond, Editors, PTR Prentice Hall, 1994, p. 17. Even in such cases, LOAD and STORE are by far the most frequently-used memory access instructions. The execution of a LOAD instruction will cause a processor register to be written with data resident in a main memory address. The execution of a STORE instruction will cause a main memory address to be written with data resident in a processor register. Therefore, LOAD and STORE instructions need to specify a register or registers and a memory address as operands. In some implementations, two registers or a register and an immediate value are used to calculate the memory address. Many implementations provide several varieties of LOADs and STOREs which transfer one or more words of data using one instruction; therefore, more than one source or destination register may exist. These source or destination registers are also designated as operands in the STORE or LOAD instructions, respectively.
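The LOAD and STORE semantics described above can be illustrated with a minimal sketch. The class ToyMachine, its method names, and the register and memory sizes are hypothetical and chosen only for illustration; they do not correspond to any particular instruction set.

```python
class ToyMachine:
    """Hypothetical register file plus sparse main memory, for illustration."""

    def __init__(self, num_registers=8):
        self.registers = [0] * num_registers
        self.memory = {}  # address -> word; unwritten addresses read as 0

    def effective_address(self, base_reg, offset, offset_is_register=False):
        # Address = register + register, or register + immediate value.
        addend = self.registers[offset] if offset_is_register else offset
        return self.registers[base_reg] + addend

    def load(self, dest_reg, base_reg, offset, offset_is_register=False):
        # LOAD: a register is written with data resident at a memory address.
        address = self.effective_address(base_reg, offset, offset_is_register)
        self.registers[dest_reg] = self.memory.get(address, 0)

    def store(self, src_reg, base_reg, offset, offset_is_register=False):
        # STORE: a memory address is written with data resident in a register.
        address = self.effective_address(base_reg, offset, offset_is_register)
        self.memory[address] = self.registers[src_reg]


machine = ToyMachine()
machine.registers[1] = 100   # base address held in register 1
machine.registers[2] = 42    # data to be stored, held in register 2
machine.store(2, 1, 8)       # register + immediate: memory[100 + 8] = 42
machine.load(3, 1, 8)        # register 3 receives the word at address 108
```

In this sketch the same address calculation serves both instructions; only the direction of the data transfer differs.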
Referring to FIG. 1, a superscalar processor 12 is shown. The superscalar processor 12 has several parallel functional units 10 and 11. Some typical superscalar implementations include floating point, integer, branch, and load/store functional units. Thus, up to four instructions may be executed in parallel. Most superscalar processors will also be pipelined at least to some extent. Pipelining allows the processing of instructions to occur in stages, so that it takes several clock cycles before the processing of an instruction is completed. However, in a pipelined processor architecture, several instructions can be simultaneously processed since each stage is always processing an instruction.
Referring to FIG. 2, a block diagram of the load/store unit 10 is shown. In order to maximize the total throughput of the load/store functional unit 10, a load buffer 20 and a store buffer 21 are provided so that loads and stores that cannot be immediately serviced can still be dispatched, and can wait to be serviced without holding up the operation of the processor 12. The load buffer 20 is a first-in-first-out (FIFO) queue which holds LOAD source memory addresses 22 and destination register addresses 23. The store buffer 21 is a FIFO queue which holds STORE addresses 24 and the data 25 to be stored to those addresses. By maintaining queues for LOADs and STOREs, the utilization of memory resources used by the LOAD and STORE instructions is maximized since another memory operation is immediately available after the current operation is finished.
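The two queues described above might be modeled as follows. This is a sketch only: the names FifoBuffer, LoadEntry, and StoreEntry, and the capacity of two entries used in the example, are assumptions made for illustration and are not taken from the figures.

```python
from collections import deque, namedtuple

# Entry layouts follow the description: a load entry holds a source memory
# address and a destination register; a store entry holds an address and data.
LoadEntry = namedtuple("LoadEntry", ["source_address", "dest_register"])
StoreEntry = namedtuple("StoreEntry", ["address", "data"])


class FifoBuffer:
    """Bounded first-in-first-out queue (hypothetical model of a buffer)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def is_full(self):
        return len(self.entries) >= self.capacity

    def is_empty(self):
        return not self.entries

    def enqueue(self, entry):
        # Returns False when the buffer is full, so the caller must hold
        # the instruction instead of dispatching it.
        if self.is_full():
            return False
        self.entries.append(entry)
        return True

    def dequeue(self):
        # The oldest entry is serviced first.
        return self.entries.popleft()


load_buffer = FifoBuffer(capacity=2)
store_buffer = FifoBuffer(capacity=2)
load_buffer.enqueue(LoadEntry(source_address=0x100, dest_register=3))
store_buffer.enqueue(StoreEntry(address=0x200, data=42))
```

A failed enqueue in this model corresponds to an instruction that cannot be dispatched because its buffer has no free entry.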
A common form of multi-layer memory system is cache memory, or lookaside buffer memory. A cache memory is a rather small, specialized memory device placed between the processor 12 and the main memory system. The cache memory holds copies of words from main memory that are likely to be accessed by the processor 12. The cache is faster than main memory; thus, if frequently accessed locations are found in the cache, the average memory access time will be small. Cache memory locations are redundant, in the sense that each is used to provide a more accessible copy of information also stored in slower main memory. The total addressable memory size, as seen by the programmer, is not increased by the presence of a cache. Many processors 12 have small internal caches and much larger external caches which are implemented on a separate integrated circuit. In other systems, only one cache exists between the processor 12 and main memory.
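The effect of a cache on average access time can be illustrated with a simple weighted-average model. The function name, the cycle counts, and the assumption that a miss costs exactly one main memory access are hypothetical and serve only to make the arithmetic concrete.

```python
def average_access_time(hit_rate, cache_time, memory_time):
    """Weighted average of cache and main memory access times.

    Simplified model (an assumption): a hit costs cache_time and a miss
    costs memory_time, with no additional miss-handling overhead.
    """
    return hit_rate * cache_time + (1.0 - hit_rate) * memory_time


# Illustrative figures only: a 1-cycle cache and a 10-cycle main memory.
mostly_hits = average_access_time(hit_rate=0.9, cache_time=1, memory_time=10)
mostly_misses = average_access_time(hit_rate=0.1, cache_time=1, memory_time=10)
```

Under this model, the higher the fraction of accesses found in the cache, the closer the average approaches the cache access time.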
FIG. 2 shows a load/store pre-processing block 19 which interfaces with an internal cache 29 and the load and store buffers 20 & 21. The load and store buffers 20 & 21 interface with an external cache 26. Because the external cache 26 is on a separate integrated circuit, only one read/write port normally exists into the external cache 26. In modern processors, the unit of memory transfer, or sub-block, is on the order of 144 bits or more, while the address for that sub-block is on the order of 20 bits or more. Most systems do not provide dual ports for the external cache since it is far too expensive to do so. Providing two ports would require essentially doubling the already large number of pins on the external cache memory chip. This modification would increase the complexity of the package, increase the complexity of the external cache RAM (random access memory), and increase the number of wires on the circuit board. Potentially, the die size of the external cache could also increase. Moreover, the power consumption of the system would increase significantly since output pad drivers tend to consume a relatively large amount of the total power. For all these reasons, the external cache 26 normally has only one set of address lines 27 and one bi-directional set of data lines 28.
Because the external cache RAM 26 only has one read/write port, the entries in the load buffer 20 and the entries in the store buffer 21 tend to compete for the external cache RAM 26 resource. At any given time, both the load buffer 20 and the store buffer 21 may have entries. Since only one LOAD or STORE instruction may be processed by the external cache 26 at any given time, if the load and store buffers 20 & 21 both have entries, one of the buffers will be unable to dispatch its entry to the external cache 26.
A LOAD is a request to fetch data that a subsequent instruction will need to use as an operand. Therefore, a LOAD needs to happen relatively quickly. The compiler typically creates an instruction stream that separates LOADs from the instructions which use the loaded data by a sufficiently large interval. However, if an instruction is ready to be executed before the data which that instruction uses has been loaded, the processor 12 stalls. Thus, the latency with which LOADs need to occur is dependent upon how many instructions in the instruction stream separate a LOAD from the use of the loaded data. It is certainly preferable to place fewer constraints on the compiler, and to avoid processor stalls. By reducing the load latency, the number of intervening instructions between a LOAD and the use of the load data can be reduced without risking a processor stall.
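The relationship between load latency, separation, and stalls can be sketched with a one-line model. The function name and the cycle counts below are hypothetical, and the model assumes, for illustration only, that the separation between a LOAD and its use can be expressed directly in clock cycles.

```python
def stall_cycles(load_latency, distance_to_use):
    """Cycles the processor stalls waiting for load data.

    load_latency:    cycles from LOAD issue until the data is available
    distance_to_use: cycles from LOAD issue until the "use" instruction
                     is ready to execute (an assumed, simplified measure)
    """
    return max(0, load_latency - distance_to_use)


# Illustrative figures only.
stalled = stall_cycles(load_latency=6, distance_to_use=2)   # use arrives early
covered = stall_cycles(load_latency=6, distance_to_use=6)   # latency is hidden
```

In this model, lowering load_latency reduces the distance the compiler must place between a LOAD and its use before the stall count reaches zero.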
The theory behind the load buffer 20 is that for code which has a working set that is too large to fit in the internal cache 29, a LOAD and its succeeding use can be separated, at least far enough to cover the latency of the external cache 26. If the LOAD and its use can be separated, several intervening instructions can be executed while the LOAD is occurring. The load buffer 20 allows the load/store functional unit 10 and the other execution pipelines 11 (in FIG. 1) to be somewhat decoupled in that LOADs which cannot return data immediately will not stall the pipeline, but rather, will be buffered until they can return data. A typical example of this is when a LOAD misses the internal cache 29 and must access the external cache 26. In that case, the LOAD will be placed in the load buffer 20, and the other functional unit pipelines 11 will continue moving so long as they do not encounter a "use" instruction.
On the other hand, STORE instructions do not have the strict timing constraints that LOADs have. When a STORE is issued, the processor 12 is sending data back to main memory. The stored data will probably not be needed at any time in the near future. If the stored data were needed in the near future, it would have been held in a register rather than stored to main memory.
Since the latency requirements for LOADs are more demanding than those for STOREs, previous designs have given the load buffer entries priority over the store buffer entries when the two buffers are in competition for the external cache 26.
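The fixed-priority selection used by such previous designs can be sketched as follows. The function name select_next and the string entries are hypothetical; the sketch models only the choice made for the single external cache port, not the access itself.

```python
from collections import deque


def select_next(load_buffer, store_buffer):
    """Pick the next entry for the single external cache port.

    Loads always win: a store is chosen only when no load is waiting.
    Returns ("load", entry), ("store", entry), or None if both are empty.
    """
    if load_buffer:
        return ("load", load_buffer.popleft())
    if store_buffer:
        return ("store", store_buffer.popleft())
    return None


# Illustrative entries only.
loads = deque(["L0", "L1"])
stores = deque(["S0"])
order = []
choice = select_next(loads, stores)
while choice is not None:
    order.append(choice)
    choice = select_next(loads, stores)
```

With both buffers occupied, the store is selected only after every waiting load has been serviced.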
By always giving entries in the load buffer 20 priority over the entries in the store buffer 21, the store buffer 21 tends to become full. If the entries of the load buffer 20 are always serviced prior to the entries of the store buffer 21, the entries in the store buffer 21 are never serviced until the load buffer 20 is empty of entries which must access the external cache 26. As long as the load buffer 20 has entries which must access the external cache 26, STOREs are not processed.
Usually, the frequency of STORE operations in a program is significantly lower than that of LOADs. Nonetheless, if STOREs in the store buffer do not gain access to the external cache until the load buffer is empty, the store buffer 21 can fill up to the point that all its entries are occupied. Once the store buffer is full, the processor 12 is blocked from dispatching another STORE instruction. Once a STORE instruction is issued that cannot be entered into the store buffer due to the store buffer being full, the load/store functional unit pipeline 10 (in FIG. 1) is blocked. Many modern superscalar processors 12 are intended to work with code which has been compiled so as to provide some flexibility in the order of instruction execution. Potentially, however, the entire processor 12 may be blocked since almost all programming models place some constraints on the order of instruction execution. When a STORE instruction in the instruction stream cannot be processed, it is possible that a few subsequent instructions may still be able to execute, but eventually the entire processor 12 will stall if the STORE instruction is not processed.
Thus, the prior art approach of allowing the load buffer 20 to always have priority suffers from the problem that the store buffer 21 can become full and block the processor 12 from continuing its operations.
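The blocking behavior described above can be reproduced with a small cycle-by-cycle sketch. Everything in it is an assumption made for illustration: the function name, the encoding of the instruction stream as one character per cycle, the two-entry store buffer, and an external cache port that accepts one access every two cycles.

```python
from collections import deque


def first_blocked_store(ops, store_capacity, access_cycles=2):
    """Simulate always-load-first arbitration for a single cache port.

    ops: one character per cycle; "L" dispatches a LOAD, "S" dispatches a
         STORE, and any other character dispatches no memory operation.
    Returns the cycle at which a STORE finds the store buffer full,
    or None if no STORE is ever blocked.
    """
    load_buffer = deque()
    store_buffer = deque()
    for cycle, op in enumerate(ops):
        if op == "L":
            load_buffer.append(cycle)
        elif op == "S":
            if len(store_buffer) >= store_capacity:
                return cycle  # STORE cannot be buffered: the pipeline blocks
            store_buffer.append(cycle)
        # The port accepts one access every access_cycles cycles, and a
        # waiting load is always chosen ahead of any waiting store.
        if cycle % access_cycles == 0:
            if load_buffer:
                load_buffer.popleft()
            elif store_buffer:
                store_buffer.popleft()
    return None


# Illustrative streams only.
with_loads = first_blocked_store("LSLSLSLS", store_capacity=2)
without_loads = first_blocked_store("-S-S-S-S", store_capacity=2)
```

In the first stream the loads consume every available port access, so the stores accumulate until the third STORE finds the buffer full; in the second stream, with no competing loads, each STORE drains before the next arrives.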