1. Field of the Invention
The present invention relates to an architecture of a microprocessor, and particularly to a microprocessor provided with a non-blocking load feature.
2. Description of Related Art
One of the primary objectives of microprocessor developers is to improve the instruction processing speed of a microprocessor. Using a cache memory is a well-known technique for improving processing speed. The cache memory is used to bridge the gap between the processing speed of the microprocessor and the data transfer rate of a main memory such as DRAM.
However, if data to be used in an operation is not stored in the cache memory, the data must be transferred from the low-speed main memory that resides outside the processor. The case where data to be used by the microprocessor does not exist in the cache memory on a cache access is referred to as a “miss hit” or a “cache miss”. When a cache miss occurs, the cache memory cannot be accessed during the data transfer from the main memory to the cache memory. Accordingly, execution stops not only for the load instruction that caused the cache miss but also for subsequent load instructions and store instructions. In a microprocessor that performs pipeline processing, a cache miss stalls the pipeline and thereby reduces performance.
A microprocessor provided with a non-blocking load feature for avoiding a pipeline stall due to such a cache miss is widely known. When a preceding load instruction encounters a cache miss, the non-blocking load feature temporarily saves the instruction that caused the cache miss so that subsequent load instructions can continue accessing the cache memory (see Shen, John Paul, and Mikko H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, Beta ed., McGraw-Hill, 2002, pp. 201-204).
Meanwhile, a microprocessor handles data in units of a given data length, and most current processors define this processing unit as 32 bits (4 bytes) or 64 bits (8 bytes). The processing unit is referred to as a “word”. Hereinafter, a unit of 32 bits is referred to as 1 word, and a unit of 64 bits is referred to as a double word. High-speed processing can be achieved by matching the data width of peripheral devices such as a cache memory to the data unit of the microprocessor. For example, the line width of a cache memory is configured to be 1 word or a multiple thereof, so that 1 word or a double word of data can be loaded efficiently in one cache access. Further, the unit of data for loading and storing may also be 8 bits (1 byte) or 16 bits (2 bytes, or a half word), in which case a cache memory or a main memory is accessed in the same way as when loading by a word (hereinafter referred to as a word load) or loading by a double word (hereinafter referred to as a double word load).
Storing data shorter than 1 word together with other data in 1-word units can result in stored data straddling a 1-word boundary (hereinafter referred to as a word boundary) or a line boundary of a cache memory (hereinafter referred to as a cache line boundary). Data stored straddling a word boundary is hereinafter referred to as unaligned data.
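The definition of unaligned data given above can be illustrated with a minimal sketch. It assumes a 32-bit (4-byte) word and byte addressing; the function name `straddles_word_boundary` and the constant `WORD_BYTES` are hypothetical names introduced only for this illustration.

```python
WORD_BYTES = 4  # assumption: 1 word = 32 bits = 4 bytes


def straddles_word_boundary(addr: int, size: int, word_bytes: int = WORD_BYTES) -> bool:
    """Return True if the bytes addr .. addr+size-1 span more than one word."""
    first_word = addr // word_bytes               # word holding the first byte
    last_word = (addr + size - 1) // word_bytes   # word holding the last byte
    return first_word != last_word


# A 4-byte datum starting at address 0x2 occupies 0x2..0x5 and crosses
# the word boundary between 0x3 and 0x4, so it is unaligned.
print(straddles_word_boundary(0x2, 4))
# A 4-byte datum starting at address 0x4 stays inside one word.
print(straddles_word_boundary(0x4, 4))
```

The same check applies to a cache line boundary if `word_bytes` is replaced by the line size in bytes.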
To align and load such unaligned data, two cache accesses, two or more registers, and two or more logic operations are usually required. However, the MIPS (a registered trademark of MIPS Technologies, Inc.) instruction set, which is a major RISC (Reduced Instruction Set Computer) instruction set, provides instructions such as the LWL (Load Word Left) instruction, the LWR (Load Word Right) instruction, the LDL (Load Double-word Left) instruction, and the LDR (Load Double-word Right) instruction. By executing these instructions in combination, unaligned data can be loaded with only two cache accesses and only one register, which improves efficiency.
As an example, a case of loading unaligned data using an LWL instruction and an LWR instruction is explained hereinafter in detail with reference to FIG. 7. FIG. 7 shows an operation of loading unaligned data (Y1 to Y4) from a cache memory where data are stored in big-endian format.
The mnemonic “LWL R18, 0x2(R0)” denotes an LWL instruction that stores data, from the effective address of the cache memory obtained by adding the offset value 0x2 to the value of the base register R0 up to a word boundary, in the left area of the load destination register R18. To be more specific, assuming the value of the base register R0 to be 0x0, data Y1 and Y2, which are located at addresses from the effective address 0x2 to the word boundary 0x3, and data B2 and B3, which are originally stored in the target register R18, are merged and stored back to the register R18.
Further, the mnemonic “LWR R18, 0x5(R0)” denotes an LWR instruction that stores data, from a word boundary of the cache memory up to the effective address obtained by adding the offset value 0x5 to the value of the base register R0, in the right area of the load destination register R18. To be more specific, assuming the value of the base register R0 to be 0x0, data Y3 and Y4, which are located from the effective address 0x5 back toward smaller addresses as far as the word boundary 0x3 (that is, at addresses 0x4 and 0x5), and data Y1 and Y2, which are already stored in the target register R18, are merged and stored back to the register R18.
As described in the foregoing, with the MIPS instruction set, unaligned data (Y1 to Y4) can be loaded by executing LWL and LWR instructions, each of which merges data shorter than 1 word with the original data and loads the merged data. Similarly, executing LDL and LDR instructions makes it possible to align and load double-word data that are stored unaligned.
The aforementioned instructions of the MIPS instruction set, such as the LWL, LWR, LDL, and LDR instructions, are referred to as unaligned load instructions. In other words, an unaligned load instruction is an instruction that reads data shorter than 1 word or 1 double word from a cache memory, merges the data read from the cache memory with the original data (merging data) retained by the load destination register, and stores the merged data back to the load destination register.
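The LWL and LWR merge described above can be sketched in software as follows. This is only an illustrative model of the big-endian, 32-bit behavior, not a description of any hardware implementation. The byte values chosen for Y1 to Y4 (0x11, 0x22, 0x33, 0x44), the surrounding memory bytes, and the initial register content B0 to B3 (0xB0B1B2B3) are assumptions made for the example, and the function names `lwl` and `lwr` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def load_aligned_word_be(mem: bytes, aligned_addr: int) -> int:
    """Read the 4-byte word at an aligned address as a big-endian integer."""
    return int.from_bytes(mem[aligned_addr:aligned_addr + 4], "big")


def lwl(reg: int, mem: bytes, ea: int) -> int:
    """Big-endian LWL: bytes from ea to the end of its word go to the left of reg."""
    b = ea & 3                              # byte offset inside the word
    word = load_aligned_word_be(mem, ea & ~3)
    keep_mask = (1 << (8 * b)) - 1          # low bytes of reg that are preserved
    return ((word << (8 * b)) & MASK32) | (reg & keep_mask)


def lwr(reg: int, mem: bytes, ea: int) -> int:
    """Big-endian LWR: bytes from the start of the word to ea go to the right of reg."""
    b = ea & 3
    word = load_aligned_word_be(mem, ea & ~3)
    n_bits = 8 * (b + 1)                    # number of bits taken from memory
    load_mask = (1 << n_bits) - 1
    return (word >> (32 - n_bits)) | (reg & ~load_mask & MASK32)


# Addresses 0x0..0x7; Y1..Y4 occupy 0x2..0x5 (assumed values 0x11..0x44).
MEM = bytes([0xA0, 0xA1, 0x11, 0x22, 0x33, 0x44, 0xA6, 0xA7])
r18 = 0xB0B1B2B3                            # assumed original content B0 B1 B2 B3

r18 = lwl(r18, MEM, 0x0 + 0x2)              # LWL R18, 0x2(R0) with R0 = 0x0
print(hex(r18))                             # Y1 Y2 B2 B3 -> 0x1122b2b3
r18 = lwr(r18, MEM, 0x0 + 0x5)              # LWR R18, 0x5(R0) with R0 = 0x0
print(hex(r18))                             # Y1 Y2 Y3 Y4 -> 0x11223344
```

Note that each function takes the previous register content as an input. This dependence on the original register value is what distinguishes an unaligned load from a normal load.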
FIGS. 8 and 9 show configurations of microprocessors capable of executing unaligned load instructions and provided with a non-blocking load feature, according to a conventional technique. FIG. 8 is a view showing an overall configuration of a microprocessor 8.
An instruction fetch unit 12 provides an instruction cache 11 with a content of a program counter (not shown) and then loads an instruction to an instruction register (not shown). The instruction cache 11 is a cache memory for storing an instruction.
An instruction decode unit 13 decodes an instruction and issues the instruction to one of reservation stations 15 to 18, depending on the type of the instruction. The instruction decode unit 13 provides a register file 14 with the register number specified by an operand, and the content of the operand register outputted from the register file 14 is then stored to the reservation stations 15 to 18.
If all operands are available in the reservation stations 15 to 18, an instruction is executed in execution units such as an integer arithmetic unit 20 and a load/store unit (LSU) 80.
Load instructions including an unaligned load instruction are executed in LSU 80. A configuration of LSU 80 for executing a load instruction is illustrated in FIG. 9.
When a load instruction is executed in the LSU 80, the contents of the three instruction operands retained by the reservation station 18 are set to a Sop register 101, an offset register 102, and a Top register 103. Specifically, a base register value for generating an effective address of the cache memory is set to the Sop register 101. An offset value for calculating the effective address is set to the offset register 102. Further, the content that the load destination register, to which the loaded data is to be stored, holds before execution of the load instruction is set to the Top register 103.
An address generation unit 107 adds values of the Sop register 101 and offset register 102 so as to generate an effective address of the data cache 23.
A cache control unit 108 refers to the effective address created by the address generation unit 107, and if the access results in a cache miss, the cache control unit 108 saves the load instruction in a fetch queue 109. This enables subsequent load instructions to be executed without waiting for the data to be loaded from a main memory 24, thereby avoiding a pipeline stall. FIG. 9 shows a configuration that allows 4 instructions to be held in the fetch queue 109, so that even if 4 cache misses occur, subsequent instructions can still continue to access the data cache 23. If more cache misses occur than the number of instructions that can be saved in the fetch queue 109, the cache control unit 108 outputs busy signals to the instruction decode unit 13 and the reservation station 18, and subsequent instructions are not issued while the busy signals continue to be outputted.
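The behavior of the fetch queue described above can be sketched with a simplified software model. The class `CacheControl`, its methods, the 16-byte line size, and the string results are all hypothetical and introduced only for illustration; only the queue depth of 4 and the busy condition are taken from the description.

```python
from collections import deque


class CacheControl:
    """Toy model of a cache control unit with a fetch queue for missed loads."""

    def __init__(self, cached_lines, queue_depth=4, line_bytes=16):
        self.cached_lines = set(cached_lines)  # line indexes present in the cache
        self.fetch_queue = deque()             # addresses of loads that missed
        self.queue_depth = queue_depth         # 4 entries, as in the description
        self.line_bytes = line_bytes           # assumed line size for the model

    def access(self, addr):
        """Return 'hit', 'queued' (miss saved for later), or 'busy' (queue full)."""
        if addr // self.line_bytes in self.cached_lines:
            return "hit"
        if len(self.fetch_queue) < self.queue_depth:
            self.fetch_queue.append(addr)      # save the missed load, keep going
            return "queued"
        return "busy"                          # busy signal: stop issuing

    def refill_one(self):
        """Complete the oldest outstanding refill and return its address."""
        if not self.fetch_queue:
            return None
        addr = self.fetch_queue.popleft()
        self.cached_lines.add(addr // self.line_bytes)
        return addr


cc = CacheControl(cached_lines={0})
results = [cc.access(a) for a in (0x100, 0x200, 0x300, 0x400)]
print(results)            # four misses are queued
print(cc.access(0x004))   # a hit is still served while four misses are pending
print(cc.access(0x500))   # a fifth miss cannot be queued
```

In this model a hit is always served, and the busy condition arises only when a miss finds the queue full, which corresponds to the case where the cache misses exceed the number of instructions the fetch queue can hold.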
When an unaligned load instruction such as the above-mentioned LWL instruction or LWR instruction is executed, the data returned from the data cache 23 is merged in a data merge unit 810 with the merging data set to the Top register 103, and the merged data is then stored to the load destination register of the register file 14. For a normal load instruction other than an unaligned load instruction, the data merge unit 810 stores the data returned from the data cache 23 to the register file 14 as is.
In the conventional microprocessor 8 employing a non-blocking load feature, even if a cache miss occurs when a normal load instruction other than an unaligned load instruction is executed, the data cache can still be accessed by saving the instruction in the fetch queue 109 and executing subsequent load instructions while the data cache is refilled.
However, it has now been discovered that a pipeline stall can still occur even with a non-blocking load feature if the instruction to be executed is an unaligned load instruction. The pipeline stall occurs not only when the unaligned load instruction encounters a cache miss, loads from the main memory, or performs an uncached load, but also on a cache hit.
As described in the foregoing, when an unaligned load instruction such as the LWL or LWR instruction is executed, the content held in the load destination register before the unaligned load instruction is executed must be stored to the Top register 103 and retained there until the data merge unit 810 refers to the merging data. Executing an unaligned load instruction usually requires two or more cycles to retrieve the returned data from the data cache 23, even on a cache hit. Therefore, until the data is retrieved from the data cache 23, the unaligned load instruction and subsequent instructions conflict over the Top register 103, which blocks execution of the subsequent instructions. On a cache miss or an uncached load, an even longer pipeline stall can occur. As stated above, even with the fetch queue 109 for non-blocking load, executing an unaligned load instruction prevents the non-blocking operation from continuing and eventually causes a pipeline stall.
For this reason, the aforementioned conventional microprocessor 8 is configured so that when an unaligned instruction evaluation unit 804 detects that an unaligned load instruction has been issued, the unaligned instruction evaluation unit 804 outputs busy signals to the instruction decode unit 13 and the reservation station 18, regardless of the status of the fetch queue 109, in order to block execution of subsequent instructions.
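The effect of the conflict over the Top register 103 can be illustrated with a deliberately simplified timing model. It assumes one instruction is issued per cycle, that an unaligned load occupies the single Top register for a fixed number of cycles (`cache_latency`, set to 2 here to reflect the "two or more cycles" mentioned above), and that every following load must wait until the register is released. The function name, the instruction labels, and the timing model itself are hypothetical and are not taken from FIG. 8 or FIG. 9.

```python
def issue_cycles(instrs, cache_latency=2):
    """Return the cycle in which each load issues under the toy model.

    instrs is a list of "load" (normal load) or "unaligned" (unaligned load).
    """
    cycle = 0
    top_free_at = 0            # first cycle in which the Top register is free
    issued = []
    for kind in instrs:
        cycle = max(cycle, top_free_at)   # stall while the Top register is held
        issued.append(cycle)
        if kind == "unaligned":
            # The merging data must stay in the Top register until the
            # cache data returns, so the register is held for the latency.
            top_free_at = cycle + cache_latency
        cycle += 1
    return issued


print(issue_cycles(["load", "load", "load", "load"]))
# -> [0, 1, 2, 3]: no stall
print(issue_cycles(["load", "unaligned", "load", "load"]))
# -> [0, 1, 3, 4]: one bubble after the unaligned load, even on a cache hit
print(issue_cycles(["load", "unaligned", "load", "load"], cache_latency=10))
# -> [0, 1, 11, 12]: a long latency (e.g. a cache miss) lengthens the stall
```

Under these assumptions the stall length grows with the time the merging data must be retained, which is why a cache miss or an uncached load produces a longer stall than a cache hit.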