Modern high-performance processors execute loads and stores out of order. To avoid an error, a load needs to read the data written by an older store to the same address in memory. For example, if a load attempts to execute from a particular address while an older store to that same address is still waiting to be written, the execution of the load is stalled until the older store is written to memory. However, to enhance performance, a conventional load store unit (LSU) in the processor forwards the data from the store to the load, without waiting for the store data to be written to memory.
FIG. 1 shows a conventional out-of-order executing processor 100. The processor 100 may be one of many processor cores that are combined on a single semiconductor chip. The processor 100 includes an instruction fetch unit (IFU) 105, an instruction decoder 110, an instruction control unit (ICU) 115, a register file 120, an arithmetic and logic unit (ALU) 125, a write back unit (WBU) 130, an address generation unit (AGU) 135, a load/store unit (LSU) 140, and a data cache (DC) 145.
Still referring to FIG. 1, as an ongoing process, the IFU 105 sends instruction/address requests 150 that request an external memory 155 to send instruction bytes 160 from particular addresses. The IFU 105 outputs the instruction bytes 160 to the instruction decoder 110, which decodes the instruction bytes 160 such that each instruction is uniquely identified by a certain combination of bits. The instruction decoder 110 has the knowledge to interpret these instructions. For example, the instruction decoder 110 may determine whether an instruction is performing an “add” or a “multiply” function.
The instruction decoder 110 feeds to the ICU 115 a series of decoded instructions 162 that are to be executed in a particular order. The ICU 115 orchestrates (i.e., schedules) the execution of the decoded instructions 162. The decoded instructions 162 may be executed out of order to enhance performance. The ICU 115 maintains an in-order table of all of the decoded instructions 162 that the ICU 115 receives from the instruction decoder 110 until particular ones of the decoded instructions 162 are retired.
The ICU 115 outputs ordered decoded instructions 164 to the register file 120. The register file 120 provides operands 166 for executing the ordered decoded instructions 164 to the ALU 125 and the AGU 135, and provides store data 168 to the LSU 140. The ALU 125 executes simple instructions that do not involve memory, (i.e., instructions which are purely arithmetical or purely logical), and outputs execution results 170 to the WBU 130.
The WBU 130 receives the execution results 170 and outputs feedback execution results 172 to the register file 120, after determining which addresses in the register file 120 to store the feedback execution results 172. The AGU 135 computes the address for loads and stores, and outputs a load/store byte mask (BM) signal 174, a load/store address signal 176 and a load/store data size signal 178 to the LSU 140.
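The relationship between the load/store address, the data size and the byte mask can be illustrated with a minimal sketch. The word width of 8 bytes, the function name `byte_mask` and the restriction to accesses that stay within one word are assumptions made for illustration only; the description above does not specify them.

```python
WORD_BYTES = 8  # assumed word width; not stated in the description


def byte_mask(address: int, size: int) -> tuple[int, int]:
    """Return (word_aligned_address, mask) for an access within one word.

    Bit i of the mask is set when byte i of the aligned word is accessed.
    """
    offset = address % WORD_BYTES
    if size < 1 or offset + size > WORD_BYTES:
        # Accesses that span two words are outside the scope of this sketch.
        raise ValueError("access must cover 1..WORD_BYTES bytes of one word")
    mask = ((1 << size) - 1) << offset
    return address - offset, mask


# A 4-byte access starting at byte 2 of the word at 0x1000.
aligned, mask = byte_mask(0x1002, 4)
print(hex(aligned), bin(mask))
```

In this sketch, two accesses touch a common byte exactly when their aligned addresses are equal and the bitwise AND of their masks is nonzero.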
There are two different categories of instructions which are executed by the processor 100: 1) arithmetic and logic instructions; and 2) load/store instructions.
Arithmetic and logic instructions, such as “ADD”, “SUB” and “AND”, are executed by the ALU 125. These instructions read their operands 166 from the register file 120 and write their results back to the register file 120. These results become input operands to subsequent instructions. These instructions typically have a fixed latency, and the number of clock cycles needed to execute them is known in advance.
Load/store instructions, which involve reading from memory (loads) and writing to memory (stores), are executed by the LSU 140. The AGU 135 generates the address for a load instruction from which the data is read, and the address of a store instruction to which the data is written. Load/store instructions typically have a variable latency.
The LSU 140 outputs store data 182 to the DC 145 to write data, and outputs an address 184 to the DC 145 to read data. Furthermore, the LSU 140 signals completion of the execution of instructions by sending a load/store complete signal 186 to the ICU 115. Once these instructions have finished execution, the ICU 115 will eventually “retire” them by sending a retire signal 188 to the LSU 140, in order to give an appearance of in-order execution. From a programmer's point of view, instructions are considered executed once they have been retired. The DC 145 outputs load data 190 to the LSU 140, which then outputs load data 192 to the register file 120.
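The way out-of-order completion is turned into in-order retirement can be sketched as follows. This is a simplified model under stated assumptions, not the ICU 115 itself: the class `RetireTable`, its method names and the use of string tags to identify instructions are all hypothetical.

```python
from collections import deque


class RetireTable:
    """Toy in-order table: instructions complete in any order, retire in order."""

    def __init__(self) -> None:
        self._order: deque[str] = deque()  # program order, oldest on the left
        self._done: set[str] = set()       # tags that have signaled completion

    def dispatch(self, tag: str) -> None:
        self._order.append(tag)

    def complete(self, tag: str) -> None:
        self._done.add(tag)

    def retire_ready(self) -> list[str]:
        """Retire the oldest instructions while each one has completed."""
        retired = []
        while self._order and self._order[0] in self._done:
            tag = self._order.popleft()
            self._done.discard(tag)
            retired.append(tag)
        return retired


table = RetireTable()
for tag in ("load_a", "store_b", "load_c"):
    table.dispatch(tag)
table.complete("load_c")     # youngest finishes first
print(table.retire_ready())  # nothing retires: load_a is still pending
table.complete("load_a")
print(table.retire_ready())  # only load_a retires; store_b blocks load_c
```

An instruction that completes early simply waits in the table, which is what gives the appearance of in-order execution.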
FIG. 2 shows the details of the LSU 140 in the conventional processor 100 of FIG. 1. As shown in FIG. 2, the LSU 140 includes a load/store queue (LSQ) 205, a store-to-load interlocking (STLI) content-addressable memory (CAM) 210, a store-to-load forwarding (STLF) CAM 215, a first priority encoder (PE) 220, a second PE 225, a store data buffer (SDB) 230, a multiplexer (MUX) 235 and an alignment unit 240. Loads and stores are executed by allowing them to flow within an LSU pipeline, which is provided by the LSU 140. At the end of the flow, loads successfully complete if they can return valid data (i.e., load data 192) to the register file 120. If a load fails to complete, the load may be returned to the beginning of the flow, (i.e., as operands 166 input into the AGU 135), one or more times until the load successfully returns valid data.
Typically, a store instruction, (i.e., a store), takes the result of a prior computation, saved in the register file 120, and writes it to the DC 145. Stores are executed in two phases: a pre-retire phase and a post-retire phase. In the pre-retire phase, the address of a particular store is computed out of order by the AGU 135 and sent to the LSQ 205, the STLI CAM 210 and the STLF CAM 215 via the load/store address signal 176.
The store instruction that is the oldest memory instruction in the LSQ 205 is executed when the LSQ 205 receives a store address, (the memory location to which the store data 168 needs to be written), via the load/store address signal 176 and a store data size via the load/store data size signal 178. The store address is recorded in the LSQ 205, the STLI CAM 210 and the STLF CAM 215. The store data size is recorded in the LSQ 205 and the STLF CAM 215. The LSQ 205 outputs an address 184 and a BM 245, which are recorded in the STLI CAM 210. The address 184 is also recorded in the STLF CAM 215. The store data 168 is also independently read from the register file 120 when it is ready and is written in the SDB 230.
When all older loads and stores have sent load/store complete signals 186 to the ICU 115, the marked store completes the pre-retire phase of execution and sends its store complete signal 186 to the ICU 115. Upon receiving a load/store complete signal 186, the ICU 115 will retire the store, after all of the older load/store instructions have retired. The ICU 115 will send a retire signal 188 to the LSQ 205 for the currently retired store.
When the LSQ 205 receives the retire signal 188, the marked store enters the post-retire phase from the pre-retire phase. In this phase, the oldest store reads the SDB 230, writes the store data 182 to the DC 145, and is then removed from the LSQ 205. This completes the execution of the post-retire phase and is also referred to as committing the store.
Typically, a load instruction, (i.e., a load), reads data from the DC 145 and writes data to the register file 120. However, if there is at least one older store to the same address as the load in the LSQ 205, the load is required to read the store data from the SDB 230, rather than from the DC 145, since the SDB 230 contains the latest data about to be written. Such a store is also referred to as an overlapping store if there are one or more bytes that the older store is writing to and which the load needs to read in order to correctly execute. A store is partially overlapping if it writes only some of the bytes which the load needs.
When the load receives its address 176 from the AGU 135, the address is recorded in the LSQ 205 along with the data size 178. When the execution of the load is scheduled by the LSQ 205, the address 184 and data size 250 of the load are compared against all entries in the STLF CAM 215 to determine whether there are any older stores to the same address as that of the load. The STLF CAM 215 includes a table with two fields: a first field indicating a starting word address of the store; and a second field indicating the data size of the store. The load may have one or more matches with older uncommitted stores, (i.e., stores not yet written to the DC 145), to the same address as that of the load and whose size is at least as large as that of the load. An indication of each of these matches is included in an STLF hit signal 260 that is output by the STLF CAM 215 to the PE 225. The PE 225 computes the youngest of the older matching uncommitted stores and outputs an STLF hit entry signal 275 to the SDB 230 to select a store entry that will provide the data for the load. The SDB 230 outputs data 280 to a first input of the MUX 235. The DC 145 outputs load data 190 to a second input of the MUX 235. On a hit in the STLF CAM 215, the MUX 235 is controlled by a store data source select signal 285 output by the LSQ 205 to allow the data 280 to become the load data, instead of the load data 190. The output 290 of the MUX 235 is a word having multiple bytes. The alignment unit 240 is controlled by a DC data alignment signal 295 to shift the bytes in the output 290 based on which byte is to be read.
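The STLF lookup and the priority encoding described above can be approximated in software as follows. This is a sketch under stated assumptions rather than a description of the hardware: the `StoreEntry` fields, the use of an integer `age` to represent program order (smaller means older) and the function name `stlf_hit_entry` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    age: int                 # program order; smaller is older (assumed encoding)
    address: int             # starting address of the store
    size: int                # number of bytes written
    committed: bool = False  # True once written to the data cache


def stlf_hit_entry(stores: List[StoreEntry], load_age: int,
                   load_addr: int, load_size: int) -> Optional[int]:
    """Return the index of the youngest older uncommitted store that can forward."""
    hits = [
        i for i, s in enumerate(stores)
        if not s.committed
        and s.age < load_age          # only stores older than the load
        and s.address == load_addr    # exact starting-address match
        and s.size >= load_size       # store covers every byte of the load
    ]
    if not hits:
        return None                   # no hit: the load cannot forward
    return max(hits, key=lambda i: stores[i].age)  # priority-encoder step


stores = [
    StoreEntry(age=0, address=0x100, size=8),
    StoreEntry(age=1, address=0x100, size=4),
    StoreEntry(age=5, address=0x100, size=8),  # younger than the load below
]
print(stlf_hit_entry(stores, load_age=3, load_addr=0x100, load_size=4))
```

In hardware the comparisons are performed in parallel across all CAM entries; the list comprehension only models the result of that comparison.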
However, it is quite possible that not all bytes needed to be read by the load are being written by the store, resulting in a partial overlap between the load and the store. When there is a partial overlap between the bytes written by the store and the bytes read by the load, the store and the load do not have an exact match in their starting address, and/or the store size is smaller than the load size. Thus, the STLF CAM 215 will not generate the STLF hit signal 260 and the load will not be able to successfully forward from the store, resulting in a failed execution of the load.
In some cases, all the bytes needed by the load may be supplied by an older store. However, the starting address between the load and the store may not match. In this case also, the STLF CAM 215 will not generate an STLF hit signal 260, resulting in a failed execution of the load.
The STLI CAM 210, similar to the STLF CAM 215, includes a table with two fields: a first field indicating the starting word aligned address of the store; and a second field indicating each of the bytes within the starting word aligned address. Along with the lookup of the STLF CAM 215, the STLI CAM 210 is looked up by the load, and all of its entries that are older than the load are compared. If there is a word aligned address match and at least one byte match between the load and any older store entry, then the entries overlap. An STLI hit signal 255 indicating each overlap between an older store and the load is output to the PE 220, which identifies the youngest of the overlapping older stores. The PE 220 outputs an STLI hit entry signal 265 which, if it is not the same as the STLF hit entry signal 275, indicates that the load cannot forward from the store because there is only a partial overlap and/or the starting address of the load and the store is not the same. The load marks this store, from which it is unable to forward, as an interlocking store. The load will be unable to complete execution successfully and will be stalled until the marked interlocking store identified by the STLI hit entry signal 265 commits (is written into the DC 145).
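The comparison between the STLI hit entry and the STLF hit entry can be illustrated with a small model. It is a sketch under several assumptions that the description does not state: an 8-byte word, accesses that do not cross a word boundary, a store list that contains only uncommitted stores, and the hypothetical names `Access`, `classify_load` and the result strings "read_cache", "forward" and "interlock".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

WORD_BYTES = 8  # assumed word width


@dataclass
class Access:
    age: int      # program order; smaller is older (assumed encoding)
    address: int  # starting byte address
    size: int     # number of bytes


def word_and_mask(a: Access) -> Tuple[int, int]:
    """Word aligned address and byte mask (assumes no word-boundary crossing)."""
    offset = a.address % WORD_BYTES
    return a.address - offset, ((1 << a.size) - 1) << offset


def youngest_older(stores: List[Access], load: Access,
                   matches: Callable[[Access], bool]) -> Optional[int]:
    """Priority-encoder step: index of the youngest matching older store."""
    hits = [i for i, s in enumerate(stores) if s.age < load.age and matches(s)]
    return max(hits, key=lambda i: stores[i].age) if hits else None


def classify_load(stores: List[Access], load: Access) -> str:
    load_word, load_mask = word_and_mask(load)

    def overlaps(s: Access) -> bool:      # STLI-style match
        word, mask = word_and_mask(s)
        return word == load_word and (mask & load_mask) != 0

    def forwardable(s: Access) -> bool:   # STLF-style match
        return s.address == load.address and s.size >= load.size

    stli = youngest_older(stores, load, overlaps)
    stlf = youngest_older(stores, load, forwardable)
    if stli is None:
        return "read_cache"               # no older store overlaps the load
    return "forward" if stli == stlf else "interlock"


# The 8-byte store supplies every byte of the load, but the start addresses differ.
print(classify_load([Access(0, 0x100, 8)], Access(1, 0x104, 4)))
```

The example call corresponds to the case in which all the bytes needed by the load are supplied by an older store whose starting address differs from that of the load: the overlap check hits, the forwarding check does not, and the load interlocks.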
The conventional STLF scheme described above enables uncommitted store data to be forwarded to a load, as long as the store starting address matches that of the load, and the store size is greater than or equal to that of the load. Any other overlapping store, from which a load is unable to forward, results in the load interlocking with the store and causes the load to stall execution until the store commits, potentially resulting in a degradation in performance.
The STLF logic is one of the critical paths in the processor 100, and hence may limit the maximum frequency to which the processor 100 may be scaled. As a consequence, it is necessary to keep the logic simple, so that it does not inhibit frequency scaling.