A pipelining technique is typically used in a Reduced Instruction Set Computing (RISC) processor to divide instruction processing into a series of pipeline stages. As instructions flow through the instruction pipeline, each stage performs a different function. More than one instruction may be processed at the same time, with each instruction being processed in a different stage of the pipeline. Instructions advance through the pipeline stages at a clock rate that is determined by the slowest stage in the pipeline. A new instruction can be started every clock cycle, in contrast to a non-pipelined processor, in which processing of a new instruction cannot commence until processing of the previous instruction is complete. Processor throughput is a function of (i) pipeline stage clock speed; (ii) pipeline utilization or “efficiency” during normal execution; and (iii) the number of pipeline stalls. A superscalar RISC processor further increases throughput by allowing multiple instructions to be issued simultaneously and dispatched in parallel to multiple execution units.
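The relationship between stage delay, clock rate, and throughput described above can be illustrated with a minimal sketch. It assumes an idealized single-issue pipeline with no latch overhead; the function names, the stage delays, and the instruction count are hypothetical values chosen for illustration and do not come from any particular processor.

```python
def clock_period(stage_delays):
    """The pipeline clock is set by the slowest stage (idealized, no latch overhead)."""
    return max(stage_delays)


def non_pipelined_time(num_instructions, stage_delays):
    """Each instruction must pass through every stage before the next one starts."""
    return num_instructions * sum(stage_delays)


def pipelined_time(num_instructions, stage_delays, stall_cycles=0):
    """Fill the pipeline once, then complete one instruction per cycle, plus stalls."""
    if num_instructions == 0:
        return 0
    cycles = len(stage_delays) + (num_instructions - 1) + stall_cycles
    return cycles * clock_period(stage_delays)


# Hypothetical five-stage pipeline; delays are in arbitrary integer time units.
delays = [10, 12, 20, 15, 10]
print(non_pipelined_time(100, delays))               # 6700
print(pipelined_time(100, delays))                   # 2080
print(pipelined_time(100, delays, stall_cycles=30))  # 2680
```

In this model the three throughput factors appear directly: the clock period is fixed by the slowest stage, the fill time of the pipeline reduces utilization for short instruction sequences, and every stall cycle adds one full clock period to the total.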
An instruction pipeline often stalls due to resource constraints and inter-instruction data dependencies. An inter-instruction data dependency results in a stall when a later-issued instruction requires a result produced by an earlier instruction that has not yet completed. The later-issued instruction is thus stalled in the pipeline until the result of the earlier instruction is available. Stalls can also occur due to inadequate buffering of store data. Store ordering can be complicated in multi-core, cache-coherent memory chips/systems because coherent memory buses may be highly pipelined and may separate address, data, and commit operations.
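A data-dependency stall of the kind described above can be sketched with a simplified in-order, single-issue model. The `Instr` type, the `schedule` function, the register names, and the latencies are all hypothetical; the model ignores resource constraints, store buffering, and memory ordering, and only tracks when each result becomes available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str          # register written by this instruction
    srcs: tuple        # registers read by this instruction
    latency: int = 1   # cycles from issue until the result is available


def schedule(program):
    """Return (issue cycle of each instruction, total stall cycles), in program order."""
    ready_at = {}      # register -> cycle at which its latest value is available
    issue_cycles = []
    next_free = 0      # earliest cycle at which the next instruction could issue
    total_stalls = 0
    for ins in program:
        # An instruction cannot issue before all of its source operands are ready.
        operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        issue = max(next_free, operands_ready)
        total_stalls += issue - next_free
        ready_at[ins.dest] = issue + ins.latency
        issue_cycles.append(issue)
        next_free = issue + 1
    return issue_cycles, total_stalls


program = [
    Instr(dest="r1", srcs=("r0",), latency=3),     # e.g. a load with a 3-cycle latency
    Instr(dest="r2", srcs=("r1", "r1")),           # needs r1, so it must wait
    Instr(dest="r3", srcs=("r4", "r5")),           # independent of the load
]
print(schedule(program))  # ([0, 3, 4], 2)
```

The second instruction issues at cycle 3 rather than cycle 1 because its operand is not ready until then, which accounts for the two stall cycles. Because the model is in-order, the independent third instruction is also delayed behind it.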