The REP MOVS instruction is one of the more frequently executed instructions of the x86 instruction set architecture. It instructs the microprocessor to move a string of data from a source location in memory to a destination location in memory, and it has been implemented in microcode. If the number of bytes to be moved is relatively large, the microcode employs a “fast string move” microcode routine to implement the instruction. The fast string move code performs a series of load-store micro-op pairs. Because larger loads and stores are more efficient, the fast string move code attempts to perform large loads and stores (e.g., 16 bytes), i.e., loads and stores that are larger than the size of each data element specified by the REP MOVS[B/W/D/Q] (i.e., byte, word, double-word, or quad-word).
However, the loads typically miss in the cache, which makes the REP MOVS relatively slow: the system memory accesses that read the cache lines specified by the loads have a long latency.
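The load-store pairing described above can be illustrated in software. The following C function is only a rough model, not the actual microcode: the name `fast_string_move`, the constant `BLOCK_BYTES`, and the returned pair count are hypothetical and introduced here for illustration. It assumes a forward move over non-overlapping source and destination regions, and it uses 16 bytes as the wide access size, following the example size given above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed width of one wide load or store, taken from the 16-byte example. */
enum { BLOCK_BYTES = 16 };

/*
 * Hypothetical software model of a fast string move.
 * Moves `count` bytes from `src` to `dst` (assumed non-overlapping, forward
 * direction) and returns the number of wide load-store pairs performed.
 */
size_t fast_string_move(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t pairs = 0;

    assert(count == 0 || (d != NULL && s != NULL));

    /* Bulk phase: each iteration models one load-store micro-op pair. */
    while (count >= BLOCK_BYTES) {
        unsigned char block[BLOCK_BYTES];
        memcpy(block, s, BLOCK_BYTES); /* wide load; may miss in the cache */
        memcpy(d, block, BLOCK_BYTES); /* wide store */
        s += BLOCK_BYTES;
        d += BLOCK_BYTES;
        count -= BLOCK_BYTES;
        pairs++;
    }

    /* Tail phase: move the remaining bytes one at a time. */
    while (count > 0) {
        *d++ = *s++;
        count--;
    }

    return pairs;
}
```

In this model, a 40-byte move performs two wide load-store pairs followed by eight single-byte moves. Each wide load in the bulk phase corresponds to a load that may have to wait on system memory if its cache line is not already present.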