In order to reduce the latency associated with accessing data stored in main memory, processors typically use a memory hierarchy which comprises one or more caches. There are typically two or three levels of cache, denoted L1, L2 and L3. In some examples the first two caches (L1 and L2) may be on-chip caches, which are usually implemented in SRAM (static random access memory), and the third level of cache (L3) may be an off-chip cache. In other examples, such as in a System on Chip (SoC), all the memory may be implemented in the same piece of silicon. The caches are smaller than the main memory, which may be implemented in DRAM (dynamic random access memory), but the latency involved with accessing a cache is much shorter than for main memory, and becomes shorter at each level of the hierarchy closer to the processor. As the latency is related, at least approximately, to the size of the cache, a lower level cache (e.g. L1) is typically smaller than a higher level cache (e.g. L2), using the convention that the L1 cache is the lowest level cache.
When a processor, or more particularly an ALU (arithmetic logic unit) within a processor, accesses a data item, the data item is accessed from the first level in the hierarchy where it is available (i.e. from the level closest to the processor where it is available). For example, a look-up will be performed in the L1 cache and if the data is in the L1 cache, this is referred to as a cache hit. If, however, the data is not in the L1 cache, this is a cache miss and the next levels in the hierarchy are checked in turn until the data is found (e.g. the L2 cache, followed by the L3 cache if the data is also not in the L2 cache). In the event of a cache miss, the data is brought into the cache in which the miss occurred. The traversing of the memory hierarchy which results from a cache miss in the lowest level cache (e.g. L1 cache) introduces latency.
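By way of illustration only, the look-up sequence described above may be sketched as a small software model. The class name `CacheLevel`, the function `access`, and all latency figures are hypothetical and are chosen purely for the example; the model ignores cache capacity, eviction and cache line granularity.

```python
from dataclasses import dataclass, field

# Hypothetical cost, in cycles, of reaching main memory after all caches miss.
MAIN_MEMORY_LATENCY = 200


@dataclass
class CacheLevel:
    name: str
    latency: int                      # hypothetical look-up cost in cycles
    lines: set = field(default_factory=set)  # addresses currently held


def access(levels, address):
    """Look up `address` level by level, starting at the lowest level cache.

    Returns (name of the level that supplied the data, accumulated latency).
    """
    total = 0
    for hit_index, level in enumerate(levels):
        total += level.latency        # every look-up costs time, hit or miss
        if address in level.lines:
            found = level.name        # cache hit at this level
            break
    else:
        # Every cache missed, so the data comes from main memory.
        total += MAIN_MEMORY_LATENCY
        found = "main memory"
        hit_index = len(levels)

    # On a miss, the data is brought into each cache that missed.
    for missed in levels[:hit_index]:
        missed.lines.add(address)
    return found, total


hierarchy = [CacheLevel("L1", 4), CacheLevel("L2", 12), CacheLevel("L3", 40)]
first = access(hierarchy, 0x1000)     # misses in every cache
second = access(hierarchy, 0x1000)    # now hits in the L1 cache
```

In this model the first access to an address pays the look-up cost of every level plus the main memory cost, whereas a repeated access is satisfied by the L1 cache alone, which is the latency difference the paragraph above describes.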
There are many scenarios where a processor is required to copy data elements, and in particular memory ranges (i.e. groups of consecutively addressed memory locations which hold data elements), from one location in memory (i.e. main memory) to another. The latency associated with traversing the memory hierarchy (described above) reduces the speed with which this copying of data elements can be achieved, and the speed with which memory ranges can be copied may be seen as a performance indicator (or benchmark) for the processor.
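As a minimal sketch of what copying a memory range involves, the following example models memory as a `bytearray` and copies a range one element at a time. The function name `copy_range` and its parameters are hypothetical. Each loop iteration corresponds to one load and one store, and in a real processor each of these may incur the hierarchy traversal described above.

```python
def copy_range(memory, src, dst, length):
    """Copy `length` consecutive elements from offset `src` to offset `dst`."""
    if src < 0 or dst < 0 or length < 0:
        raise ValueError("offsets and length must be non-negative")
    if src + length > len(memory) or dst + length > len(memory):
        raise ValueError("range extends past the end of memory")

    if dst > src:
        # The destination may overlap the tail of the source, so copy
        # backwards to avoid overwriting elements not yet read.
        indices = reversed(range(length))
    else:
        indices = range(length)

    for i in indices:
        memory[dst + i] = memory[src + i]   # one load followed by one store


memory = bytearray(range(16))
copy_range(memory, 0, 8, 4)   # copy elements 0..3 to offsets 8..11
```

The element-by-element loop is chosen here only to make the individual memory accesses visible; it is not intended to represent how any particular processor or library performs the copy.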
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known processors.