1. Field
The present invention relates generally to memory data transfers, and more specifically, to memory copies in processor-based systems.
2. Background
Microprocessors perform computational tasks in a wide variety of applications. A typical microprocessor application includes one or more central processing units (CPUs) that execute software instructions. The software instructions instruct a CPU to fetch data from a location in memory, perform one or more CPU operations using the fetched data, and store or accumulate the result. The memory from which the data is fetched can be local to the CPU, within a memory “fabric,” and/or within a distributed resource to which the CPU is coupled. CPU performance is often expressed as a processing rate, that is, the number of operations that can be performed per second. The speed of the CPU can be increased by increasing the CPU clock rate. However, because many CPU applications require fetching data from the memory fabric, increasing the CPU clock rate without a corresponding decrease in memory fabric fetch time (latency) only increases the time the CPU spends waiting for fetched data to arrive.
For small copies, most memory copy algorithms spend more CPU time on function call, size comparison, and looping overhead than on the instructions that actually load data from and store data to memory. There is therefore a need in the art for more efficient copying of data from one location in memory to another location in memory.
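By way of illustration only, the overhead described above can be seen in a minimal byte-at-a-time copy routine. The sketch below is an assumption made for explanatory purposes and is not taken from any particular library; the function name naive_copy is hypothetical. For a copy of only a few bytes, the call and return, the per-iteration size test, and the counter update account for a large share of the instructions executed, while only one load and one store per iteration actually move data.

```c
#include <stddef.h>

/*
 * Hypothetical byte-at-a-time copy, shown only to make the sources of
 * overhead visible. It is not an optimized or recommended implementation.
 */
void *naive_copy(void *dst, const void *src, size_t n)
{
    /* Function call overhead: argument passing, call, and return. */
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (n != 0) {   /* size comparison, repeated on every iteration */
        *d++ = *s++;   /* the only load and store that move data */
        n--;           /* loop bookkeeping */
    }
    return dst;
}
```

In this sketch a three-byte copy performs three loads and three stores, but also four size comparisons, three counter decrements, six pointer increments, and one call and return.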