Memory copying is among the most frequently performed data processing operations. Such operations are initiated by software at the application level, at the operating system (OS) level, and by middleware. Memory copy operations are typically programmed as repeated load and store operations that copy data from one location in memory to another. This causes the data transfer to be staged through the Central Processing Unit (CPU or, more simply, “processor”), which makes the overall operation of the data processing system inefficient for the following reasons:
(1) the performance of the copy operation is limited by the available memory bandwidth, which is often insufficient to match the speed of the CPU;
(2) the data transfer is staged through the CPU via load and store instructions, tying up the CPU for the duration of the move operation and preventing it from working on other tasks;
(3) because the CPU is typically much faster than the memory subsystem, the CPU sits idle while it waits for data to arrive from memory.
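The load and store pattern described above can be sketched in C as follows. This is a minimal illustration only, not a routine taken from any particular system; the function name copy_bytes is hypothetical, and the source and destination regions are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: copies n bytes one at a time.
 * Each iteration performs one load (s[i]) and one store (d[i]),
 * so every byte is staged through a CPU register.
 * Assumes the two regions do not overlap.
 */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];   /* load from source, store to destination */
    }
    return dst;
}
```

A call such as copy_bytes(dst, src, len) occupies the processor for all len iterations, which is the inefficiency identified in items (1) through (3).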
As can be seen from the above discussion, memory copy operations are performance-sensitive procedures for applications, middleware, and operating systems. Many methods for performing memory copy operations cause the data to be staged through a CPU by means of repeated load and store instructions. As indicated above, operations of this nature tie up the CPU for a relatively long time, especially when large amounts of data are to be copied. Such operations are also slow because memory latency and memory bandwidth limitations hold overall transfer rates below what the CPU could otherwise sustain, thereby resulting in undesirable levels of performance.
Some solutions do exist for memory copy operations in real mode for pinned pages (and hence real memory addresses), but none exist for general use by applications, by middleware, and by operating systems. In other words, when a data processor is functioning in a virtual addressing mode, efficient memory copy operations are neither possible nor tolerated. Up until the advent of the present invention, efforts to improve memory copy operation efficiency were undertaken only when real addressing modes were employed, and even then “pinning of pages” was required. Pinning configures a portion of memory so that the data stored in it cannot be paged out. This ensures that page faults do not occur on accesses to the pinned memory (for example, a temporary buffer). Another problem is that typical implementations of the store operation cause the destination cache line to be fetched from memory even though the entire cache line is ultimately rewritten. This fetch wastes an undesirably large portion of the memory bandwidth.
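The effect of pinning can be approximated from user space on POSIX systems with mlock and munlock. The following is a sketch under stated assumptions and is not the real-mode mechanism referred to above: the function name copy_pinned is hypothetical, the copy still proceeds if the lock request is refused (for example, because of a resource limit), and some systems may require page-aligned addresses for mlock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>   /* POSIX mlock, munlock */

/*
 * Hypothetical helper: tries to pin the source and destination
 * regions in physical memory, copies n bytes, then unpins them.
 * Returns 1 if both regions were pinned during the copy and
 * 0 if the copy had to proceed without pinning.
 */
int copy_pinned(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL && n > 0);

    /* mlock returns 0 on success; it can fail under resource limits. */
    int src_pinned = (mlock(src, n) == 0);
    int dst_pinned = (mlock(dst, n) == 0);

    memcpy(dst, src, n);   /* non-overlapping regions assumed */

    if (src_pinned) {
        munlock(src, n);
    }
    if (dst_pinned) {
        munlock(dst, n);
    }
    return src_pinned && dst_pinned;
}
```

While the pages are locked they cannot be paged out, so accesses made during the copy do not incur page faults for paged-out data. The cost is that locked pages are unavailable to the rest of the system, which is one reason pinning is unsuitable as a general-purpose solution.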
Another source of inefficiency in traditional memory copying is poor data alignment. Typical computer systems are more efficient when loading and storing naturally aligned data. They are also more efficient when loading and storing larger granules of data (for example, 64-bit operations are more efficient than 32-bit operations). Unfortunately, a large class of application software does not behave well when it comes to the natural alignment of data with respect to the memory subsystem. Instead, most application software relies on routines supplied with the operating system (OS), such as bcopy or similar routines, to effect memory copy operations. The bcopy routine has no knowledge of the alignment behavior of the application and must therefore be designed to work efficiently under all alignment conditions.
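One way a general-purpose copy routine may cope with arbitrary alignment is sketched below in C. This is an illustrative assumption and not the actual bcopy implementation of any system; the name copy_aligned is hypothetical. The routine copies single bytes until the destination is 8-byte aligned, then moves 64-bit granules, then finishes with single bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical alignment-aware copy for non-overlapping regions.
 */
void *copy_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Head: single bytes until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d % sizeof(uint64_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /*
     * Body: one 64-bit granule per iteration. The fixed-size memcpy
     * calls express a 64-bit load and store without violating C
     * aliasing rules. The source may still be misaligned here; a
     * production routine would typically add a separate path for it.
     */
    while (n >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s, sizeof word);
        memcpy(d, &word, sizeof word);
        d += sizeof word;
        s += sizeof word;
        n -= sizeof word;
    }

    /* Tail: remaining bytes, fewer than one granule. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Because the routine cannot know in advance how the caller has aligned its buffers, it must test alignment at run time on every call, which adds overhead even when the data happens to be well aligned.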
Therefore a need exists to overcome the problems with the prior art as discussed above.