In current multi-processor systems, including Chip Multi-Processors, it is common for an input/output (I/O) device such as, for example, a network media access controller (MAC), a storage controller, a display controller, to generate temporary data to be processed by a processor core. Using traditional memory-based data transfer techniques, the temporary data is written to memory and subsequently read from memory by the processor core. Thus, two memory accesses are required for a single data transfer.
Because traditional memory-based data transfer techniques require multiple memory accesses for a single data transfer, these data transfers may be bottlenecks to system performance. The performance penalty can be further compounded by the fact that these memory accesses are typically off-chip, which results in further memory access latencies as well as additional power dissipation. Thus, current data transfer techniques result in system inefficiencies with respect to performance and power.