High-performance computing (HPC) applications typically execute calculations on computing clusters that include many individual computing nodes connected by a high-speed network fabric. Typical computing clusters may include hundreds or thousands of individual nodes. Each node may include one or more many-core processors, co-processors, processing accelerators, or other parallel computing resources. A typical computing job therefore may be executed by a large number of individual processes distributed across the computing resources of each computing node and across the entire computing cluster.
Processes within a job may exchange data with each other using a message-passing communication paradigm. For computing clusters using high-speed network fabrics, an increasingly large proportion of message processing time may be attributable to internal latency associated with moving message data across I/O or memory buses of the individual computing node. Thus, overall performance may be improved by improving communication locality, that is, by delivering network data closer to the processor core or other computing resource of the computing node that will consume the data. Current technologies such as Intel® Data Direct I/O (DDIO) allow I/O devices such as network controllers to place data directly in a shared last-level cache, bypassing main memory. However, DDIO may be unavailable in systems that lack a shared last-level cache.
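The observation that node-internal latency may account for a growing proportion of message processing time can be illustrated with simple arithmetic. The following is a minimal sketch only: the function name `internal_latency_share` and all latency figures are hypothetical values chosen for illustration, not measurements of any fabric or platform, and the model assumes that message processing time is simply the sum of a fabric component and a node-internal component.

```python
def internal_latency_share(fabric_ns: float, internal_ns: float) -> float:
    """Return the fraction of total message latency spent inside the node.

    Assumes a simplified additive model: total = fabric + internal.
    Both arguments are hypothetical latencies in nanoseconds.
    """
    if fabric_ns < 0 or internal_ns < 0:
        raise ValueError("latencies must be non-negative")
    total = fabric_ns + internal_ns
    if total == 0:
        raise ValueError("total latency must be positive")
    return internal_ns / total


if __name__ == "__main__":
    # Hypothetical node-internal latency (I/O and memory bus traversal),
    # held constant while the fabric becomes faster.
    internal_ns = 100.0
    for fabric_ns in (900.0, 400.0, 100.0):
        share = internal_latency_share(fabric_ns, internal_ns)
        print(f"fabric={fabric_ns:.0f} ns, internal={internal_ns:.0f} ns, "
              f"internal share={share:.0%}")
```

Under this simplified model, holding the node-internal latency constant while the fabric latency decreases raises the internal share of the total, which is why delivering data closer to the consuming core may matter more as network fabrics become faster.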