Before an Ethernet network interface controller (NIC) receives a packet, the host pre-allocates a fixed-size memory buffer for the NIC to transfer the packet into. After the packet arrives at the NIC, the NIC transfers the packet to host memory via a direct memory access (DMA) write. The host computer looks only at the packet data written by the NIC, and ignores the state of memory outside the packet in its receive buffers.
Host memory systems must be written to in units of a cacheline, typically 64 or 128 bytes. To write a smaller value to memory, a host must read the cacheline, modify the data, and write it back to memory. This is called a read-modify-write operation.
When a peripheral device, such as a Peripheral Component Interconnect Express (PCIe) based NIC, performs a DMA write to a host computer system, and the transfer either starts at an address that is not aligned on a cacheline boundary or has a length that is not a multiple of the cacheline size, the host system must perform a read-modify-write cycle: it first reads the cacheline surrounding the modified data, modifies the cacheline to include the DMA transfer, and then writes the modified cacheline back to memory. This read-modify-write cycle wastes host memory bandwidth and can itself become a bottleneck that prevents a device from working at full speed, since the host system uses twice as much memory bandwidth as would otherwise be required.
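The alignment condition described above can be illustrated with a minimal sketch. The function name `partial_cachelines` and its parameters are hypothetical and chosen only for illustration. It assumes a simple model in which every cacheline that a DMA write covers only partially costs the host one read-modify-write cycle.

```python
def partial_cachelines(addr: int, length: int, cacheline: int = 64) -> int:
    """Count cachelines that a DMA write covers only partially.

    Under the simplified model assumed here, each such cacheline
    costs the host one read-modify-write cycle.
    """
    if length <= 0:
        return 0
    end = addr + length                     # one past the last byte written
    first_line = addr // cacheline          # index of the first cacheline touched
    last_line = (end - 1) // cacheline      # index of the last cacheline touched
    head_partial = addr % cacheline != 0    # write starts mid-cacheline
    tail_partial = end % cacheline != 0     # write ends mid-cacheline
    if first_line == last_line:
        # The whole write sits inside a single cacheline.
        return 1 if (head_partial or tail_partial) else 0
    return int(head_partial) + int(tail_partial)


print(partial_cachelines(0, 64))   # aligned, whole cacheline: 0
print(partial_cachelines(0, 65))   # one byte spills into a second cacheline: 1
print(partial_cachelines(8, 64))   # misaligned start, both ends partial: 2
```

In this model a write can trigger at most two read-modify-write cycles, one at each end, regardless of how long the transfer is.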
If a NIC wishes to avoid these read-modify-write operations on the host, it must pad received frames so that they start aligned on a cacheline boundary and are a multiple of the cacheline size in length. This wastes PCIe bandwidth: in the worst case, the device may be forced to send about 49% of its traffic as padding (65-byte packets padded to 128 bytes to accommodate a host with a 64-byte cacheline size).
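The worst-case figure can be checked with a short calculation. The helper names `padded_length` and `padding_fraction` are hypothetical. The sketch assumes each frame is padded up to the next multiple of the cacheline size and that nothing else is added to the transfer.

```python
def padded_length(length: int, cacheline: int = 64) -> int:
    """Round a frame length up to the next multiple of the cacheline size."""
    return -(-length // cacheline) * cacheline  # ceiling division, then scale


def padding_fraction(length: int, cacheline: int = 64) -> float:
    """Fraction of the padded transfer that consists of padding bytes."""
    padded = padded_length(length, cacheline)
    return (padded - length) / padded


print(padded_length(65))      # 128
print(padding_fraction(65))   # 0.4921875, i.e. 63 of 128 bytes are padding
```

A frame that is already a multiple of the cacheline size, such as 64 or 128 bytes, needs no padding in this model, so the overhead depends strongly on the distribution of frame lengths.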