The dramatic increase in networking speeds is causing processors to spend an ever-increasing proportion of their time on networking tasks, leaving less time available for other work. High-end computing architectures are evolving from Symmetric Multi-Processor (SMP) based designs to designs that connect a number of cheap servers with high-speed communication links. Such distributed architectures typically require processors to spend a large amount of time processing data packets. Furthermore, emerging data storage solutions, multimedia applications, and network security applications are also causing processors to spend an ever-increasing amount of time on networking-related tasks.
These bandwidth-intensive applications typically use Transmission Control Protocol (TCP) and Internet Protocol (IP), which are standard networking protocols used on the Internet, and the socket Application Programming Interface (API), which is a standard networking interface used to communicate over a TCP/IP network.
In order to efficiently utilize the bandwidth of a high speed link, TCP uses a sliding window protocol which sends data segments without waiting for the remote host to acknowledge previously sent data segments. This gives rise to two requirements. First, TCP needs to store the data until it receives an acknowledgement from the remote host. Second, the application must be allowed to fill new data in the memory buffer so that TCP can use the sliding window protocol to fill up the “pipe.” Note that the system can satisfy both of these requirements by copying data between the user memory and the kernel memory. Specifically, copying data between the user memory and the kernel memory allows the application to fill new data in the user memory buffer, while allowing the kernel to keep a copy of the data in the kernel memory buffer until it receives an acknowledgement from the remote host.
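The two requirements above can be illustrated with a toy model. The following Python sketch is not an actual TCP implementation; the class `SlidingWindowSender`, its methods, and the fixed window of three segments are hypothetical names and values chosen only for illustration. It shows a sender that transmits without waiting for acknowledgements, retains its own copy of each segment until that segment is acknowledged, and leaves the application free to refill its buffer.

```python
from collections import OrderedDict


class SlidingWindowSender:
    """Toy model of a sender that retains unacknowledged segments."""

    def __init__(self, window_size):
        self.window_size = window_size   # maximum number of segments in flight
        self.next_seq = 0                # sequence number of the next new segment
        self.unacked = OrderedDict()     # seq -> retained copy of the segment data

    def can_send(self):
        return len(self.unacked) < self.window_size

    def send(self, data):
        """Send a segment without waiting for earlier ACKs; return its seq."""
        if not self.can_send():
            raise BlockingIOError("window full")
        seq = self.next_seq
        self.unacked[seq] = bytes(data)  # private copy kept for retransmission
        self.next_seq += 1
        return seq

    def ack(self, seq):
        """Cumulative ACK: release every segment up to and including seq."""
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]


sender = SlidingWindowSender(window_size=3)
buf = bytearray(4)                       # stands in for the user memory buffer
for fill in (b"aaaa", b"bbbb", b"cccc"):
    buf[:] = fill                        # application refills the same buffer
    sender.send(buf)                     # sender keeps its own copy until ACKed

in_flight_before = list(sender.unacked.values())
sender.ack(1)                            # remote host acknowledges segments 0 and 1
in_flight_after = list(sender.unacked.values())
```

In this model the call to `bytes(data)` plays the role of the copy from user memory to kernel memory: it is what allows the application to overwrite `buf` while earlier segments are still awaiting acknowledgement.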
Hence, in many systems, whenever data is written to (or read from) a socket (an endpoint for communication between two machines), the system copies the data from user memory to kernel memory (or from kernel memory to user memory). Unfortunately, this copy operation can become a bottleneck at high data rates.
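The copy semantics described above can be observed from an application. The following minimal Python sketch assumes a platform on which `socket.socketpair()` is available and uses a local socket pair rather than a real TCP connection. The application overwrites its buffer immediately after the write call returns, yet the peer still receives the original bytes, because the system took its own copy during the write.

```python
import socket

a, b = socket.socketpair()               # connected pair of local stream sockets
try:
    user_buf = bytearray(b"first message")
    sent = a.send(user_buf)              # system copies the bytes it accepts
    user_buf[:] = b"SECOND MESSAG"       # application reuses the buffer at once
    received = b.recv(sent)              # peer still sees the original data
finally:
    a.close()
    b.close()
```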
Note that, during a socket write or read operation, the system usually performs a Direct Memory Access (DMA) transfer to move the data between the system memory and a Network Interface Card (NIC). However, this data transfer is not counted as a “copy” because the DMA transfer has to be performed anyway (i.e., it has to be performed even if the data is not copied between the kernel memory and the user memory), and the DMA transfer does not burden the Central Processing Unit (CPU).
As discussed above, the system copies data from user memory to kernel memory or vice versa. In high-speed networks in which the capacity of a network link approaches or exceeds the CPU's processing capacity, the CPU can spend nearly all of its time copying transferred data, and thus becomes a bottleneck that limits the communication rate to below the link's capacity. As a result, a concept known as “zero-copy” has come into being, which describes computer operations in which the CPU does not perform the task of copying data from one memory area to another. Zero-copy operations can also reduce the number of time-consuming mode switches between user space and kernel space. System resources are utilized more efficiently, since using a sophisticated CPU to perform extensive copy operations, a relatively simple task, is wasteful if simpler system components can do the copying.
To further optimize this approach, a strategy commonly referred to as “copy-on-write” is implemented so that data need not be copied between the user memory and the kernel memory or vice versa. In the underlying kernel socket layer in kernel space (the kernel socket layer provides a convenient abstraction for communicating with remote applications), the kernel may mark the buffer in user space as copy-on-write. When the buffer is marked “copy-on-write,” the pages in the buffer are treated as if they were read-only. When an attempt is made to write data to these pages, the Memory Management Unit (MMU) raises an exception, which is handled by the kernel; the kernel allocates new space in physical memory to store the new data. Furthermore, control data is placed in a buffer in kernel space. If the application does not modify the buffer in user space until the underlying TCP layer (which contains networking applications that communicate with other networking applications over a network) receives the acknowledgement that the data has been successfully transferred to the remote host, then the kernel socket buffer is freed and the copy-on-write mapping is revoked. As a result, the buffer in user space is never copied.
However, if the application modifies the buffer in user space prior to the receipt of the acknowledgement, then the buffer in user space needs to be copied, and the benefit of the copy-on-write strategy is lost. In many applications, the application reuses the buffer in user space immediately, because it has no means of knowing how long the buffer in user space remains marked copy-on-write.
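One widely available interface in this spirit is the `sendfile` system call, which asks the kernel to move data from a file to a socket without passing it through a user memory buffer. The following Python sketch is offered as an illustration only and is not the mechanism of any particular system discussed here. It uses the standard library method `socket.sendfile()`, which relies on `os.sendfile()` where the platform provides it and otherwise falls back to ordinary `send()` calls, so whether a copy is actually avoided depends on the platform. The payload and the use of a local socket pair are assumptions made to keep the example self-contained.

```python
import socket
import tempfile

payload = b"zero-copy payload\n" * 64    # small enough to fit in socket buffers

with tempfile.TemporaryFile() as f:      # regular file opened in binary mode
    f.write(payload)
    f.flush()
    f.seek(0)
    sender, receiver = socket.socketpair()
    try:
        # The application never reads the file contents into its own buffer.
        sent = sender.sendfile(f)
        sender.close()                   # signal end of stream to the peer
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        sender.close()
        receiver.close()

received = b"".join(chunks)
```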
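The trade-off described in the last two paragraphs can be sketched with a toy model. The following Python code does not manipulate real page tables; the class `CowSendBuffer` and its methods are hypothetical names introduced only for illustration, and the "write fault" is simulated by an explicit check in `write()`. The model counts how many fallback copies are made, depending on whether the application writes to the buffer before or after the acknowledgement arrives.

```python
class CowSendBuffer:
    """Toy model of a user buffer marked copy-on-write during a send."""

    def __init__(self, data):
        self.user = bytearray(data)  # application-visible buffer
        self.in_flight = None        # data the network stack may retransmit
        self.cow = False             # True while the buffer is marked copy-on-write
        self.copies = 0              # number of fallback copies performed

    def send(self):
        # The "kernel" shares the user buffer instead of copying it.
        self.in_flight = self.user
        self.cow = True

    def write(self, offset, data):
        if self.cow:
            # Simulated write fault: preserve the in-flight bytes with a copy
            # before the application's modification takes effect.
            self.in_flight = bytes(self.user)
            self.cow = False
            self.copies += 1
        self.user[offset:offset + len(data)] = data

    def ack(self):
        # Acknowledgement received: release the data and revoke the marking.
        self.in_flight = None
        self.cow = False


patient = CowSendBuffer(b"hello")
patient.send()
patient.ack()
patient.write(0, b"J")               # after the ACK: no copy is needed

eager = CowSendBuffer(b"hello")
eager.send()
eager.write(0, b"J")                 # before the ACK: the copy cannot be avoided
```

Here `patient.copies` stays at zero, while `eager.copies` becomes one, which mirrors the case in which an application that reuses its buffer immediately loses the benefit of the copy-on-write marking.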