Traditionally, as shown in FIG. 1(a), a computer system 10 including a processor device (CPU) executes an Operating System (O/S) 12 natively on the computer hardware. The O/S is adapted for executing basic computer system functions in addition to controlling execution of one or more programs or processes 15. The computing system further includes two types of memory: a ‘real’ memory that comprises the actual physical memory (e.g., IC chips plugged into the computer) that is of a fixed size; and a ‘virtual’ memory that is an abstraction provided by the O/S and memory management hardware (not shown), primarily for application usage. It is understood that part of virtual memory is backed by (mapped onto) real memory, and part may be backed by non-volatile storage media (e.g., one or more hard disk drives), allowing virtual memory size to exceed real (physical) memory size. A process 15 executing in the computer system 10 will thus have an associated virtual address space that is the logical view of how the process is stored in memory. The process may be located at a certain virtual address and may exist in contiguous virtual memory. However, the corresponding physical memory may be organized according to a variety of memory organization schemes, e.g., pages or frames, and such entities assigned to a process may not be contiguous. The use of virtual memory thus requires memory management hardware implementing a translation table or like memory mapping unit (not shown) that maps program (virtual) addresses or pages to corresponding physical memory addresses or pages in real memory. It is a function of the O/S 12 to ensure that the data and process a program is currently using are resident in real physical memory, and that the translation table can map the virtual address to the real address correctly.
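The virtual-to-physical mapping described above can be illustrated with a minimal sketch. The page size, the dictionary-based page table, and the names `translate` and `PageFault` are assumptions made for illustration only; an actual memory management unit performs this lookup in hardware.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when a virtual page has no resident physical frame."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical address via a page table.

    page_table maps virtual page number -> physical frame number, or
    None when the page is currently backed only by non-volatile storage.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table.get(page_number)
    if frame is None:
        # The O/S would have to bring the page into real memory first.
        raise PageFault(f"virtual page {page_number} is not resident")
    return frame * PAGE_SIZE + offset


# Contiguous virtual pages 0..2 backed by non-contiguous frames 7 and 3;
# page 2 is paged out (hypothetical layout).
page_table = {0: 7, 1: 3, 2: None}
```

In this sketch, adjacent virtual pages 0 and 1 resolve to frames 7 and 3, showing that contiguity in the virtual address space does not imply contiguity in real memory.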
The traditional computing system shown in FIG. 1(a) is configured to receive and send network packet communications over computer network connections according to various communications protocols, e.g., connection-oriented protocols such as Transmission Control Protocol (TCP/IP), or connectionless protocols such as User Datagram Protocol (UDP), as used by, e.g., the Domain Name System (DNS). Typically the computing system is provisioned with a network interface card (NIC) 20 that is designed to provide for computer communication over a computer network 99, e.g., a local or wide area network, a private network, or a public network such as the Internet. When a network packet 25 (comprising header and data portions) arrives at a NIC 20, the NIC moves the data into a pre-allocated location or buffer in the computer device's main memory.
Assuming that the network communications protocol (e.g., TCP, STCP, SCCP, TCP/IP) in use requires packet header checksums, the data may first be read so that it can be checksummed to ensure integrity of the communicated packet. As network state information 30, including the network connections established for receiving communications from the network, is maintained at the O/S, the O/S determines the destination process of the received packet from the header portion of the packet. Once the header (H) portion of the network packet has been fully processed and data checksums performed, the data (D) portion of the packet is then moved by the CPU into its final destination in memory for use by a receiving process 15, e.g., a web browser or other application. This final location may comprise a network I/O (socket) interface (the network communications end-point address). In all but the original move from the NIC to system memory, the CPU is responsible for moving the data, which imposes a substantial processing overhead.
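As an illustration of the checksumming step, the following sketch computes the 16-bit one's-complement checksum of the kind used by IP, TCP and UDP. It is a simplified example rather than the method of any particular protocol stack: the function name `internet_checksum` is an assumption, and the TCP pseudo-header is omitted.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(sample)
```

A receiver can verify integrity by running the same computation over the data with the checksum appended; an undamaged packet yields zero. Because every byte must be read to compute the sum, performing this on the CPU adds to the overhead noted above.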
Likewise, using well known computer system O/S and network I/O mechanisms, to send data over a network from an executing process 15, the computing system's O/S will first receive a copy of the data (D) from the process, assemble one or more network packets 25 (each having header (H) and data (D) or payload portions), and perform a copy to the NIC 20, which forwards the packet over the network 99 according to a particular network protocol. Particularly, to write data to a network, an application will open a socket interface that represents an end point, and write the data to the socket interface. The data goes through the network protocol stack, is modified by the O/S (e.g., by adding protocol headers and/or breaking the data down into smaller packet segments) and is finally passed on to a driver representing the network interface controller (NIC) hardware. The NIC finally sends the data over the network.
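The conventional send path, in which the O/S copies the process data and breaks it into packets with headers, may be sketched as follows. The 10-byte header layout, the segment size, and the function name `build_packets` are hypothetical and chosen only for illustration; they do not represent a real TCP header.

```python
import struct

MSS = 1460  # assumed maximum payload bytes per packet
HEADER_FORMAT = "!HHIH"  # src port, dst port, byte offset, payload length


def build_packets(payload: bytes, src_port: int, dst_port: int, mss: int = MSS):
    """Copy the payload into packets, each prefixed with a toy header."""
    packets = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]  # copy out of the process buffer
        header = struct.pack(HEADER_FORMAT, src_port, dst_port, offset, len(chunk))
        packets.append(header + chunk)  # copy again into the assembled packet
    return packets


packets = build_packets(b"x" * 3000, src_port=5000, dst_port=80)
```

Each payload byte is copied at least once by the CPU before the packet reaches the NIC, which is the cost that hardware-assisted sending aims to remove.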
More advanced NIC devices available today are enabled to perform the data packet assembly themselves. By providing network interface control devices with network I/O hardware device acceleration-assist technology, hardware-assisted “zero-copy” sending and receiving functionality may be performed, which reduces the load on system components. For example, TCP Segment Offload (TSO) and like technologies such as Intel Corp.'s I/O Acceleration Technology (I/OAT) may be implemented in a network interface card to enable increased throughput for many varieties of network traffic. For a zero copy sending function, implemented in a network I/O hardware device acceleration-assist technology enabled network card 21, as exemplified in FIG. 1(b), the O/S informs a TSO-enabled network interface card 21 where to obtain the header (H) and data (D) from the physical memory. The hardware-assist provisioned network card 21, utilizing Direct Memory Access (DMA) mechanisms, then assembles the network packet 25. This process is referred to as “zero copy” sending, as a data copy to the O/S has been eliminated. In some configurations, the hardware itself may calculate the IP and TCP checksums of the data so that the O/S does not need to prepare those in the specified header portion of the packet. A switch in the network interface card implementing network I/O hardware device acceleration sets a flag or indicator informing the device driver whether or not the card can perform checksumming. If the operating system does not want the overhead of performing checksumming itself, it will inform the network card that the card needs to perform checksumming computations for packets it will send.
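The idea of telling the card where the header and data reside, rather than copying them, can be approximated in software. In the sketch below, `memoryview` slices stand in for the DMA addresses handed to the card; this substitution, and the name `gather_descriptors`, are assumptions for illustration. Real hardware uses physical addresses and descriptor rings.

```python
def gather_descriptors(buffer: bytearray, header_len: int, mss: int):
    """Return (header_view, data_view) pairs that reference buffer in place.

    No payload bytes are copied; each pair describes one packet that the
    card could assemble by reading directly from memory.
    """
    view = memoryview(buffer)
    header = view[:header_len]
    descriptors = []
    for start in range(header_len, len(buffer), mss):
        descriptors.append((header, view[start:start + mss]))
    return descriptors


buf = bytearray(b"HDR!" + b"a" * 10)
descriptors = gather_descriptors(buf, header_len=4, mss=4)
buf[4] = ord("Z")  # a change to the source buffer is visible through the views
```

Because the descriptors alias the original buffer, the modification made after they were built is visible through them, confirming that no intermediate copy exists.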
As shown in FIG. 1(c), zero copy receiving works similarly by enabling a received packet 25 to be parsed by the hardware-assisted network card 21, which sends the header (H) to a predefined location in the O/S. The O/S, in response, is able to determine a target process which is to receive the data payload, and forwards the data (D) to a virtual memory address associated with that target process 15. In the case of connection based protocols, the O/S will know which packets are likely to arrive, and thus the NIC may be programmed to give the data payload directly to the user process. To perform this network I/O acceleration, a subset of network state information 30 maintained by the O/S, e.g., connection status according to a particular communications network protocol such as Gigabit Ethernet and TCP/IP, is also maintained at the network card 21. Provision of such information enables the network card to process the received packets more efficiently. Moreover, for the DMA transfer to work, the O/S must ensure that those pages of the target process which are about to receive the data are resident in physical memory (i.e., there must be a physical memory address that the network card can write to).
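The receive-side decision described above, namely looking up the connection from the header and confirming that the target pages are resident before a direct transfer, might be sketched as follows. The `Connection` record, the dictionary-based connection table, and the function `route_payload` are hypothetical names introduced for this example.

```python
from typing import NamedTuple, Optional


class Connection(NamedTuple):
    process: str      # hypothetical name of the receiving process
    buffer_page: int  # virtual page holding the process's receive buffer


def route_payload(header: dict, connections: dict, resident_pages: set) -> Optional[str]:
    """Return the target process if the payload can be written directly.

    Returns None when the connection is unknown or the target page is not
    resident, in which case the O/S would have to handle the packet itself.
    """
    key = (header["src_ip"], header["src_port"], header["dst_port"])
    conn = connections.get(key)
    if conn is None or conn.buffer_page not in resident_pages:
        return None
    return conn.process


connections = {("10.0.0.2", 40000, 80): Connection("web_server", 12)}
header = {"src_ip": "10.0.0.2", "src_port": 40000, "dst_port": 80}
```

With page 12 resident, the lookup yields the target process; with the page absent from the resident set, the direct path is unavailable.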
Modern computers are powerful enough to support virtualization, whereby the hardware of a single computer (e.g., CPU, memory, disk drives, NIC, etc.) is abstracted into different execution environments, referred to as Virtual Machines (VMs). As shown in FIG. 1(d), the virtualized computing environment 100 is enabled by the implementation of a virtualization layer 40, e.g., virtualization software such as provided by VMware, Inc. of Palo Alto, Calif., or Xen (University of Cambridge Computer Laboratory). Such software executes under control of the host O/S, such as Linux or Windows, and enables the host system 100 to run several “guest” operating systems 50 (guest O/Ses) as independent virtual machines. Using the virtualization software 40, a host O/S 12 creates an illusion that a guest process has its own processor with its own virtual memory, and additionally provides interfaces for the processes that are identical to the interfaces of the native hardware. While the virtualization software is depicted in FIG. 1(d) as running inside the host O/S, it is executable as a separate application outside and under the control of the host O/S. The guest O/S and guest process may reside in virtual memory. Consequently, the guest O/S does not know where in physical memory the guest lives, or whether the virtual memory associated with the guest O/S is even represented in physical memory. Moreover, as shown in FIG. 1(d), a virtual network 199 of guest virtual machines may exist.
Currently, the network I/O acceleration techniques depicted in FIGS. 1(b) and 1(c) do not work in the virtualized environment 100 such as shown in FIG. 1(d). That is, special considerations must be taken into account when implementing network I/O hardware device acceleration-assist technology (e.g., accelerated TCP segment offload (TSO) or other hardware assisted TCP) in virtualized environments. For example, as shown in FIG. 1(e), for the case of receiving network packets 25 via the NIC 21 without hardware assist implemented in a virtualized environment, the host operating system 12 must first receive the packet from the NIC, analyze the packet, and determine which destination guest is to receive the packet. The host O/S 12 must then send the packet to the guest O/S receive (kernel) buffer, which is one memory to memory copy. Once the guest O/S 50 receives the packet, the guest O/S must analyze the packet 25 to determine the process 55 executing in the guest, and the socket buffer, for which the data is destined, and then copy the data to the socket buffer of that process, which is another memory to memory copy. As a result, virtualized network performance is not as good as network performance of the operating system running “natively” on the hardware.
Referring back to FIG. 1(d), in a virtual network 199 comprising a number of executing guest virtual machines (guest O/Ses 50 and corresponding processes) existing on computing system 100, these guest processes may communicate with the host O/S 12 and, further, communicate with each other through the host O/S via virtual interfaces provided by the virtualization layer 40. Currently, as shown in FIG. 1(f), when a first guest process 55a associated with a first virtual machine 50a sends data to a second guest process 55b, the guest O/S on the sending side copies the data (D) into a network packet 25 that is formed by the first guest O/S (a first data copy operation). The network packet 25 is then sent to the host O/S 12 (a second data copy operation), and, subject to allowance after applying firewall rules by the host O/S, the packet 25 is sent to the receiving guest O/S (a third data copy), which then copies the data (D) to the destination receiving guest process 55b (a fourth data copy), for a total of four copies of the data (and two packet header copies). As a result, virtualized network performance is compromised.
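The four data copies incurred on the guest-to-guest path can be made explicit with a small model. This sketch only counts buffer copies; it does not model headers or firewall rules, and the function names are assumptions for illustration.

```python
def copy_buffer(src, log, label):
    """Copy src into a new buffer and record the copy in log."""
    log.append(label)
    return bytearray(src)  # bytearray() always allocates a new buffer


def guest_to_guest(data: bytes):
    """Model the conventional path between two guest processes."""
    log = []
    sender_packet = copy_buffer(data, log, "sender process -> sender guest O/S packet")
    host_packet = copy_buffer(sender_packet, log, "sender guest O/S -> host O/S")
    receiver_packet = copy_buffer(host_packet, log, "host O/S -> receiver guest O/S")
    delivered = copy_buffer(receiver_packet, log, "receiver guest O/S -> receiver process")
    return bytes(delivered), log


delivered, copy_log = guest_to_guest(b"hello")
```

The log records one entry per memory-to-memory copy, matching the four data copy operations enumerated above even though both guests reside on the same physical machine.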
In the case of Intel's I/OAT, network data is directly provided to a guest O/S kernel in some of the virtual machines on a system; however, it is not provided directly into userspace. That is, the data has to go to an unsorted buffer, from where the guest O/S needs to do the final copy of the data. In operation, the I/OAT network hardware can “split” the traffic into a limited number of streams: if the network card has N streams, N−1 can be directed to guest virtual machines and one stream has to go to the host O/S, which then copies the data and headers to guest virtual machines. The host O/S can point the streams from the network card at those virtual machines that have the highest rate of network traffic, can periodically reevaluate which virtual machines generate the most network traffic, and can decide to point some of the network streams at other virtual machines. As part of this technology, the headers and data of the packets can be sent into different buffers. The hardware further includes an asynchronous memory copying unit, so the host O/S could choose to: 1) receive all packets; 2) optionally, separate headers and data into different buffers; 3) process packets; 4) determine the virtual machine for which each packet is destined; 5) apply firewall rules; and, 6) have the mainboard chipset copy the data and header over to the virtual machine, while the CPU processes the next packet. Moreover, in this I/OAT technology, the host O/S can choose to copy the data to kernel buffers in the guest O/S, but have the headers sent to itself, in order to apply firewall rules.
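The stream allocation policy described above, in which N−1 streams are pointed at the busiest virtual machines and the remaining stream goes to the host O/S, may be sketched as follows. The function name `assign_streams`, the stream labels, and the traffic figures are hypothetical; this is not an I/OAT programming interface.

```python
def assign_streams(traffic_by_vm: dict, num_streams: int) -> dict:
    """Point N-1 streams at the busiest VMs; the rest share the host stream.

    traffic_by_vm maps a VM name to its observed traffic rate. Ties are
    broken by VM name so that the result is deterministic.
    """
    direct_slots = max(num_streams - 1, 0)  # one stream is reserved for the host O/S
    ranked = sorted(traffic_by_vm, key=lambda vm: (-traffic_by_vm[vm], vm))
    assignment = {}
    for rank, vm in enumerate(ranked):
        assignment[vm] = f"stream-{rank}" if rank < direct_slots else "host"
    return assignment


assignment = assign_streams({"vm_a": 10, "vm_b": 500, "vm_c": 90}, num_streams=3)
```

Re-running the function with fresh traffic measurements corresponds to the periodic reevaluation performed by the host O/S; virtual machines that fall back to the host stream again incur the host-side copy.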
It would thus be highly desirable to provide a system, method and computer program product that extends network I/O hardware device acceleration-assist (e.g., TSO and like hardware-assist acceleration I/O) to virtualized environments so that full zero copy network receive and send functionality may be implemented in such virtualized environments, and so that one or more data copies may be eliminated on hardware that does not implement TSO or other I/O hardware device acceleration-assist technology.
It would further be highly desirable to provide a system, method and computer program product that extends hardware device acceleration-assist technology to virtualized environments, whereby the virtualization software implemented at the host O/S emulates hardware accelerated network I/O operations, enabling virtual machines present on the same computing system to communicate with the host and with each other without the overhead of excessive data copy operations.
It would further be highly desirable to provide a system, method and computer program product that extends hardware device acceleration-assist technology, e.g., TSO, to virtualized environments, providing full zero copy network sending on TSO and non-TSO enabled systems in virtualized environments; providing full zero copy network receiving on TSO enabled systems and eliminating one or more data copies on hardware not implementing TSO.