1. Field of the Invention
This invention relates in general to data transfer within a computer system, in particular where data is transferred in units such as packets, and especially where the computer includes at least one software-implemented computer such as a virtual machine.
2. Description of the Related Art
Increasing the speed and efficiency of input/output (I/O) operations in computers is a constant goal of developers of both hardware and software. Working against this goal is the tendency of computer systems to increase in complexity, which in turn increases the burden on system software, such as the operating system, that must try to meet the increased demand on I/O subsystems and bandwidth.
One way to maximize the speed of an I/O transfer would of course be for the operating system to devote itself entirely to a current I/O operation, putting all other processes on hold until the transfer is completed. Similarly, an I/O channel, such as a data bus or network, could be devoted entirely to one transfer at a time. Such a dedicated arrangement is impractical, inflexible, and unworkable in modern multi-tasked computer systems. Depending on the type of I/O device involved, such an arrangement might also waste available I/O bandwidth.
In order to increase the flexibility not only of scheduling but also of routing, especially between systems, many I/O operations now operate with basic I/O units that are often referred to as “packets.” In this widespread scheme, blocks of data to be transmitted are first converted into a sequence of smaller packets according to a predefined protocol. Each packet may then be transmitted individually. As part of the protocol, additional information (a “header”) is therefore typically added to each packet in order to describe characteristics of the data it contains, such as the destination, as well as information to aid in error detection, reassembly of the original data block, etc. The header information, if any is included at all, will be determined by the given transfer protocol. Examples of systems that transfer data using packets include the Internet, digital packet-switched telephone networks, and the large variety of Universal Serial Bus (USB) devices.
FIG. 1 illustrates the main components of a conventional computer system. System hardware 100 includes one or more central processors CPU(s) 110, which may be a single processor, or two or more cooperating processors in a known multiprocessor arrangement. As in most computers, two different types of data storage are commonly provided: system memory 112, typically implemented using any of the various RAM technologies, and a usually higher-capacity storage device 114 such as one or more disks. The hardware usually also includes, or is connected to, conventional registers, interrupt-handling circuitry, etc., as well as a memory management unit MMU 116, which provides support for such operations as memory tracing. System software 200 either is or at least includes an operating system OS 220, which will include drivers 222 as needed for controlling and communicating with various devices, including the disk 114. Applications 300 are installed to run on the hardware 100 via the system software 200.
FIG. 1 also illustrates various devices 400-1, 400-2, . . . , 400-m, which may share and be connected to the rest of the system by a communication channel 450 such as a bus or network. In other cases, individual devices may have dedicated connections to the system hardware 100. It is assumed here that data transfer between the system hardware 100 and each device 400-1, 400-2, . . . , 400-m takes place in units such as packets; other types of devices may of course also be connected to the hardware 100, both directly and via the network.
For the purpose of data transfer, each device 400-1, 400-2, 400-m will have associated with it at least one communication channel, whether shared or dedicated. Each device may be considered to be a separate “target” or “destination” when it comes to data transfer. A hardware device controller 140 is also typically included for each device, or for each group of devices that share a common channel 450. In FIG. 1, only one such device controller 140 is shown, merely for the sake of simplicity. A driver 224 is also loaded in the operating system in order to support the hardware controller 140.
Assume by way of a very common example that the devices 400-1, 400-2, 400-m are USB devices, so that data transfer over the channel (in USB contexts, a bus) 450 is to take place according to the well known USB protocol. Whenever some “source” sub-system or process, such as an application, initiates a request for transfer of a block of data D to a USB device, that is, an OUT operation, it establishes a memory buffer 130 in which it stores the data D. The source sub-system then generates a corresponding transfer request to indicate to the controller driver 224 that it should begin the procedure (described below) for transferring the data set D. The buffer is also established for data that is to be input from the USB device, that is, for an IN operation. Note that, in other systems, according to other protocols, the controller driver 224 may be responsible for establishing the buffer 130.
See also FIG. 2. The controller driver 224 then splits the source's data request into sub-blocks whose size is chosen to be consistent with bus bandwidth requirements and bus (for example, USB) protocol mechanisms. For the sake of illustration, in FIG. 2, the source data set, that is, the data block D, is shown as being subdivided into three sub-sets or “sub-blocks” D1, D2, and D3. In most practical cases, the number of sub-blocks will be much greater, depending on the size of the original data set D. Each sub-block D1, D2, and D3 of the source data block D is used as the basis for a single “transaction,” which results in the data sub-block being transferred from the source's buffer 130 to the USB device, or vice versa. The transfer procedure is typically the same regardless of the number of transactions. For the most common case of bulk transfers, each data sub-block D1, D2, D3, etc., is typically limited to either 32 or 64 bytes.
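By way of illustration only, the subdivision of the data block D into sub-blocks may be sketched as follows. The function name split_into_sub_blocks and the default limit of 64 bytes are assumptions chosen for this example; they do not represent the interface of any actual controller driver.

```python
def split_into_sub_blocks(data: bytes, max_size: int = 64) -> list[bytes]:
    """Split a data block D into consecutive sub-blocks of at most max_size bytes.

    max_size is an illustrative parameter; the text mentions 32 or 64 bytes
    as typical limits for bulk transfers.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [data[i:i + max_size] for i in range(0, len(data), max_size)]


block_d = bytes(range(150))            # a 150-byte data block D
sub_blocks = split_into_sub_blocks(block_d)
print([len(s) for s in sub_blocks])    # [64, 64, 22]
```

In this sketch the last sub-block is simply shorter than the others when the size of D is not a multiple of the limit.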
The “raw” data sub-sets D1, D2, D3, etc., alone are generally not enough to adequately define the parameters of a desired transfer. Rather, each sub-set is usually included in or referenced by another data structure that also specifies such information as the destination, the direction of transfer (IN or OUT), the size of the data sub-set to be transferred, etc. In the USB context, the data structures used for this purpose are known as “transfer descriptors” (TDs). TDs are usually arranged as a linked list, and each TD also includes a pointer to the memory location of the respective actual data sub-set to be transferred, as well as a pointer to the next TD. Similar descriptors are usually also created for data transfer using other protocols. Continuing with the example of transfer according to the USB protocol, the driver 224 then builds a list of pending transactions that are targeted for one or more USB devices attached to the bus 450. Each TD defines one transaction. In FIG. 2, transfer descriptors TD1, TD2, and TD3 are shown as having been created for the respective data sub-blocks D1, D2, and D3. The TDs are also stored in memory, for example, in a memory space, that is, a TD buffer 131, established for the purpose.
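A linked list of transfer descriptors of the kind just described may be sketched as follows. The class name, the field names and the helper build_td_list are hypothetical and serve only to illustrate the structure; actual controllers define their own descriptor layouts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferDescriptor:
    # Field names are illustrative; real controllers define their own layout.
    device_address: int                            # destination device
    direction: str                                 # "IN" or "OUT"
    buffer_offset: int                             # start of the sub-block in the buffer
    length: int                                    # size of the sub-block in bytes
    next: Optional["TransferDescriptor"] = None    # pointer to the next TD


def build_td_list(total_len, device_address, direction, max_size=64):
    """Build a linked list of TDs covering total_len bytes and return its head."""
    head = tail = None
    for offset in range(0, total_len, max_size):
        td = TransferDescriptor(device_address, direction, offset,
                                min(max_size, total_len - offset))
        if tail is None:
            head = td          # first descriptor becomes the head of the list
        else:
            tail.next = td     # link the previous descriptor to this one
        tail = td
    return head


td = build_td_list(150, device_address=5, direction="OUT")
while td is not None:
    print(td.buffer_offset, td.length)
    td = td.next
```

Each descriptor here refers to its data sub-block by offset and length rather than holding a copy of the data, which corresponds to the pointer to the data sub-set mentioned above.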
In USB-based systems, at a predefined interval, the controller 140 begins to take TDs as inputs, usually (but not necessarily) one at a time, and from each TD and its respective data sub-block creates a data structure known as a “packet.” The controller 140 then transfers the packets sequentially to the bus 450 via a hub 141. The concept of a “packet” has somewhat varying definitions in the literature, but is used here to refer to the data structure(s) used to transfer a single data sub-block (D1, D2, or D3) to or from at least one destination via the bus 450. According to some other non-USB protocols, the controller begins the process of TD-to-packet conversion only after receiving a specific instruction to do so from the driver 224.
Each USB transfer typically consists of at least three phases, namely, a “token packet” phase, a “data packet” phase and a “handshake packet” phase. The token packet phase begins each transaction. It defines the type of transaction (IN or OUT), and, where the transaction targets a specific device, it normally includes the device address. Following the token packet comes the data packet, which is the data sub-block (D1, D2, D3, etc.) that currently is to be sent.
In order to guarantee data delivery, during the handshake packet phase, the target device returns to the sender (here: controller 140) information indicating whether the transaction was successful, whether it failed, or whether the intended target device was busy. If no signal is transmitted back to the controller within a predetermined time, then the controller 140 assumes that the packet transfer failed. In the case of a failed packet transfer, assuming any information is returned at all, the returned information normally includes at least the number of bytes that transferred successfully before the failure, and also usually a flag indicating what the error was. In the case of a busy device, the controller 140 typically attempts to resubmit the packet, and may continue to do so until the transfer succeeds or fails.
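The handling of the handshake result may be sketched as follows. The names Handshake and send_with_retry, the callable send_once and the bound max_attempts are assumptions made for this example; in particular, the text above does not specify any limit on the number of resubmissions.

```python
from enum import Enum


class Handshake(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    BUSY = "busy"
    TIMEOUT = "timeout"   # no response within the predetermined time


def send_with_retry(send_once, max_attempts=3):
    """Resubmit a packet while the target reports BUSY.

    send_once is a hypothetical callable that performs one attempt and
    returns a Handshake value. A timeout is treated as a failure. The
    max_attempts bound is an assumption added to keep the sketch finite.
    Returns (final_result, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        result = send_once()
        if result is Handshake.TIMEOUT:
            return Handshake.FAILURE, attempt
        if result is not Handshake.BUSY:
            return result, attempt
    return Handshake.FAILURE, max_attempts


# Example: the device is busy twice and then accepts the packet.
responses = iter([Handshake.BUSY, Handshake.BUSY, Handshake.SUCCESS])
result, attempts = send_with_retry(lambda: next(responses))
print(result.value, attempts)
```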
In the following discussion, the term “packet” is used to refer to the entire multi-phase transfer procedure described above for each data sub-block to be sent. In other words, as used here, a “packet” is all the information that passes on the bus 450 between the controller 140 and a target device in order to transfer a data sub-block, as defined by one transfer descriptor (if TDs are used at all), either to or from the device. Moreover, even though it may comprise several phases, a packet may be considered a unit, since all of its parts are required for a transfer to succeed. In the figures, packets are illustrated as small rounded boxes labeled with a “p.” In FIG. 2, a packet pi is shown as having been created based on transfer descriptor TDi, which in turn was generated for the data sub-block Di.
As is mentioned above, at least in USB contexts, the controller 140 retrieves TDs and creates corresponding packets for transfer at specified intervals. In USB-based systems, each such interval is commonly referred to as a “frame.” All the TDs (or all that can be handled in the frame) available at the beginning of each frame are used to create the packets that the controller tries to transfer during that frame. In USB systems, TDs are fetched and converted into packets at 1 ms intervals. Note that the bus 450 is typically shared, so that it cannot carry more than one set of data at a time.
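The frame-by-frame consumption of TDs may be sketched as follows. The function run_frames and the parameter per_frame_capacity are hypothetical; the capacity stands in for “all that can be handled in the frame” and is not a value taken from any protocol.

```python
from collections import deque


def run_frames(pending, per_frame_capacity):
    """Group pending TDs into the batches handled in successive frames.

    At the start of each frame the controller takes the TDs that are
    available, up to per_frame_capacity (an illustrative limit).
    """
    if per_frame_capacity <= 0:
        raise ValueError("per_frame_capacity must be positive")
    queue = deque(pending)
    frames = []
    while queue:
        count = min(per_frame_capacity, len(queue))
        frames.append([queue.popleft() for _ in range(count)])
    return frames


print(run_frames(["TD1", "TD2", "TD3", "TD4", "TD5"], 2))
# [['TD1', 'TD2'], ['TD3', 'TD4'], ['TD5']]
```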
The controller 140 (usually via the driver 224) typically also indicates to the sub-system (such as one of the applications 300) that originally requested the transfer of data set D whether the requested transfer succeeded or failed. Such acknowledgement usually also takes place at 1 ms intervals. Using the USB protocol, the time it takes from when the transfer-requesting sub-system first submits the data D for transfer until it gets notification of the success or failure of the transfer is therefore roughly 2 ms. In short, each transmit/receive frame takes about 2 ms to complete, at which time the controller 140 can process the next frame of transactions (TDs). Communications procedures and mechanisms similar to those just described in the USB context are found in systems that operate according to other protocols.
Input of data from a device, that is, an IN operation, is also carried out in the form of packets, with the same protocol. As with OUT operations, TDs are generated that define the destination, buffer address, etc. of a data sub-set, but the result of transmission of a packet derived from such a TD is that the data sub-set is input from the destination and placed in the buffer 130. In short, input of a packet of data is handled in essentially the same manner as packet output, at least with respect to the data phase of the packet (the token and handshake phases need not be reversed), but the direction in which the corresponding data sub-set is transferred is the opposite.
The description above relates to conventional computer systems, but applies also, with some extensions, to virtualized computer systems. As in conventional non-virtualized systems, computers with virtualization build upon hardware and system software layers. FIG. 1 also shows the main components of a typical virtualized computer system, which includes the underlying system hardware platform 100, the system software 200, and at least one software construct known as a “virtual computer” or “virtual machine” (VM) 500.
As is well known in the art, a virtual machine (VM) is a software abstraction—a “virtualization”—of an actual physical computer system. As such, each VM will typically include virtualized (“guest”) system hardware 501 and guest system software 502, which are software analogs of the physical hardware and software layers 100, 200. Note that although the hardware “layer” 501 will be a software abstraction of physical components, the VM's system software 502 may be the same as would be loaded into a “real” computer. The modifier “guest” is used here to indicate that the various VM software components, from the perspective of a user, are independent, but that actual execution is carried out on the underlying “host” hardware and software platform. The virtual system hardware 501 includes one or more virtual CPUs 510 (VCPU), virtual system memory 512 (VMEM), a virtual disk 514 (VDISK), and virtual devices 539 (VDEV), all of which are implemented in software to emulate the corresponding components of an actual computer. Of particular relevance here is that the VM will include a virtualized controller 540 (VCTRL) having the same functions as the hardware controller 140, that is, the generation and handling of packets (or analogous transfer units) and coordination of packet transfer with the communication channel.
The guest system software 502 includes a virtual or “guest” operating system 520 (guest OS, which may, but need not, simply be a copy of a conventional, commodity OS), as well as drivers 522 (DRVS) as needed, for example, to control the virtual devices 539. Of particular relevance here is that a driver 524 is included for the virtual controller 540 (itself a virtual device); the driver 524 operates in the same manner as the driver 224.
Of course, most computers are intended to run various applications, and a VM is usually no exception. Consequently, by way of example, FIG. 1 illustrates one or more applications 503 installed to run on the guest OS 520; any number of applications, including none at all, may be loaded for running on the guest OS, limited only by the requirements of the VM. If the VM is properly designed, then the applications (or the user of the applications) will not “know” that they are not running directly on “real” hardware. Of course, all of the applications and the components of the VM are instructions and data stored in memory, just as any other software. The concept, design and operation of virtual machines are well known in the field of computer science. FIG. 1 illustrates a single VM 500 merely for the sake of simplicity; in many installations, there will be more than one VM installed to run on a common hardware platform; all will have essentially the same general structure, although the individual components need not be identical.
Some interface is usually required between the VM 500 and the underlying “real” or “host” OS 220 and hardware 100, which are responsible for actually executing VM-issued instructions and transferring data to and from the actual, physical memory and disk 112, 114. In this context, the “host” OS means either the native OS 220 of the underlying physical computer, or whatever system-level software handles actual I/O operations, takes faults and interrupts, etc.
One advantageous interface between the VM and the underlying system software layer and/or hardware is often referred to as a virtual machine monitor (VMM). Virtual machine monitors have a long history, dating back to mainframe computer systems in the 1960s. See, for example, Robert P. Goldberg, “Survey of Virtual Machine Research,” IEEE Computer, June 1974, pp. 34–45. A VMM is usually a thin piece of software that runs directly on top of a host, such as the system software 200, or directly on the hardware, and virtualizes all the resources of the (or some) hardware platform. The VMM will typically include a software module 640 for emulating devices, as well as modules for such functions as memory management 616, etc. The interface exported to the respective VM is usually such that the virtual OS 520 cannot determine the presence of the VMM. The VMM also usually tracks and either forwards (to the host OS 220) or itself schedules and handles all requests by its VM for machine resources as well as various faults and interrupts. The general features of VMMs are known in the art and are therefore not discussed in detail here.
In the figures, a VMM 600 is shown acting as the interface for the single VM 500. It would also be possible to include each VMM as part of its respective VM, that is, in each virtual system. Although the VMM is usually completely transparent to the VM, the VM and VMM may be viewed as a single module that virtualizes a computer system. The VM and VMM are shown as separate software entities in the figures for the sake of clarity. Moreover, it would also be possible to use a single VMM to act as the interface for more than one VM, although it will in many cases be more difficult to switch between the different contexts of the various VMs (for example, if different VMs use different virtual operating systems) than it is simply to include a separate VMM for each VM. The invention described below works with all such VM/VMM configurations.
The important point is simply that some well-defined, known interface should be provided between each installed VM 500 and the underlying system hardware 100 and software 200, and that this interface should contain the components of the invention described below. Consequently, instead of a VMM, with respect to the data transfer procedures described below, the interface between the VM (or other software-implemented computer) and the host software and hardware could be one or a group of device emulation sub-systems.
In some configurations, such as the one illustrated in FIG. 1, the VMM 600 runs as a software layer between the native system software 200 and the VM 500. In other configurations, the VMM runs directly on the hardware platform 100 at the same system level as the host operating system (host OS). In such case, the VMM typically uses the host OS to perform certain functions, often I/O, by calling the host drivers 222, usually through a host API (application program interface). In this situation, it is still possible to view the VMM as an additional software layer inserted between the hardware 100 and the guest OS 520. Furthermore, it may in some cases be beneficial to deploy VMMs on top of a thin software layer, a “kernel,” constructed specifically for this purpose. Compared with a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services that extend across multiple virtual machines (for example, resource management). Compared with the hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting of VMMs. The invention described below may be used in all these different configurations.
In addition to controlling the instruction stream executed by software in virtual machines, the VMM also controls other resources in order to ensure that the virtual machines remain encapsulated and do not interfere with other software on the system. First and foremost, this applies to I/O devices, but also to interrupt vectors, which generally must be directed into the VMM (the VMM will conditionally forward interrupts to the VM). Furthermore, the memory management (MMU) functionality normally remains under control of the VMM in order to prevent the VM from accessing memory allocated to other software on the computer, including other VMs. In short, not only is the entire state of the VM observable by the VMM, but the entire operation of the VM is also under the control of the VMM.
According to the prior art, packet-based data transfer between a source (such as one of the applications 503) within the VM and a physical device (destination) is essentially the same as described above in the non-virtualized context, with the exception that the transfer is “duplicated”: The source data block D is first transferred (usually, copied) from the transfer-requesting source process into a buffer 530, which is normally established by the source process itself but could alternatively be established by the driver 524. The driver 524, which is analogous to (and in many cases an identical copy of) the driver 224, then builds a list of TDs from the buffered data and stores the TDs in, for example, a memory space 531.
The virtual device controller 540 (a software analog of the controller 140) then constructs packets from the TDs and corresponding data sub-blocks, and passes them sequentially to what it “believes” is a channel (such as 450). In fact, however, the VM-issued packets are received (in particular, intercepted) by an emulated bus 650 (see FIG. 3) within the VMM. The VMM in turn passes each VM-issued packet to the system software and hardware, which places the (or a corresponding) packet on the “real” bus 450. Note that the device to which (or from which) the packets are to be sent (or received) is typically one of the physical devices 400-1, 400-2, . . . , 400-m, although these may also be emulations.
As can be understood from the discussion above, with respect to packet-based transfer, the VM is designed and intended to act just like a conventional non-virtualized system, the major structural difference being that the various hardware components involved, including the controller 540 and the channel, are implemented in software. Again, with respect to packet transfer, the VM/VMM interface is essentially a software “copy” of a hardware 100/channel 450 interface.
In many cases, virtualization increases the ability of the system as a whole to adapt to new devices. For example, a given host OS 220 may not be able to communicate with the latest digital camera, either because the proper driver has not been installed, or because no driver is even available for the host OS 220: The host OS might be a version of Linux, for example, whereas the camera manufacturer might supply drivers only for Microsoft Windows operating systems. In this case, one could install the Windows OS in the VM as the guest OS 520 and run the VM on the Linux-based host OS 220. The camera driver can then be installed as one of the drivers 522 in the VM. Regardless of the guest OS/host OS relationship, what is then needed is some way for the camera driver to communicate with the actual device (the physical camera) without having to rely on any camera driver in the host OS 220. Techniques to accomplish such “packet pass-through” are known.
Whether in a “real” or virtualized system, and especially where packet pass-through is implemented, packet transfer in the prior art suffers from the same problem, namely, delay. A major factor that contributes to this delay is the 2 ms (in USB contexts) interval at which new TDs are fully processed by the controller 140 (including reporting the results of the attempted transfers). This delay is especially problematic in virtualized (or other software-implemented) computer systems: In order to correctly emulate the actions of a bus, the VMM must return information to the virtual controller indicating the success or failure (or destination busy state) of each packet that the virtual controller sends. The VMM cannot indicate a transfer result for a packet, however, until it in turn receives indication of success or failure of the transfer of the packet by the hardware controller via the “real” bus 450. The VMM must therefore wait for the results of the “real” transfer before it can pass the results on to the virtual controller. Because the current VM-generated packet will be the only one the VMM presents for further processing and actual transfer by the hardware controller, only that one packet will be processed during the 2 ms two-way transfer period. The most common case is therefore also the worst case: It may take around 2 ms to completely transfer each single packet from the VM. In many cases, especially where many packets must be transferred, the resultant delay will be unacceptable; indeed, the accumulated delay may render some time-critical transfers altogether impossible.
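The magnitude of the accumulated delay can be estimated with a short calculation. The sketch below assumes the figures given above, namely roughly 2 ms per packet and sub-blocks of at most 64 bytes; the 64 KB transfer size and the function name pass_through_delay_ms are arbitrary choices made for this example.

```python
import math

FRAME_ROUND_TRIP_MS = 2.0   # submit-to-notification time given in the text
MAX_SUB_BLOCK_BYTES = 64    # upper bulk sub-block size given in the text


def pass_through_delay_ms(total_bytes):
    """Estimate the delay when only one packet completes per 2 ms round trip."""
    packets = math.ceil(total_bytes / MAX_SUB_BLOCK_BYTES)
    return packets * FRAME_ROUND_TRIP_MS


# A 64 KB block needs 1024 packets, each costing about 2 ms.
print(pass_through_delay_ms(64 * 1024))   # 2048.0
```

Under these assumptions, a transfer of only 64 KB from the VM would take on the order of two seconds, which illustrates why the per-packet round trip dominates the total transfer time.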
What is needed is a mechanism that increases the speed with which packet-based I/O operations can be carried out, without violating the requirements of the given I/O protocol, especially where the request to transfer data must cross more than one transfer interface, in particular, where the source of the transfer request is in a software-implemented computer such as a virtual machine. This invention provides such a mechanism.