Server virtualization technology makes it possible to use a single computer as a plurality of computers by running a plurality of independent operating systems (OS) on the computer as virtual machines (VM). As the number of central processing unit (CPU) cores and the memory size installed in a server increase, the number of virtual machines operating on the server also increases, and in some cases over 100 virtual machines may operate on a single server.
Such a server virtualization technology provides each virtual machine with virtual hardware, that is, a virtual CPU, a virtual memory, a virtual disk device, a virtual I/O device, and the like, and each virtual machine uses the virtual hardware as real hardware.
When each virtual machine uses the virtual hardware, a management program called a hypervisor or a virtual machine monitor (VMM), which is a main component of the server virtualization technology, traps the access to hardware by the virtual machine and performs a suitable access to a resource in real hardware. Accordingly, whenever a virtual machine performs some type of processing, overhead caused by the virtual machine monitor is added, and therefore performance of a virtual machine is generally lower than that of an OS operating on real hardware.
In order to avoid overhead, technologies of directly allocating a real hardware resource to a virtual machine are proposed and implemented. One of the technologies is peripheral component interconnect (PCI) pass-through. The PCI pass-through enables a specific virtual machine to directly access an I/O device connected to a PCI bus or a PCI-Express fabric, and enables reduction of overhead by the virtual machine monitor. The PCI pass-through technology will be described below, and it is assumed in the present description that an I/O device refers to an endpoint in PCI-Express.
However, the PCI pass-through has a problem that a target I/O device is occupied by a specific virtual machine and therefore cannot be used by another virtual machine. Consequently, a technology of making an I/O device directly accessible from a plurality of virtual machines has been proposed. The technology is single root I/O virtualization (SR-IOV). An I/O device compatible with the SR-IOV includes a plurality of host interfaces and enables sharing of a single I/O device by a plurality of virtual machines by allocating each host interface to a virtual machine. The SR-IOV is often employed in an Ethernet (registered trademark) network interface controller (NIC) and is rarely employed in I/O devices other than a NIC, such as a disk controller and a graphics card.
The PCI pass-through and the SR-IOV (the technologies are hereinafter collectively referred to as “pass-through technologies”) have not only an advantage in performance but also a functional advantage that a function included in I/O device hardware can also be used by a virtual machine. Virtual hardware normally used by a virtual machine often simulates older generation hardware and generally only has a simple function. For example, a transmission control protocol (TCP) offloading function included in a high-end Ethernet (registered trademark) NIC cannot be used. The PCI pass-through technology allocates an I/O device directly to a virtual machine, and therefore such an offloading function becomes available to the virtual machine.
By use of FIG. 24, initialization of a PCI-Express fabric will be described. FIG. 24 is a diagram illustrating an example of a PCI-Express fabric.
As illustrated in FIG. 24, a PCI-Express fabric 200 has a configuration including a PCI-Express root complex 201, PCI-Express endpoints 202 to 207, and PCI-Express switches 208 and 209 that are connected through PCI-Express links 210 to 217, with the PCI-Express root complex 201 as a root of the fabric.
The PCI-Express root complex, the PCI-Express endpoint, and the PCI-Express switch are collectively referred to as PCI-Express devices, in the present description. Although an initialization method described below is a method performed by a common personal computer (PC) and a general-purpose OS such as Linux (registered trademark), another initialization method may be employed.
First, when a PC is turned on, a basic input/output system (BIOS) or an OS searches the PCI-Express fabric 200. The search is performed for detecting and setting every PCI-Express device in the PCI-Express fabric 200. PCI-Express identifies a PCI-Express device by three numbers (a bus number [0 to 255], a device number [0 to 31], and a function number [0 to 7]) called bus-device-function (BDF). The function number is a number used for identifying each function when the same PCI-Express device has a plurality of functions.
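The ranges of the three numbers correspond to 8, 5, and 3 bits, so that a BDF fits in 16 bits. The following is a minimal sketch of that packing, assuming the commonly used layout in which the bus number occupies the upper 8 bits; the function names are hypothetical and are used only for illustration.

```python
from typing import Tuple


def pack_bdf(bus: int, device: int, function: int) -> int:
    """Pack a bus-device-function triple into one 16-bit identifier."""
    if not (0 <= bus <= 255 and 0 <= device <= 31 and 0 <= function <= 7):
        raise ValueError("BDF field out of range")
    # Assumed layout: bus in bits 15:8, device in bits 7:3, function in bits 2:0.
    return (bus << 8) | (device << 3) | function


def unpack_bdf(bdf: int) -> Tuple[int, int, int]:
    """Split a 16-bit identifier back into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x07


def format_bdf(bdf: int) -> str:
    """Render the identifier in a bus:device.function notation (hexadecimal)."""
    bus, device, function = unpack_bdf(bdf)
    return f"{bus:02x}:{device:02x}.{function}"
```

For example, `pack_bdf(1, 2, 3)` yields 0x0113, which `format_bdf` renders as `01:02.3`.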
The PCI-Express root complex 201, and the PCI-Express switches 208 and 209 will be described.
FIG. 25 is a simplified block diagram illustrating an internal configuration of the PCI-Express root complex 201.
Referring to FIG. 25, the PCI-Express root complex 201 includes a PCI compatible host bridge device, PCI-PCI bridges (root PCI-Express ports), and a root complex register block (optional).
In order to connect the components, the PCI-Express root complex internally consumes one bus number. Since the PCI-Express root complex is a device located at a root of the PCI-Express fabric, the bus number to be consumed is “0.”
FIG. 26 is a simplified block diagram illustrating an internal configuration of the PCI-Express switches 208 and 209.
Referring to FIG. 26, each of the PCI-Express switches 208 and 209 includes a PCI-PCI bridge (upstream PCI-Express port) and PCI-PCI bridges (downstream PCI-Express ports). In order to connect the components, each of the PCI-Express switches 208 and 209 internally consumes one bus number. The upstream refers to a direction getting closer to the PCI-Express root complex 201 on the PCI-Express fabric. The downstream refers to a direction moving away from the PCI-Express root complex 201 on the PCI-Express fabric. Although FIG. 26 illustrates a case of two downstream ports, there may be three or more downstream ports.
In PCI-Express, a connection between PCI-Express devices is a point-to-point connection through a switch rather than a bus connection. Therefore, only one PCI-Express endpoint or one PCI-Express switch is connected to each PCI-PCI bridge (root PCI-Express port) on the PCI-Express root complex 201 and to each PCI-PCI bridge (downstream PCI-Express port) on the PCI-Express switches 208 and 209, and a different bus number is allocated to each link.
The search starts from the bus number 0. An initialization program such as a BIOS or an OS reads a vendor identification (ID) of a PCI-Express device at each device number on the bus number 0. The vendor ID is saved in a register group called a PCI configuration space in a PCI-Express device. A read value other than 0xFFFF (0x is a prefix denoting a hexadecimal number) indicates that some PCI-Express device is connected.
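The vendor ID probe can be sketched as follows. This is an illustration only: `FakeConfigSpace` is a hypothetical in-memory stand-in for real configuration access, the vendor ID values in the usage are placeholders, and the sketch assumes that the vendor ID is a 16-bit little-endian value at offset 0x00 of the PCI configuration space and that only function 0 of each device number is probed.

```python
import struct

VENDOR_ID_OFFSET = 0x00   # assumed location of the 16-bit vendor ID
NO_DEVICE = 0xFFFF        # value read when no device responds


class FakeConfigSpace:
    """Hypothetical stand-in for configuration reads on a real fabric."""

    def __init__(self):
        self.devices = {}  # (bus, device, function) -> 256-byte register image

    def add_device(self, bus, device, function, vendor_id):
        space = bytearray(256)
        struct.pack_into("<H", space, VENDOR_ID_OFFSET, vendor_id)
        self.devices[(bus, device, function)] = space

    def read16(self, bus, device, function, offset):
        space = self.devices.get((bus, device, function))
        if space is None:
            return NO_DEVICE  # nothing answers at this BDF
        return struct.unpack_from("<H", space, offset)[0]


def scan_bus(cfg, bus):
    """Return (device number, vendor ID) for every device found on one bus."""
    found = []
    for device in range(32):  # device numbers 0 to 31
        vendor = cfg.read16(bus, device, 0, VENDOR_ID_OFFSET)
        if vendor != NO_DEVICE:
            found.append((device, vendor))
    return found
```

With two devices registered at device numbers 0 and 3 of the bus number 0, `scan_bus(cfg, 0)` returns those two entries and `scan_bus(cfg, 1)` returns an empty list.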
Next, when connection of some PCI-Express device is detected (the PCI-Express endpoints 202 and 203, and the PCI-Express switch 208 in FIG. 24), the initialization program reads a class code of the PCI-Express device. The class code is also saved in the PCI configuration space. The class code indicates the type of a PCI-Express device, such as whether the device is a device for image output. A class code indicating a device that connects links with different bus numbers, such as a PCI-Express switch, indicates that another PCI-Express device may exist downstream of the bus number currently being searched.
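As a rough illustration of how the class code distinguishes device types, the following sketch decodes a 24-bit class code into its three bytes. It assumes the conventional encoding in which a base class of 0x06 with a sub-class of 0x04 denotes a PCI-PCI bridge and a base class of 0x03 denotes a display controller; the function names are hypothetical.

```python
def decode_class_code(class_code: int):
    """Split a 24-bit class code into (base class, sub-class, programming interface)."""
    base_class = (class_code >> 16) & 0xFF
    sub_class = (class_code >> 8) & 0xFF
    prog_if = class_code & 0xFF
    return base_class, sub_class, prog_if


def is_pci_to_pci_bridge(class_code: int) -> bool:
    """True when a further bus may exist downstream of the device."""
    base_class, sub_class, _ = decode_class_code(class_code)
    return base_class == 0x06 and sub_class == 0x04


def is_display_controller(class_code: int) -> bool:
    """True for a device for image output."""
    return decode_class_code(class_code)[0] == 0x03
```

An initialization program following the description would continue the search downstream only when `is_pci_to_pci_bridge` holds for the detected device.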
Next, when a detected PCI-Express device is a PCI-Express endpoint (the PCI-Express endpoints 202 and 203 in FIG. 24), the initialization program allocates an I/O area and a memory area to the PCI-Express endpoint. The allocation is provided by setting of a base address register (BAR) included in a PCI configuration space in the PCI-Express device.
A PCI-Express endpoint includes a maximum of six BARs, numbered 0 to 5, and each BAR holds information about an I/O area or a memory area required by the PCI-Express endpoint. The initialization program writes 0xFFFFFFFF into the BAR 0 and then reads back the value of the BAR 0. The read value reveals whether an I/O area or a memory area is requested, what area size is required, and the like. In accordance with the request, the initialization program writes a base address into the BAR 0. The range starting at the base address and extending for the size requested by the PCI-Express endpoint is the I/O area or the memory area allocated to the PCI-Express endpoint. The BARs are set in such a way that the allocated areas do not overlap between PCI-Express endpoints. The initialization program performs similar processing on the BARs 1 to 5. Additionally, the initialization program also performs setting of a command register, a cache line size register, and a latency timer register in the PCI configuration space.
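The interpretation of the value read back from a BAR can be sketched as follows. The sketch assumes the conventional 32-bit BAR encoding (bit 0 set for an I/O area, the low 4 bits of a memory BAR and the low 2 bits of an I/O BAR carrying flags rather than address bits) and does not handle 64-bit memory BARs, which span two consecutive BARs. The function names and the starting address in the usage are hypothetical.

```python
def decode_bar_probe(readback: int):
    """Interpret the value read from a BAR after 0xFFFFFFFF was written to it.

    Returns None for an unimplemented BAR, else (kind, size in bytes).
    """
    if readback == 0:
        return None  # the BAR requests nothing
    if readback & 0x1:
        kind = "io"
        mask = readback & 0xFFFFFFFC  # drop the I/O flag bits
        if (mask & 0xFFFF0000) == 0:
            mask |= 0xFFFF0000  # device decodes only 16 address bits
    else:
        kind = "memory"
        mask = readback & 0xFFFFFFF0  # drop type and prefetch flag bits
    # Address bits that stayed 0 are hardwired; they give the size.
    size = (~mask + 1) & 0xFFFFFFFF
    return kind, size


def assign_bases(sizes, start):
    """Assign non-overlapping base addresses, each aligned to its own size."""
    bases = []
    cursor = start
    for size in sizes:
        cursor = (cursor + size - 1) & ~(size - 1)  # align up (size is a power of two)
        bases.append(cursor)
        cursor += size
    return bases
```

For example, a read value of 0xFFFF0000 corresponds to a memory area of 64 KiB, and a read value of 0xFFFFFF01 corresponds to an I/O area of 256 bytes. Aligning each base address to the requested size is one simple way to keep the allocated areas from overlapping.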
Next, when a detected PCI-Express device is a bridge device such as a PCI-Express root complex and a PCI-Express switch (the PCI-Express switch 208 in FIG. 24), the initialization program first performs setting of a BAR similarly to the PCI-Express endpoint. In the case of a bridge device, there are a maximum of two BARs from 0 to 1.
Then, the initialization program performs setting of a command register, a cache line size register, and a latency timer register in a PCI configuration space in the bridge device. Additionally, the initialization program performs setting of a primary bus number register, a secondary bus number register, and a subordinate bus number register. The primary bus number refers to a number of a bus existing on the upstream side of the local bridge device, and the secondary bus number refers to a number of a bus existing on the downstream side of the local bridge device. The subordinate bus number indicates the maximum bus number out of the bus numbers of links existing downstream of the local bridge device. Since detection of every device is not completed at this point, the maximum value 0xFF is provisionally set as the subordinate bus number.
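The relationship among the three bus number registers can be sketched with a depth-first numbering. This is a simplified model rather than a definitive implementation: the `Bridge` class is hypothetical, each object stands for one PCI-PCI bridge, `children` lists only the bridges reachable on its secondary bus, and the provisional 0xFF is replaced once the downstream search described later in the present description has finished.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bridge:
    """Hypothetical model of one PCI-PCI bridge and the bridges below it."""
    name: str
    children: List["Bridge"] = field(default_factory=list)
    primary: int = 0
    secondary: int = 0
    subordinate: int = 0


def assign_bus_numbers(bridge: Bridge, primary: int, next_free: int) -> int:
    """Number the buses below `bridge`; return the highest bus number used."""
    bridge.primary = primary        # bus on the upstream side
    bridge.secondary = next_free    # bus on the downstream side
    bridge.subordinate = 0xFF       # provisional until the search completes
    highest = bridge.secondary
    for child in bridge.children:
        highest = assign_bus_numbers(child, bridge.secondary, highest + 1)
    bridge.subordinate = highest    # final value after the downstream search
    return highest
```

For a root port whose link leads to a switch with two downstream ports, calling `assign_bus_numbers(root_port, 0, 1)` gives the root port a secondary bus number of 1 and a subordinate bus number of 4, because the upstream port of the switch takes bus 2 and the two downstream ports take buses 3 and 4.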
Then, the initialization program performs setting of a memory base address register and an I/O base address register. The memory base address register indicates a starting address of a memory space allocated to the secondary bus side, and the I/O base address register indicates a starting address of an I/O space allocated to the secondary bus side.
Subsequently, the initialization program performs a search for a PCI-Express device connected downstream of the bridge device. The search of the downstream side is performed recursively, and when the search is completed, final subordinate bus numbers are determined in order of the PCI-Express switches 209 and 208. Further, values of a memory limit address and an I/O limit address are determined, and the values are stored in the corresponding registers in the PCI configuration space. The memory limit address indicates the upper end of the memory space allocated to the secondary bus side and thus, together with the memory base address, determines the size of the memory space. Similarly, the I/O limit address indicates the upper end of the I/O space allocated to the secondary bus side. When recursively searching the downstream side, the initialization program is able to calculate the sizes of the memory space and the I/O space allocated to each link by holding the values set in the BARs of the devices existing on each link. Thus, values of a memory limit address and an I/O limit address in a bridge device on the upstream side are obtained.
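How held BAR settings lead to a base address and a limit address can be sketched as follows. The sketch rests on assumptions that are not stated in the present description: the limit is treated as the last address of the window, and the window is rounded to 1 MiB boundaries, as is conventional for the memory window of a PCI-PCI bridge. The function name and the addresses in the usage are hypothetical.

```python
MIB = 1 << 20  # assumed granularity of a bridge memory window


def bridge_memory_window(allocations):
    """Compute (base, limit) covering every (base, size) allocated downstream.

    Returns None when nothing downstream requested a memory area.
    """
    if not allocations:
        return None
    lowest = min(base for base, _ in allocations)
    highest = max(base + size for base, size in allocations) - 1
    window_base = lowest & ~(MIB - 1)   # round down to a 1 MiB boundary
    window_limit = highest | (MIB - 1)  # round up to the end of a 1 MiB block
    return window_base, window_limit
```

For example, downstream areas of 0x1000 bytes at 0xE0100000 and 0x10000 bytes at 0xE0200000 give a window from 0xE0100000 to 0xE02FFFFF. A bridge further upstream would apply the same calculation to the windows of the bridges below it.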
Thus, the initialization program completes the setting of the PCI-Express fabric. In the PCI-Express fabric 200 illustrated in FIG. 24, the initialization program first performs setting of the PCI-Express root complex 201. Next, the initialization program performs setting of the PCI-Express endpoint 202, the PCI-Express endpoint 203, the PCI-Express switch 208, the PCI-Express endpoint 204, the PCI-Express switch 209, the PCI-Express endpoint 206, the PCI-Express endpoint 207, and the PCI-Express switch 209 in this order. In the second setting performed on the PCI-Express switch 209, an I/O limit address, a memory limit address, and a subordinate bus number are set. Next, the initialization program performs setting of the PCI-Express endpoint 205 and the PCI-Express switch 208 in this order. In this setting of the PCI-Express switch 208, an I/O limit address, a memory limit address, and a subordinate bus number are set. Subsequently, the initialization program sets an I/O limit address, a memory limit address, and a subordinate bus number with respect to the PCI-Express root complex 201.
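The setting order described above corresponds to a depth-first traversal in which each bridge device is visited once before and once after its downstream devices. The following sketch reproduces that order. The topology is an assumption inferred from the order given in the text rather than taken from FIG. 24 itself, and the node labels and the function name are hypothetical.

```python
# Assumed topology: bridge device -> devices directly downstream of it.
FABRIC = {
    "RC201": ["EP202", "EP203", "SW208"],
    "SW208": ["EP204", "SW209", "EP205"],
    "SW209": ["EP206", "EP207"],
}


def setup_order(node, tree):
    """List the devices in the order in which the initialization program sets them."""
    children = tree.get(node)
    if children is None:
        return [node]        # endpoint: set once (BARs and related registers)
    order = [node]           # first visit: BARs, bus numbers, base addresses
    for child in children:
        order += setup_order(child, tree)
    order.append(node)       # second visit: limit addresses, subordinate bus number
    return order
```

Under this assumed topology, `setup_order("RC201", FABRIC)` visits the PCI-Express switch 209 a second time before the PCI-Express endpoint 205, and finishes with the second visits to the PCI-Express switch 208 and the PCI-Express root complex 201.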
In the case of PCI-Express, when an initialization program accesses a PCI configuration space in a PCI-Express device, the initialization program issues a configuration read request or a configuration write request. A BDF number is written into each of the requests as information for identifying a destination device. The PCI-Express device side holds a destination BDF number included in a configuration write request as a BDF number of the local device.
The BDF number is written into a request as information indicating a source PCI-Express device when a request such as a memory read request and a memory write request is issued from the PCI-Express device side.
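The two roles of the BDF number, as a destination in a configuration request and as a source in a request issued by the device, can be sketched as follows. The class, its method names, and the dictionary fields are hypothetical and do not reflect the actual packet format; the sketch only models the behavior described above.

```python
class DeviceFunction:
    """Hypothetical model of a PCI-Express device learning its own BDF number."""

    def __init__(self):
        self.own_bdf = None   # unknown until a configuration write arrives
        self.config = {}      # register offset -> value

    def handle_config_write(self, destination_bdf, offset, value):
        # The destination BDF of the request is held as the BDF of the local device.
        self.own_bdf = destination_bdf
        self.config[offset] = value

    def make_memory_read(self, address, length):
        """Build a request that carries the held BDF as the source identifier."""
        if self.own_bdf is None:
            raise RuntimeError("BDF number not yet held by the device")
        return {
            "type": "memory_read",
            "source_bdf": self.own_bdf,
            "address": address,
            "length": length,
        }
```

In this model, a device that has not yet received any configuration write cannot identify itself as a source, which is why the initialization program accesses the PCI configuration space before the device issues requests.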
It is anticipated that, as the number of virtual machines operating on a server continues to increase and one or more I/O devices are to be allocated to each virtual machine by the PCI pass-through or the SR-IOV, the required number of I/O devices will no longer fit in an enclosure of a server or a PC. Although five or six I/O devices can generally be installed in a 2U server (a thickness of the enclosure being 8.89 cm), that number is far too small compared with the number of virtual machines.
In view of the situation described above, a technology has been developed that makes it possible to connect more I/O devices to a PC or a server than the number of PCI-Express slots provided in advance. Specifically, the technology extends a PCI-Express fabric, which previously existed only inside an enclosure of a server or a PC, to outside the enclosure, and connects an I/O box equipped with I/O devices to the PC or the server with a cable and a switch, so that more I/O devices than the number of PCI-Express slots can be connected. Products based on such a technology are disclosed in NPLs 1 and 2.
A technology disclosed in NPL 1 extends a PCI-Express fabric to outside an enclosure by an I/O card simulating an upstream-side function of a PCI-Express switch, an I/O extension box simulating a downstream-side function, and Ethernet (registered trademark) connecting the components. A technology disclosed in NPL 2 extends a PCI-Express fabric to outside an enclosure by an I/O card and a cable that extend a PCI-Express signal.