1. Field of the Invention
The present invention relates generally to data processing systems and more particularly to communications in a data processing system including multiple host computer systems and one or more adapters where the host computer systems share the adapter(s) and communicate with those adapter(s) through a PCI switched-fabric bus. Still more specifically, the present invention relates to a computer-implemented method, apparatus, and computer program product for translating bus/device/function numbers and routing communications packets that include those numbers through a PCI switched-fabric that utilizes PCI switches to enable multiple host computer systems to share one or more adapters.
2. Description of the Related Art
A conventional PCI bus is a local parallel bus that permits expansion cards to be installed within a single computer system, such as a server or a personal computer. PCI-compliant adapter cards can then be coupled to the PCI bus in order to add input/output (I/O) devices, such as disk drives, network adapters, or other devices, to the computer system. A PCI bridge/controller is needed in order to connect the PCI bus to the system bus of the computer system. The adapters on the PCI bus can communicate through the PCI bridge/controller with the CPU of the computer system in which the PCI bus is installed. Several PCI bridges may exist within a single computer system. However, these PCI bridges serve to couple multiple PCI buses to the CPU of the computer system in which the PCI buses are installed. If the single computer system includes multiple CPUs, the PCI buses can be utilized by the multiple CPUs of the single computer system.
A PCI Express (PCIe) bus is a recent version of the standard PCI computer bus. PCIe is based on higher-speed serial communications. PCIe is architected specifically with a tree-structured I/O interconnect topology in mind, with a Root Complex (RC) denoting the root of an I/O hierarchy that connects a host computer system to the I/O.
PCIe provides a migration path compatible with the PCI software environment. In addition to offering superior bandwidth, performance, and scalability in both bus width and bus frequency, PCI Express offers other advanced features. These features include quality of service (QoS); aggressive power management; native hot-plug; bandwidth-per-pin efficiency; error reporting, recovery, and correction; innovative form factors; peer-to-peer transfers; and dynamic reconfiguration. PCI Express also enables low-cost product designs through low pin and wire counts. A 16-lane PCI Express interconnect can provide data transfer rates of 8 gigabytes per second.
The host computer system typically has a PCI-to-Host bridging function commonly known as the root complex. The root complex bridges between the PCI bus and a CPU bus, such as HyperTransport™ or the CPU front side bus (FSB). A configuration of multiple host computer systems, each containing one or more root functions, is referred to as a multi-root system. Multi-root configurations that share I/O fabrics have not been addressed well in the past.
Today, PCIe buses do not permit sharing of PCI adapters among multiple separate computer systems. Known I/O adapters that comply with the PCIe standard or a secondary network standard, such as Fibre Channel, InfiniBand, or Ethernet, are typically integrated into blades and server computer systems and are dedicated to the blade or system in which they are integrated. Having dedicated adapters adds to the cost of each system because an adapter is expensive. In addition to the cost issue, there are physical space concerns in a blade system. There is little space available in a blade for one adapter, and generally no simple way to add more than one.
Being able to share adapters among a number of host computers would lower the connectivity cost per host, since each adapter is servicing the I/O requirements of a number of hosts, rather than just one. Being able to share adapters among multiple hosts can also provide additional I/O expansion and flexibility options. Each host could access the I/O through any number of the adapters collectively available. Rather than being limited by the I/O slots in the host system, the I/O connectivity options include the use of adapters installed in any of the host systems connected through the shared bus.
In known systems, the PCIe bus provides a communications path between a single host and the adapter(s). Read and write accesses to the I/O adapters are converted in the root complex to packets that are transmitted from the host computer system, or a system image that is included within that host computer system, through the PCIe fabric to an intended adapter that is assigned to that host or system image. The PCIe standard defines a bus/device/function (BDF) number (B=PCI Bus segment number, D=PCI Device number on that bus, and F=Function number on that specific device) that can be used to identify a particular function within a device, such as an I/O adapter. The host computer system's root complex is responsible for assigning a BDF number to the host and each function within each I/O adapter that is associated with the host.
The BDF number includes three parts for traversing the PCI fabric: the PCI bus number where the I/O adapter is located, the device number of the I/O adapter on that bus, and the function number of the specific function, within that I/O adapter, that is being utilized.
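The three-part structure of a BDF number can be illustrated with a minimal sketch. The sketch assumes the conventional PCI encoding, in which the bus number occupies 8 bits, the device number 5 bits, and the function number 3 bits of a 16-bit identifier; the class name `BDF` and its methods are hypothetical and are used here for illustration only.

```python
from typing import NamedTuple


class BDF(NamedTuple):
    """Bus/device/function triple, printed in the B.D.F form used in this description."""

    bus: int
    device: int
    function: int

    def pack(self) -> int:
        """Pack into 16 bits: bus in bits 15..8, device in bits 7..3, function in bits 2..0.

        The 8/5/3 bit split is the conventional PCI layout (an assumption of this sketch).
        """
        if not 0 <= self.bus <= 0xFF:
            raise ValueError("bus number must fit in 8 bits")
        if not 0 <= self.device <= 0x1F:
            raise ValueError("device number must fit in 5 bits")
        if not 0 <= self.function <= 0x07:
            raise ValueError("function number must fit in 3 bits")
        return (self.bus << 8) | (self.device << 3) | self.function

    @classmethod
    def unpack(cls, value: int) -> "BDF":
        """Split a 16-bit identifier back into its three parts."""
        return cls((value >> 8) & 0xFF, (value >> 3) & 0x1F, value & 0x07)

    def __str__(self) -> str:
        return f"{self.bus}.{self.device}.{self.function}"
```

Under these assumptions, `BDF(1, 1, 1).pack()` yields `0x0109`, and `BDF.unpack(0x0109)` recovers the original triple.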
A host may include multiple different system images, or operating system images. A system image is an instance of a general purpose operating system, such as WINDOWS® or LINUX®, or a special purpose operating system, such as an embedded operating system used by a network file system device. When a host includes more than one system image, each system image is treated as a different function within the single device, i.e., the host.
Each communications packet includes a source address field and a destination address field. These are memory addresses that are within the range of addresses allocated to the specific end points. These address ranges correlate to specific source BDF and destination BDF values.
Each packet transmitted by a host includes a destination address which corresponds to the mapped address range of the intended adapter. This destination address is used by the host's root complex to identify the correct output port for this specific packet. The root complex then transmits this packet out of the identified port.
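The port selection described above can be sketched as a lookup over mapped address ranges. This is an illustrative sketch only: the `AddressRange` record, the `select_output_port` function, and the example addresses are hypothetical, and an actual root complex performs this decoding in hardware.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AddressRange:
    """One mapped range: [base, base + size) belongs to the adapter behind `port`."""

    base: int
    size: int
    port: int

    def contains(self, address: int) -> bool:
        return self.base <= address < self.base + self.size


def select_output_port(table: Sequence[AddressRange], destination: int) -> Optional[int]:
    """Return the output port whose mapped range holds `destination`, or None if unmapped."""
    for entry in table:
        if entry.contains(destination):
            return entry.port
    return None


# Hypothetical mapping: two adapters, each with a 64 KiB range.
EXAMPLE_TABLE = [
    AddressRange(base=0x8000_0000, size=0x1_0000, port=0),
    AddressRange(base=0x8001_0000, size=0x1_0000, port=1),
]
```

With the hypothetical table above, a packet addressed to `0x8001_0040` would be sent out of port 1, while an address outside both ranges maps to no port.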
The host is coupled to the I/O adapters using a fabric. One or more switches are included in the fabric. The switches route packets through the fabric to their intended destinations. Switches in the fabric examine the host-assigned adapter BDF to determine if the packet must be routed through the switch, and if so, through which output switch port.
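The routing decision made by a switch can be sketched as follows. The sketch assumes that each downstream switch port is configured with a contiguous range of bus numbers that lie below it, in the manner of the secondary and subordinate bus numbers of a PCI bridge; the names `DownstreamPort` and `route_by_bus` are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class DownstreamPort(NamedTuple):
    """A switch output port and the inclusive range of bus numbers reachable through it."""

    port: int
    first_bus: int
    last_bus: int


def route_by_bus(ports: Sequence[DownstreamPort], bus: int) -> Optional[int]:
    """Return the output port for the bus part of a BDF number.

    None means that no downstream port claims the bus, so the packet is not
    routed downstream through this switch.
    """
    for candidate in ports:
        if candidate.first_bus <= bus <= candidate.last_bus:
            return candidate.port
    return None


# Hypothetical configuration: port 1 leads to buses 2..3, port 2 leads to buses 4..7.
EXAMPLE_PORTS = [DownstreamPort(1, 2, 3), DownstreamPort(2, 4, 7)]
```

In this hypothetical configuration, a packet whose BDF number carries bus number 5 leaves through port 2, and a packet carrying bus number 9 is not claimed by any downstream port.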
According to the PCIe standard, the root complex within a host assigns BDF numbers for the host and for the adapters. The prior art assumes that only one host is coupled to the fabric. When only one host is coupled to the fabric, the BDF numbers cannot overlap, because a single root complex is responsible for assigning all of them. If there is no overlap, switches are able to properly route packets to their intended destinations.
A root complex follows a defined process for assigning BDF numbers. The root complex assigns a BDF number of 0.0.1 to a first system image, a BDF number of 0.0.2 to a second system image, and so on.
Physical I/O adapters are typically virtualized such that a physical I/O adapter appears as multiple separate virtual I/O adapters. Each one of these virtual adapters is a separate function.
Each virtual I/O adapter is associated with a system image. One physical I/O adapter can be virtualized into virtual I/O adapters that are each associated with different system images. For example, if the host includes three system images, a physical I/O adapter can be virtualized into three virtual I/O adapters where each virtual I/O adapter is associated with a different system image. Further, a system could include several physical I/O adapters, each including one or more virtual adapters. The virtual I/O adapters would then be associated with the different system images of the single host. For example, a first physical I/O adapter might include a first virtual I/O adapter that is associated with a first system image of the host and a second virtual I/O adapter that is associated with a second system image of the host. A second physical adapter might include only a single virtual I/O adapter that is associated with a third system image of the host. A third physical adapter might include two virtual adapters, the first associated with the second system image and the second associated with the third system image.
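The associations in the foregoing example can be written out as a small data structure. The sketch below is illustrative only; the identifiers (`adapter1`, `va1`, `si1`, and so on) and the function `adapters_by_system_image` are hypothetical labels for the physical adapters, virtual adapters, and system images named in the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

VirtualAdapter = Tuple[str, str]  # (physical adapter, virtual adapter on it)

# Each virtual I/O adapter is associated with exactly one system image.
ASSIGNMENTS: Dict[VirtualAdapter, str] = {
    ("adapter1", "va1"): "si1",
    ("adapter1", "va2"): "si2",
    ("adapter2", "va1"): "si3",
    ("adapter3", "va1"): "si2",
    ("adapter3", "va2"): "si3",
}


def adapters_by_system_image(
    assignments: Dict[VirtualAdapter, str],
) -> Dict[str, List[VirtualAdapter]]:
    """Invert the mapping: list the virtual adapters that serve each system image."""
    result: Dict[str, List[VirtualAdapter]] = defaultdict(list)
    for virtual_adapter, system_image in assignments.items():
        result[system_image].append(virtual_adapter)
    return dict(result)
```

Inverting the mapping shows, for instance, that the second system image is served by virtual adapters on both the first and the third physical adapter.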
If multiple hosts are simultaneously coupled to the fabric, there will be overlap of the BDF numbers that are selected by the root complexes of the hosts. Overlap occurs because each host will assign a BDF number of 0.0.1 to itself. Thus, a BDF number that should identify exactly one function in exactly one host no longer does so.
The root complex assigns a BDF number of 1.1.1 to the first function within a first adapter that the root complex sees on a first bus. This process continues until all BDF numbers are assigned.
Unique memory address ranges are assigned to each device as needed for that device to operate. These address ranges correspond to the assigned BDF numbers, but only the root complex maintains a table of the corresponding values, which it uses to route packets.
If multiple hosts are coupled to the fabric, each host's root complex will assign a BDF number of 1.1.1 to the first function within a first adapter that the root complex sees on a first bus. This results in the BDF number 1.1.1 being assigned to multiple different functions. The BDF numbers used by the multiple hosts therefore overlap. In a similar fashion, the memory address ranges assigned on each host for its devices will overlap with the memory address ranges assigned on other hosts to their devices. When the BDF numbers and memory address ranges overlap, switches are unable to properly route packets.
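The overlap can be demonstrated with a minimal sketch in which two hosts enumerate independently. The starting values 0.0.1 and 1.1.1 are taken from the description above; the rule that further functions on the same adapter receive 1.1.2, 1.1.3, and so on is an assumption made for this sketch, and the function names `enumerate_host` and `overlapping_bdfs` are hypothetical.

```python
from typing import Dict, List


def enumerate_host(host: str, system_images: int, adapter_functions: int) -> Dict[str, str]:
    """Assign BDF numbers as one root complex would, with no knowledge of other hosts.

    System images receive 0.0.1, 0.0.2, and so on. The functions of the first
    adapter on the first bus receive 1.1.1, 1.1.2, and so on (assumed rule).
    """
    assigned: Dict[str, str] = {}
    for n in range(1, system_images + 1):
        assigned[f"0.0.{n}"] = f"{host} system image {n}"
    for n in range(1, adapter_functions + 1):
        assigned[f"1.1.{n}"] = f"{host} adapter function {n}"
    return assigned


def overlapping_bdfs(first: Dict[str, str], second: Dict[str, str]) -> List[str]:
    """Return the BDF numbers that both hosts assigned, in sorted order."""
    return sorted(set(first) & set(second))


host_a = enumerate_host("host A", system_images=2, adapter_functions=2)
host_b = enumerate_host("host B", system_images=1, adapter_functions=1)
print(overlapping_bdfs(host_a, host_b))  # ['0.0.1', '1.1.1']
```

Both hosts assign 0.0.1 and 1.1.1, so a switch that sees either number in a packet cannot tell which host's function is meant.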
Therefore, a need exists for a method, apparatus, and computer program product for address translation and routing of communications packets through a fabric of interconnected multi-root switches, where the fabric couples one or more host systems, each of which has one or more system images, to one or more physical adapters, each of which provides one or more virtual adapters.