The emergence of cloud computing applications has increased the demand for off-site installations, known as data centers, that store data and run applications accessed by users of remotely connected computing devices. Such data centers typically have massive numbers of servers, switches, and storage devices to store and manage data. A typical data center has physical rack structures with attendant power and communication connections. The racks are arranged in rows throughout the room, or rooms, of the data center. Each rack includes a frame that has slots or chassis between two side walls. The slots may hold multiple network devices such as servers, switches, and storage devices. Many such network devices are stacked in the rack structures of a modern data center. For example, some data centers have tens of thousands of servers, attendant storage devices, and network switches. Thus, a typical data center may include tens of thousands, or even hundreds of thousands, of devices in hundreds or thousands of individual racks.
In order to allocate resources efficiently, data centers include many different types of devices in a pooled arrangement. Such pooled devices may be assigned to different host servers as the need for resources arises. Such devices may be connected to a host server via Peripheral Component Interconnect Express (PCIe) protocol links that may be activated by a PCIe switch.
Thus, many modern data centers now support disaggregated architectures with numerous pooled devices. An example of such a data center 10 is shown in FIG. 1A. A system administrator may access a composer application 12 that allows configuration data to be sent via a router 14 to a PCIe fabric box 20. The PCIe fabric box 20 includes numerous serial expansion bus devices, such as PCIe compatible devices that may be accessed by other devices in the data center. In this example, the PCIe fabric box 20 includes a fabric controller 22 that may receive configuration data through a network from the router 14. The fabric box 20 includes PCIe switches, such as the PCIe switches 24 and 26, that allow host devices such as host servers 30, 32, 34, and 36 to be connected to different PCIe devices in the fabric box 20. The PCIe switch 24 includes upstream ports 40 and 42, and the PCIe switch 26 includes upstream ports 44 and 46. The upstream ports 40, 42, 44, and 46 are connected via a cable to the host servers 30, 32, 34, and 36. The PCIe switch 24 also includes downstream ports 50, 52, 54, and 56. The PCIe switch 26 includes downstream ports 60, 62, 64, and 66. In this example, there are multiple devices in the fabric box 20 coupled to the respective downstream ports of the switches 24 and 26. These devices may be accessed by any of the host servers 30, 32, 34, and 36.
As shown in FIG. 1A, two host servers 30 and 32 are directly coupled to the upstream ports 40 and 42 of the switch 24, while two host servers 34 and 36 are directly coupled to the upstream ports 44 and 46 of the switch 26. In this example, devices 70, 72, 74, and 76 are directly coupled to the downstream ports 50, 52, 54, and 56 of the PCIe switch 24. Devices 80, 82, 84, and 86 are directly coupled to the downstream ports 60, 62, 64, and 66 of the PCIe switch 26. Additional devices and host servers may be supported by adding additional PCIe switches. The example system 10 allows certain system resources to be removed from host servers and provided by the external pooled device box 20 instead. Thus, different types of system resources may be allocated to the needs of different servers as they arise. For example, the devices 70, 72, 74, 76, 80, 82, 84, and 86 may each be a resource such as a non-volatile memory express (NVMe) storage device, a graphics processing unit (GPU), a field programmable gate array (FPGA), a network interface card (NIC), or another kind of PCIe compatible device. Each such device may be dynamically assigned to hosts, such as the host servers 30, 32, 34, and 36.
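The dynamic assignment of pooled devices to host servers described above can be sketched as a simple bookkeeping model. This is an illustrative sketch only; the `FabricBox` class and the device and host names are hypothetical and do not correspond to any specific fabric controller implementation:

```python
# Hypothetical model of a PCIe fabric box that dynamically assigns pooled
# devices (e.g., NVMe, GPU, FPGA, NIC) to host servers on demand.

class FabricBox:
    def __init__(self, devices):
        # Pool of PCIe-compatible devices, keyed by name; None = unassigned.
        self.assignments = {dev: None for dev in devices}

    def assign(self, device, host):
        """Bind an unassigned pooled device to a host server."""
        if self.assignments.get(device) is not None:
            raise ValueError(f"{device} already assigned to {self.assignments[device]}")
        self.assignments[device] = host

    def release(self, device):
        """Return a device to the pool so another host can claim it."""
        self.assignments[device] = None

    def devices_for(self, host):
        """List the devices currently allocated to a given host server."""
        return [d for d, h in self.assignments.items() if h == host]


box = FabricBox(["NVMe-70", "GPU-72", "FPGA-74", "NIC-76"])
box.assign("GPU-72", "host-30")
box.assign("NVMe-70", "host-32")
print(box.devices_for("host-30"))  # ['GPU-72']
```

In a real system, the binding and release would be carried out by the fabric controller 22 in response to configuration data from the composer application 12.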
The major interface between the host servers and the fabric box typically conforms to the Peripheral Component Interconnect Express (PCIe) standard. In FIG. 1A, the four PCIe links from the two PCIe switches 24 and 26 may each support one of four PCIe lanes. Each of the PCIe lanes is thus assigned to one of the host servers 30, 32, 34, and 36. However, different implementations may be used to increase performance by increasing the maximum data exchanged between the host servers and the fabric box. For example, one link may support four PCIe lanes; two links may support eight PCIe lanes; or four links may support sixteen PCIe lanes. The number of lanes supported by the links depends on the performance requirement of the fabric box being used. In general, the greater the number of lanes (the greater the link width), the greater the transmission speed that may be achieved.
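The link-width arithmetic above reduces to multiplying the number of links by the lanes carried per link. A minimal sketch, assuming four lanes per link for each of the three example host interface configurations (the function name is illustrative):

```python
# Total lane count for a fabric box host interface: number of links
# multiplied by the lanes carried per link.

def total_lanes(num_links, lanes_per_link):
    return num_links * lanes_per_link

# The three example configurations, each assuming 4 lanes per link:
for links in (1, 2, 4):
    print(f"{links} link(s) x 4 lanes = {total_lanes(links, 4)} total lanes")
```

The greater the total lane count (the link width), the greater the aggregate transmission speed between the host servers and the fabric box.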
FIG. 1B shows a fabric box 100 having a host interface of one link supporting 4 PCIe lanes. Like fabric box, device, and host server elements are labeled with identical element numbers in FIG. 1B as in FIG. 1A. The fabric box 100 has a single PCIe switch 102 with four lanes that are connected to the four host servers 30, 32, 34, and 36. The single PCIe switch 102 has downstream ports 110a-110h that are connected to devices 70, 72, 74, 76, 80, 82, 84, and 86. The single PCIe switch 102 has upstream ports 120, 122, 124, and 126 that are connected to respective host servers 30, 32, 34, and 36. Cabling for the fabric box 100 is relatively straightforward, as only one cable is required to connect each of the four host servers 30, 32, 34, and 36 to the PCIe switch 102.
FIG. 1C shows a fabric box 130 having a host interface of two PCIe links supporting 8 PCIe lanes. Like fabric box, device, and host server elements are labeled with identical element numbers in FIG. 1C as in FIG. 1A. The fabric box 130 includes two PCIe switches 132 and 134. The PCIe switch 132 includes downstream ports 136a-136d that are connected to devices 70, 72, 74, and 76, respectively. The PCIe switch 134 includes downstream ports 138a-138d that are connected to devices 80, 82, 84, and 86, respectively. In this example, the switch 132 includes two upstream links 140 and 142 that are coupled to the host servers 30 and 32, respectively. The switch 134 includes two upstream links 144 and 146 that are coupled to the host servers 34 and 36, respectively. Each of the upstream links 140, 142, 144, and 146 includes two lanes, each through one cable. Therefore, there is one cable for each lane that connects respective upstream ports to ports on the host server.
FIG. 1D shows a fabric box 150 having a host interface of four PCIe links supporting 16 PCIe lanes. Like fabric box, device, and host server elements are labeled with identical element numbers in FIG. 1D as in FIG. 1A. The fabric box 150 includes two PCIe switches 152 and 154. The PCIe switch 152 includes downstream ports 156a-156d that are connected to devices 70, 72, 74, and 76, respectively. The PCIe switch 154 includes downstream ports 158a-158d that are connected to devices 80, 82, 84, and 86, respectively. In this example, the switch 152 includes two upstream links 160 and 162 that are coupled to the host servers 30 and 32, respectively. The switch 154 includes two upstream links 164 and 166 that are coupled to the host servers 34 and 36, respectively. Each of the upstream links 160, 162, 164, and 166 includes four lanes. Therefore, there are four cables for each link that connect respective upstream ports to ports on the host server.
For larger configurations, such as those with multiple links resulting in 8 lanes or 16 lanes, multiple cables are required for each link. The cable order is critical for the PCIe physical link between the switch and the host server to be successfully configured at maximum link speed. For example, in the case of a host server having four ports, each of which carries a PCIe lane, the cables should be connected to the corresponding upstream ports on the PCIe switch in the same order. FIG. 2A is a close-up diagram of the correct cable connection between the switch 152 and the host server 30 in FIG. 1D.
As may be seen in FIG. 2A, the PCIe link 160 has four lanes that include a series of four cables 170, 172, 174, and 176 that connect to four respective upstream ports 180, 182, 184, and 186 of the switch 152. The other ends of cables 170, 172, 174, and 176 are connected to ports 190, 192, 194, and 196 of the host server 30. Thus, the group of ports on both the switch 152 and the host server 30 are connected in the same order. This configuration allows maximum speed for the PCIe link 160, as all four lanes may be used.
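The order-preserving requirement above can be expressed as a simple check: when the cables of a multi-cable link are listed as (host port, switch port) pairs, the i-th lowest host port must be wired to the i-th lowest upstream port. A minimal sketch, assuming the port numbering of FIG. 2A (the function name is hypothetical):

```python
# Check that a multi-cable PCIe link preserves lane order: each host port
# must connect to the switch upstream port in the same relative position.

def cabling_in_order(cables):
    """cables: list of (host_port, switch_port) pairs, one per lane."""
    host_order = sorted(c[0] for c in cables)
    switch_order = sorted(c[1] for c in cables)
    # The i-th lowest host port must pair with the i-th lowest switch port.
    return sorted(cables) == list(zip(host_order, switch_order))

# Correct wiring from FIG. 2A: both sides connected in the same order.
good = [(190, 180), (192, 182), (194, 184), (196, 186)]
print(cabling_in_order(good))  # True

# Misconnection as in FIG. 2B: cables from ports 192 and 194 swapped.
bad = [(190, 180), (192, 184), (194, 182), (196, 186)]
print(cabling_in_order(bad))  # False
```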
However, when one of the cables 170, 172, 174, and 176 is misconnected, all of the connections of the link are affected. FIG. 2B is a close-up view of the system in FIG. 1D, where the cable 172 from the port 192 has been misconnected to the upstream port 184 of the PCIe switch 152. The cable 174 from the port 194 has likewise been misconnected to the upstream port 182 of the PCIe switch 152. This misconnection, which changes the order of the ports, is a misconfiguration that causes the PCIe speed of the link to be downgraded from 4 lanes to 1 lane.
With the larger number of cables required for PCIe link configurations with multiple lanes, such misconnections become increasingly likely. FIG. 2B therefore illustrates a relatively common occurrence in which a user connects cables in the wrong order, resulting in a PCIe speed downgrade when the design uses multiple cables for a PCIe link group. Currently, there is no ready method to inform a data center operator that a misconfiguration of cables has occurred, and communication between an allocated device and a host server therefore suffers.
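One way the detection gap above could be addressed is to compare the link width actually negotiated at link training time against the width the configuration expects, and to flag an unexpected downgrade as a suspected cable misconnection. The sketch below is illustrative only; the function and message text are hypothetical and not drawn from any specific PCIe software stack:

```python
# Flag a suspected cable misconnection when a link trains at a narrower
# width than its configuration expects (e.g., x1 instead of x4).

def check_link(negotiated_width, expected_width=4):
    if negotiated_width < expected_width:
        return (f"WARNING: link trained at x{negotiated_width} instead of "
                f"x{expected_width}; check cable order between switch and host")
    return "OK: link at full width"

# The misconnection in FIG. 2B downgrades the link from 4 lanes to 1 lane:
print(check_link(1))
print(check_link(4))  # OK: link at full width
```

In practice, the negotiated width could be read from the host's PCIe configuration space after link training and reported to the data center operator when it falls short of the expected value.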
Thus, there is a need for the identification of a misconnection of cables in a multi-lane configuration of PCIe links. There is a further need for automatic detection of the misconnection of cables between a shared resource box and a host server. There is a further need for training a host server device to detect an unexpected reduction in link speed to indicate a cable misconnection.