The present disclosure relates generally to information handling systems, and more particularly to a method and apparatus for recovering from a failed I/O controller in an information handling system.
As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
In conjunction with information handling systems, there is always a need in the enterprise space to increase the availability of servers. That is, a server should be able to run, with failed components, until a service person is able to correct the problem, rather than being taken offline.
One environment, e.g., the Power Edge 8755 server available from Dell Computer of Round Rock, Tex., includes PCI slots located on the server's IO board. The IO board houses two P64H2 PCI bus controllers. If a PCI bus controller fails, then the server will lock up and reboot. Once the server reboots, the failed PCI bus controller will be disabled and any adapters plugged into the PCI slots behind that controller will be unable to function. In addition, if the boot device is in a PCI slot connected to the failed PCI bus controller, the server will not boot. Such an occurrence is undesirable.
In the current generation of servers, there are three methods used to combat a PCI controller failure, as discussed below.
1) Adding redundant components to the server system which support failover or adapter teaming can be used to combat a PCI controller failure. However, this approach involves using two of the same PCI adapters connected to the same location in a master-slave configuration coupled with some type of failover or teaming driver. Examples can include redundant NICs, Fibre Channel HBAs, or SCSI RAID controllers. For server offerings sized less than three (3) rack units, there is a finite limit on the number of PCI slots available in a given system. This lack of adapter space is further magnified by the emergence of blade and brick servers. With this type of space utilization, it becomes difficult to populate two slots for one function. There is simply too high a premium on doubling up the number of adapters.
2) Microsoft Cluster Server (MSCS) clustering is another method of recovering from a PCI bus failure. With MSCS clustering, however, an identically configured server is connected to the same storage and held in a passive state until the first server encounters a failure. Once any component of the first server fails, all of its operations are taken over by the second server. The primary drawback to this scheme is that keeping a duplicate server for use ‘only in the case of an emergency’ can be cost prohibitive. A customer ends up paying 2× for 1× the performance. Furthermore, MSCS clustering is only applicable when using direct-attached storage.
3) Another option is to reboot the server with the failed PCI component disabled. Such a method is effective as long as the failed PCI controller did not house any system-critical devices behind it. If the boot device or any adapter with connectivity to external media is present on the failed PCI bus, however, this scheme is rendered useless. If system-critical devices are present behind the failed controller, manual reconfiguration of the PCI devices will be necessary to continue worthwhile operations. This highlights the difference between uptime and true high availability.
FIG. 1 illustrates a block diagram view of an I/O design for an information handling system known in the art and susceptible to PCI controller failure as discussed herein. The I/O design 10 includes first and second I/O controllers (12, 14). The first I/O controller 12 controls first and second PCI slots (16, 18). The second I/O controller 14 controls first and second PCI slots (20, 22). The bus speeds of the I/O controllers are controlled via respective I/O bus speed strapping inputs (24, 26).
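For illustration only, the topology of FIG. 1 and the consequence of a controller failure may be sketched in Python as follows. This is a minimal model, not part of the design itself: the names `Controller`, `usable_slots`, and `can_boot` are hypothetical, and the integers are merely the reference numerals of FIG. 1. The sketch assumes, as described above, that every slot behind a failed controller becomes unavailable after the reboot.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    """Hypothetical model of one I/O controller of FIG. 1."""
    ref: int                 # reference numeral of the controller (12 or 14)
    slots: tuple[int, ...]   # reference numerals of the PCI slots it controls
    failed: bool = False     # True once the controller has failed


def usable_slots(controllers):
    """Return the slots that remain usable, i.e. those behind a working controller."""
    return [slot for c in controllers if not c.failed for slot in c.slots]


def can_boot(controllers, boot_slot):
    """The server can boot only if the boot device sits behind a working controller."""
    return boot_slot in usable_slots(controllers)


# I/O design 10: controller 12 -> slots 16, 18; controller 14 -> slots 20, 22.
io_design = [Controller(12, (16, 18)), Controller(14, (20, 22))]

# Assume the first I/O controller 12 fails.
io_design[0].failed = True

print(usable_slots(io_design))   # slots behind controller 14 only
print(can_boot(io_design, 16))   # boot device behind the failed controller
print(can_boot(io_design, 20))   # boot device behind the working controller
```

In this model, a failure of controller 12 leaves only slots 20 and 22 usable, so a boot device in slot 16 can no longer be reached, whereas a boot device in slot 20 is unaffected.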
Accordingly, it would be desirable to provide a method and apparatus for booting a server with a failed PCI bus controller, which may or may not have a boot device behind it, absent the disadvantages found in the prior methods discussed above.