A Network Interface Controller or Network Interface Card (collectively, “NIC”) is often understood to be a circuit board or card-type device installed in a computer or similar system having a data processing means, and it is also often believed to provide a full-time connection to a network. Although a NIC is capable of providing a full-time connection, in many circumstances that capability is not realized, because the NIC often drops its connection with its link partner upon the occurrence of various events such as, but not limited to: rebooting, Preboot Execution Environment (PXE) ROM loading, network device driver loading, and the like. Though the connection drop during these events has long been accepted in the computing environment as a means to assure the reliability, state, or integrity of a link partner, these events now often result in reset situations where significant time is lost, improved conveniences are forgone, and communication gaps create functional inoperability within the environment. Therefore, while improvements such as PXE environments, in which the computer can boot from the network, offer added convenience, this additional capability and performance is often sacrificed when the communication link between the NIC and the link partner fails.
A routine example of this lost effort is known to occur in a blade system or a server blade environment (the terms are used interchangeably herein). A blade system is a data processing system that has a server chassis or cabinet and a plurality of racks. Often, each cabinet rack can hold a rack-mounted device (also referred to herein as a blade or server blade) on which one or more general purpose processors and/or memory devices are attached. A blade or blade server is mounted in a rack as an input/output (I/O) device, though other devices capable of being mounted to provide I/O capabilities are also contemplated. The racks are typically vertically spaced within the cabinet according to an industry standard displacement (the “U”). Cabinets and racks are characterized in terms of this dimension such that, for example, a 42U cabinet is capable of receiving 42 1U rack-mounted devices, 21 2U devices, and so forth. Dense server designs are also becoming available, which allow a server chassis to be inserted into a cabinet rack, thus allowing densities greater than one server per 1U. To achieve these greater densities, the server chassis may provide shared components, such as power supplies, fans, or media access devices, which can be shared among all of the blades in the server blade chassis.
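By way of illustration only, the cabinet capacity arithmetic described above may be sketched as follows. The function name `devices_per_cabinet` is hypothetical and is used solely for this example; the sketch assumes every device in the cabinet has the same height and that no rack units are reserved for other equipment.

```python
def devices_per_cabinet(cabinet_units: int, device_units: int) -> int:
    """Return how many devices of a given height (in U) fit in a cabinet.

    Hypothetical helper: assumes uniform device height and that no rack
    units are reserved for switches, power, or cabling.
    """
    if cabinet_units < 0 or device_units <= 0:
        raise ValueError("cabinet_units must be >= 0 and device_units > 0")
    # Only whole devices can be mounted, so round down.
    return cabinet_units // device_units


print(devices_per_cabinet(42, 1))  # 42 devices of height 1U
print(devices_per_cabinet(42, 2))  # 21 devices of height 2U
```

A dense chassis that holds several blades within the same number of rack units would exceed these per-U figures, which is the density advantage noted above.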
In a server blade environment, Serial Over Local Area Network (SOL) is a function that provides access to the serial console of a blade. The SOL feature lets servers transparently redirect the serial character stream from the baseboard universal asynchronous receiver/transmitter (UART) to and from a remote client system over the Local Area Network (LAN). The UART generally translates data between parallel and serial interfaces, and is often used for serial data telecommunication as it converts bytes of data to and from asynchronous start-stop bit streams represented as binary electrical impulses.
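By way of illustration only, the conversion of a byte to and from an asynchronous start-stop bit stream may be sketched as follows. The sketch assumes one common framing (one start bit, eight data bits sent least significant bit first, no parity bit, one stop bit); this framing and the function names `frame_byte` and `unframe_bits` are assumptions made for the example and are not taken from the description above.

```python
def frame_byte(value: int) -> list[int]:
    """Wrap one byte in a start bit (0) and a stop bit (1).

    Assumed framing: 8 data bits, least significant bit first, no parity.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    data_bits = [(value >> i) & 1 for i in range(8)]  # LSB first
    return [0] + data_bits + [1]


def unframe_bits(bits: list[int]) -> int:
    """Recover the byte from a 10-bit frame, rejecting bad start/stop bits."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))


frame = frame_byte(0x41)  # ASCII 'A'
print(frame)
print(hex(unframe_bits(frame)))
```

Under this assumed framing, each byte occupies ten bit times on the serial line, and a receiver detects a framing error whenever the start or stop bit is not at its expected level.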
In general, SOL has the following benefits compared to a serial interface: 1) it eliminates the need for a serial concentrator, 2) it reduces the amount of cabling, and 3) it allows remote management of servers without video, mouse, or keyboard (headless servers). Many of the blade server components work together to implement SOL, including, for example, the chassis Management Module (“Management Module”), Baseboard Management Controller (BMC), and Ethernet Switch Module, as well as the blade servers' BIOS, Advanced System Management Processor, and Ethernet interface firmware. Additionally, SOL typically enables the LAN connection to be shared between the operating system (OS) and the BMC.
In some instances, however, an established SOL session may drop for a period in excess of a few seconds, and may even require a number of minutes to recover. Such an extended loss of connectivity causes the SOL session to go to the “Active” and “Not Ready” states in operation. Typically, an operator is then required to wait an additional period of time (e.g., five minutes) for the Management Module to time out the current inactive session, and to wait for the SOL session on the failing blade to go into the “Enabled” state, before the session may be restarted.
A Management Module (MM) is typically a hot-swappable hardware device plugged into a blade environment chassis management bay and functions as a system-management processor (e.g., service processor) and keyboard, video, and mouse (KVM) multiplexor for blade servers. The management module is typically connected to the cluster management Virtual Local Area Networks (VLANs) for security enhancement.
A BMC is typically a specialized microcontroller embedded on the motherboard of many computers, and often, servers. The BMC is the intelligence function in the Intelligent Platform Management Interface (IPMI) architecture, and it manages the interface between system management software and platform hardware. Different types of sensors built into the computer system report to the BMC on parameters such as temperature, cooling fan speeds, power mode, operating system (OS) status, etc. The BMC function is capable of monitoring the sensors and can send alerts to a system administrator or user via the network if any of the parameters do not stay within preset limits, indicating a potential failure of the system. In operation, the administrator can also remotely communicate with the BMC to take some corrective action such as resetting or power cycling the system to get a hung OS running again. Typically, the BMC communicates with a BMC management utility (BMU) on a remote client using IPMI protocols.
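By way of illustration only, the comparison of sensor readings against preset limits may be sketched as follows. The sensor names, limit values, and the names `Limit` and `out_of_limit` are hypothetical; the sketch does not use or represent any actual IPMI interface and merely shows the range check that would precede an alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Preset lower and upper bounds for one sensor (hypothetical values)."""
    low: float
    high: float


def out_of_limit(readings: dict[str, float], limits: dict[str, Limit]) -> list[str]:
    """Return one alert message per reading that falls outside its limits."""
    alerts = []
    for name, value in readings.items():
        limit = limits.get(name)
        if limit is None:
            continue  # no preset limit configured for this sensor
        if not (limit.low <= value <= limit.high):
            alerts.append(f"{name}={value} outside [{limit.low}, {limit.high}]")
    return alerts


# Hypothetical limits and readings, chosen only for the example.
limits = {"temp_c": Limit(5.0, 80.0), "fan_rpm": Limit(1000.0, 12000.0)}
readings = {"temp_c": 91.5, "fan_rpm": 4000.0}
for alert in out_of_limit(readings, limits):
    print(alert)
```

In such a sketch, each returned message would correspond to an alert sent to the administrator over the network, which in turn depends on the network link remaining available.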
However, even after an established SOL session has dropped and has then been restarted based on an understanding of the simplistic link protocols of the link partner, normal SOL communication remains only a possibility and is not guaranteed.
In another scenario, if the SOL session drop occurs during the loading of the operating system and a user wishes to change configuration information while the SOL session is “Not Ready”, an additional extended period of time results as the user waits until the operating system is completely loaded. Only after the operating system completes its loading may the user restart the operating system. Only during this second operating system loading would the user then be able to obtain access to the serial interface and to the system resources.
Some have also attempted scripted solutions to overcome the delays resulting from dropped sessions. However, even where customized scripts have been used to transmit redundant packets along a communication link numerous times in order to extend disruption or retry intervals, these attempts have not been successful, given new technologies mandating low disruption times and additional features as set forth herein.
Additionally, and again unfortunately, these extended delays now result in operability delays and failures, because concurrent features such as KVM, virtual media tray, etc. are less tolerant of link disruption than SOL, for example; such concurrent features often require a maximum disruption time of no more than 50 milliseconds (ms), and some no more than 5 ms. For instance, the IBM® BladeCenter provides built-in KVM access to each blade server in the chassis, allowing operators to remotely access system BIOS information and graphical user interfaces (GUIs). Unlike serial console access, the KVM interface provides a GUI that requires a human operator to use it and to interpret all data. Since the KVM is generally used to provide occasional access to systems for maintenance and emergency recovery, timing is essential. As can be appreciated, reliable, economical, and efficient techniques for establishing and maintaining communication with NICs and their link partners, where the NIC has knowledge of its environment including its link partner and wiring characteristics, are highly desirable.
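By way of illustration only, the effect of a link disruption on features having different tolerances may be sketched as follows. The feature names `feature_50ms` and `feature_5ms` are placeholders; the description above does not state which concurrent feature carries which tolerance, so no such mapping is implied, and the function name `features_disrupted` is likewise hypothetical.

```python
def features_disrupted(outage_ms: float, tolerance_ms: dict[str, float]) -> list[str]:
    """Return the features whose maximum tolerated disruption is exceeded."""
    return sorted(
        name for name, limit in tolerance_ms.items() if outage_ms > limit
    )


# Placeholder features using the two tolerance figures mentioned above.
tolerances = {"feature_50ms": 50.0, "feature_5ms": 5.0}

print(features_disrupted(3.0, tolerances))     # short outage
print(features_disrupted(20.0, tolerances))    # moderate outage
print(features_disrupted(2000.0, tolerances))  # outage lasting seconds
```

Under these placeholder tolerances, a link drop lasting seconds or minutes, as described for dropped SOL sessions, exceeds both limits by several orders of magnitude.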
Accordingly, what is needed is a method and system for establishing and maintaining communication with NICs and their link partners, where the NIC has knowledge of its environment including its link partner(s).