1. Technical Field
The present invention relates in general to the field of data processing systems. More specifically, the present invention relates to the field of diagnosing problems within data processing systems.
2. Description of the Related Art
In recent years, hardware and software developers have improved server architectures and designs with the goal of more robust and reliable servers for mission critical networking applications. For example, some server applications require that servers respond to client requests in a highly reliable manner.
Additionally, processors implemented in server computers have substantially improved, to the point where processor speeds and bandwidth greatly exceed the capacity of input/output interfaces such as industry standard architecture (ISA), peripheral component interconnect (PCI), Ethernet, etc. This capacity inequality limits both server throughput and the speed at which data can be transferred between servers on a network. Different server standards have been proposed to improve network performance. The differing server standard proposals led to the development of the InfiniBand Architecture Specification, which was adopted by the InfiniBand Trade Association in October 2000. InfiniBand is a trademark of the InfiniBand Trade Association.
The InfiniBand Architecture (IBA) specifications define InfiniBand (IB) operation but limit the scope of the architecture to functions that can be performed over the InfiniBand wires. Given that IBA is a clustering fabric, an entity is needed to initialize, configure, and manage the fabric. IBA defines this entity as a “Subnet Manager” (SM), which is tasked with the role of subnet administration. The SM performs its tasks in-band (i.e., over IB links) and discovers and initializes devices (e.g., switches, host adapters, etc.) that are coupled to the IB fabric.
With the IBA's scope limited to in-band functionality only, any failures that result in loss of in-band communications are difficult to diagnose and time-intensive to remedy. Some IB vendors have attempted to address this shortcoming in a variety of ways, such as “problem isolation” documents or applications that communicate out-of-band with the SM. These applications provide the user a view of the fabric and, in case of in-band failures, log events that may be useful in determining the cause of the failure. While the latter approach can yield additional failure information, the scope is limited to only the observations of the SM. As cluster sizes increase, a one-sided view of fabric failures makes problem isolation difficult and may require a “process of elimination” technique for determining the cause of failures. A “process of elimination” method is cost-prohibitive, since problem determination entails replacement of non-defective parts. Therefore, there is a need for a system and method for addressing the aforementioned limitations of the prior art in detecting the cause of failure in IB networks.