The present invention relates generally to packet-switched computer networks, and specifically to methods and apparatus for testing and diagnosing malfunctions in such networks.
Packet-switched, source-routing computer networks are used in a growing range of applications. Such networks link multiple computer processors, or nodes, via multiple switches. Typically, a packet of data sent from one of the nodes to another passes through a number of different switches. Each switch along the way reads routing information, which is commonly contained in a header of the data packet, and passes the packet on to the next switch along the way, or to the destination node. Typically, there are multiple different paths available through the network over which any given pair of nodes can communicate. An example of this type of network is the well-known Asynchronous Transfer Mode (ATM) network, which is used in communications between separate computers. Such networks are also used in multi-processor computers, such as the RS/6000 Scalable POWERParallel System (SP) series of computers produced by International Business Machines Corporation (Armonk, N.Y.). In the SP computer, as well as in certain other networks, successive packets in a communication stream between the nodes may be sent over different routes.
Because of the complex topology and hardware of packet-switched networks, when a fault occurs in such a network it can be difficult to identify the exact location and nature of the fault. The difficulty is exacerbated by the fact, noted above, that by their nature such networks use multiple different paths between nodes and are fault-tolerant. A network fault will typically appear not as a total breakdown (which would be relatively easy to find), but rather will present more subtle symptoms. For example, there may be a reduction in throughput between some or all of the nodes, or an increase in the number of "bad packets" (data packets whose content is corrupted and must be discarded) at one or more of the nodes.
There are few efficient tools known in the art for diagnosis of such faults. The diagnostic process is time-consuming and heavily reliant on the intuition and experience of a human system administrator (or service engineer) in deciphering and drawing conclusions from the limited information that is available. This information is typically collected in various system files, such as topology files, error logs and trace files, as are known in the art. These files may be recorded at different nodes of the network and must somehow be collated and analyzed by the administrator. Because few network administrators have the know-how to perform this sort of diagnosis, costly service calls are frequently required.
A further problem in diagnosing network faults is non-deterministic failures, which may occur only under certain conditions, and may not arise at all while the diagnostic tests are being performed. Such failures are referred to with terms such as "sporadic," "intermittent," "overheating," "lightning," "aging," or "statics," which generally mean only that the cause of the problem is unknown. For example, a high-speed switch or adapter may behave normally in light traffic, and break down only under certain particular stress conditions. At times the only way to find such a problem is to systematically bombard each suspect component of the network with packets from different sources, at controlled rates, gradually eliminating components from consideration until the failure is found. Such a process is difficult to automate, and may require that the network be taken off-line for an extended period. The cost of such down-time for prolonged testing and repair can be enormous. There is therefore a need for systematic methods of diagnostic testing, which can be performed while the network is on-line.
There is a similar lack of tools and techniques for systematically testing the response of switch-related network software to hardware fault conditions. Such techniques are needed particularly in software development and testing stages, to ensure that the software responds properly when faults occur. Current methods of testing use specially-designed simulation hardware, such as cables with broken pins, together with debugging clauses that can be activated in the software itself and dedicated debugging fields in associated data structures. The fault situations created by such methods, however, are limited to a small range of scenarios, which are for the most part different from the real hardware faults that occur in actual networks. Similarly, the software used in debugging mode for fault simulation is different from the actual software product that will be used in the field. Moreover, these testing tools are incapable of simulating the type of transient, non-deterministic failures described above. They do not allow errors to be injected and altered on the fly during a simulation.
It is an object of some aspects of the present invention to provide improved methods for fault simulation and diagnostics in packet-switched data networks.
It is a further object of some aspects of the present invention to provide apparatus and methods for systematically injecting errors into a data network, for purposes of debugging and diagnostics.
Preferred embodiments of the present invention operate in the context of a packet data network, which comprises a plurality of nodes, or processors, mutually coupled by a plurality of switches, such that typically any one of the nodes can communicate with any other one of the nodes, preferably over multiple links. Each of the nodes is coupled to a respective port of one of the switches by a switch adapter, which performs data link functions, as are known in the art, with respect to each data packet sent or received through the network by the node. One of the nodes is a primary node, which manages the configuration of elements of the network, such as the other nodes and switches in the network.
In preferred embodiments of the present invention, the primary node controls testing and diagnosis of elements of the network in real time, while the network is on-line, or at least with minimal interruption of on-line operation, by appropriately setting parameters of the nodes and switches. The testing preferably includes diagnostic testing to locate suspected faults in the switches and switch adapters. Additionally or alternatively, for the purposes of testing, errors are intentionally injected into the network so as to simulate the response of the network elements to faults that may occur.
In some preferred embodiments of the present invention, one of the nodes of the network is used as an error injector, and is isolated for this purpose from the remaining nodes in the network. While the remaining nodes and switches carry out their normal functions on-line, the error injector injects errors into the network in order to test by simulation the response of elements of the network to actual errors that may occur. Generally, if the network is operating properly, the errors will be rejected and, preferably, reported to and logged by the primary node, while normal functions continue without substantial interruption.
In some of these preferred embodiments, the error injector is used to simulate the effect of a faulty switch adapter, by injecting bad packets into the network.
In other preferred embodiments, the error injector is used to simulate switch faults, such as failures in the ports or central queue or in send and/or receive logic of the switch. Preferably, the error injector sends a command packet to one of the switches, such as a packet instructing the switch to reset its queue or initialize operation, while the switch is in the midst of its normal functions. The error injector prepares the command packet in such a manner that it appears to have originated from the primary node, which is the only node ordinarily entitled to issue such commands. The primary node generally issues a reset or initialization command when it is informed that a fault or error has occurred in a particular switch. Alternatively or additionally, the error injector sends an error reporting packet to the primary node, prepared so as to appear to have originated from one of the switches in the network.
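By way of illustration only, the forged command packet described above can be sketched in Python. The packet fields, the node identifiers and the `Switch` class are hypothetical and are not taken from any actual switch implementation. The sketch assumes only that a switch decides whether to obey a command by inspecting an unauthenticated source field in the packet.

```python
from dataclasses import dataclass
from enum import Enum

PRIMARY_NODE_ID = 0    # hypothetical identifier of the primary node
INJECTOR_NODE_ID = 7   # hypothetical identifier of the error injector


class Command(Enum):
    INITIALIZE = "initialize"
    RESET_QUEUE = "reset_queue"


@dataclass(frozen=True)
class CommandPacket:
    claimed_source: int   # source written in the header; not authenticated
    target_switch: int
    command: Command


class Switch:
    """Toy switch that obeys commands only if they claim to come from the primary node."""

    def __init__(self, switch_id, queued_packets):
        self.switch_id = switch_id
        self.queue = list(queued_packets)
        self.init_count = 0

    def handle(self, packet):
        """Return True if the command was accepted and executed."""
        if packet.target_switch != self.switch_id:
            return False
        if packet.claimed_source != PRIMARY_NODE_ID:
            return False  # only the primary node is entitled to issue commands
        if packet.command is Command.RESET_QUEUE:
            self.queue.clear()
        elif packet.command is Command.INITIALIZE:
            self.queue.clear()
            self.init_count += 1
        return True


def forge_command(target_switch, command):
    """Build a command packet that appears to originate from the primary node."""
    return CommandPacket(PRIMARY_NODE_ID, target_switch, command)
```

In this sketch a packet carrying the injector's own identifier is refused, whereas `forge_command(3, Command.RESET_QUEUE)` is accepted by switch 3 and empties its queue in the midst of normal operation, which is the simulated fault whose consequences are then observed.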
There is therefore provided, in accordance with a preferred embodiment of the present invention, in a computer network system that includes a multiplicity of nodes interconnected by a network of switches, wherein data are normally conveyed in the network according to predetermined conventions, a method for simulation testing of the system, including:
selecting one of the nodes to serve as an error injector;
injecting data into the network from the error injector node in a manner that violates the predetermined conventions, so as to simulate an error condition in the system; and
observing operation of the system following the injection of the data so as to evaluate a response of the system to the error condition.
Preferably, selecting the one of the nodes includes taking the error injector node off-line, while normal data transmission continues among other nodes and switches in the system.
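The steps of selecting, injecting and observing can be illustrated by the following minimal Python sketch. The `Network` and `Packet` classes are hypothetical. The "predetermined convention" is assumed here, purely for the sake of example, to be that the length declared in a packet matches the actual payload length.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    src: int
    dst: int
    declared_len: int
    payload: bytes

    def obeys_conventions(self):
        # Assumed convention: declared length equals actual payload length.
        return self.declared_len == len(self.payload)


@dataclass
class Network:
    node_ids: set
    offline: set = field(default_factory=set)
    delivered: list = field(default_factory=list)
    error_log: list = field(default_factory=list)

    def select_injector(self, node_id):
        """Step 1: take the chosen node off-line to serve as error injector."""
        if node_id not in self.node_ids:
            raise ValueError("unknown node")
        self.offline.add(node_id)
        return node_id

    def send(self, packet):
        """Deliver a packet, or reject and log it if it violates the conventions."""
        if packet.obeys_conventions():
            self.delivered.append(packet)
        else:
            self.error_log.append((packet.src, packet.dst, "length mismatch"))

    def normal_traffic(self):
        """On-line nodes keep exchanging well-formed packets."""
        online = sorted(self.node_ids - self.offline)
        for src in online:
            for dst in online:
                if src != dst:
                    self.send(Packet(src, dst, 2, b"ok"))


def run_simulation(net, injector_id, target_id):
    injector = net.select_injector(injector_id)
    # Step 2: inject data that violates the conventions.
    net.send(Packet(injector, target_id, declared_len=99, payload=b"abc"))
    net.normal_traffic()
    # Step 3: observe the response of the system.
    return {
        "error_detected": (injector, target_id, "length mismatch") in net.error_log,
        "normal_delivered": len(net.delivered),
    }
```

The returned dictionary reports both whether the injected error was detected and how many normal packets were delivered, reflecting the expectation that normal transmission continues among the other nodes while the injector is off-line.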
Further preferably, injecting the data includes sending a corrupted data packet to one of the nodes, and observing the operation of the system includes ascertaining that the corrupted packet has been detected. Most preferably, the nodes are linked to the network by respective data link adapters, and sending the corrupted data packet includes transmitting to the one of the nodes a packet having a corrupted data link header, so as to ascertain that the respective data link adapter detects the corrupted header.
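A corrupted data link header and its detection by the receiving adapter can be sketched as follows. The header layout (source, destination and payload length, followed by a CRC-32 of the header) is an assumption made only for this example and does not describe the actual data link format of any particular adapter.

```python
import struct
import zlib

HEADER = struct.Struct(">HHH")  # assumed layout: source, destination, payload length
CRC = struct.Struct(">I")       # assumed CRC-32 computed over the header bytes


def build_frame(src, dst, payload):
    """Build a well-formed frame: header, header CRC, payload."""
    header = HEADER.pack(src, dst, len(payload))
    return header + CRC.pack(zlib.crc32(header)) + payload


def corrupt_header(frame, bit=0):
    """Flip one bit inside the header, leaving the stored CRC unchanged."""
    if not 0 <= bit < HEADER.size * 8:
        raise ValueError("bit index lies outside the header")
    damaged = bytearray(frame)
    damaged[bit // 8] ^= 1 << (bit % 8)
    return bytes(damaged)


class Adapter:
    """Toy data link adapter that discards frames whose header CRC does not match."""

    def __init__(self):
        self.received = []
        self.bad_packets = 0

    def receive(self, frame):
        """Return True if the frame was accepted, False if it was discarded."""
        if len(frame) < HEADER.size + CRC.size:
            self.bad_packets += 1
            return False
        header = frame[:HEADER.size]
        (stored_crc,) = CRC.unpack(frame[HEADER.size:HEADER.size + CRC.size])
        if zlib.crc32(header) != stored_crc:
            self.bad_packets += 1
            return False
        _, _, length = HEADER.unpack(header)
        body = frame[HEADER.size + CRC.size:]
        self.received.append(body[:length])
        return True
```

The test succeeds when the adapter's `bad_packets` counter increases after the corrupted frame is sent, since this indicates that the adapter at the target node detected the corrupted header rather than passing the packet on.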
In a preferred embodiment, injecting the data includes sending a command to one of the switches, wherein the command is of a type that is normally sent in response to an error in the network, and wherein sending the command includes choosing a command from the group of commands consisting of an initialization command, a reset command and a port disable command.
In another preferred embodiment, the system includes a primary node, which normally receives service messages from the switches in the network, and injecting the data includes sending data to the primary node having the form of a service message from one of the switches. Preferably, sending the data includes sending an error report. Alternatively or additionally, injecting the data includes sending a command to the one of the switches that causes the switch to convey service messages to the error injector node, rather than to the primary node.
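The two service-message variants described above can be illustrated by a short Python sketch. The message fields, identifiers and class names are hypothetical. The sketch assumes that the primary node attributes a report to whichever switch is named in the message, without further verification.

```python
from dataclasses import dataclass

PRIMARY_NODE_ID = 0    # hypothetical identifier of the primary node
INJECTOR_NODE_ID = 7   # hypothetical identifier of the error injector


@dataclass(frozen=True)
class ServiceMessage:
    claimed_switch: int   # switch named as the sender; not verified
    destination: int
    text: str


class PrimaryNode:
    def __init__(self):
        self.error_log = []

    def receive(self, message):
        # Only messages addressed to the primary node reach its log.
        if message.destination == PRIMARY_NODE_ID:
            self.error_log.append((message.claimed_switch, message.text))


class Switch:
    def __init__(self, switch_id):
        self.switch_id = switch_id
        self.service_destination = PRIMARY_NODE_ID

    def set_service_destination(self, node_id):
        """Command that redirects this switch's service messages."""
        self.service_destination = node_id

    def report(self, text):
        return ServiceMessage(self.switch_id, self.service_destination, text)


def forge_error_report(switch_id, text):
    """Injector-built message that has the form of a report from a switch."""
    return ServiceMessage(switch_id, PRIMARY_NODE_ID, text)
```

A forged report lets the tester observe how the primary node reacts to a fault that has not actually occurred, while a redirected switch exercises the opposite case, in which a genuine report never reaches the primary node.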
There is further provided, in accordance with a preferred embodiment of the present invention, a manageable computer network system, including:
a network of switches, among which data are normally conveyed in the network according to predetermined conventions; and
a multiplicity of nodes interconnected by the switches, one of which nodes is selected to serve as an error injector, which injects data into the network in a manner that violates the predetermined conventions, so as to simulate an error condition in the system in order that a response of the system to the error condition can be observed.
Preferably, the error injector node is taken off-line while normal data transmission continues among other nodes and switches in the system.
In a preferred embodiment, the error injector node sends a corrupted data packet to a target node, so that it can be ascertained that the corrupted packet has been detected by the target node. Preferably, the system includes data link adapters, which link the nodes to the switches in the network, wherein the corrupted data packet has a corrupted data link header, which is detected by one of the data link adapters that is associated with the target node.
In another preferred embodiment, the error injector node sends to one of the switches a command of a type that is normally sent in response to an error in the network.
In still another preferred embodiment, the multiplicity of nodes includes a primary node, which normally receives service messages from the switches in the network, and the error injector node sends data to the primary node having the form of a service message from one of the switches.
There is moreover provided, in accordance with a preferred embodiment of the present invention, a computer software product for simulation testing of a computer network system including a network of switches linking a plurality of processor nodes, wherein data are normally conveyed in the network according to predetermined conventions, the product including computer-readable code, which is read by one of the nodes selected to serve as an error injector node among a multiplicity of nodes coupled to the network and causes the error injector node to inject data into the network in a manner that violates the predetermined conventions, so as to simulate an error condition in the system, wherein operation of the system following the injection of the data is observed in order to evaluate a response of the system to the error condition.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: