1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for validating a cabling topology in a distributed computing system.
2. Description of Related Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices. Today's computers are much more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components, application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
Distributed computing is an area of computer technology that has experienced such advances. Distributed computing is the execution of a task (split up and specially adapted) on multiple processors in order to obtain results faster. Distributed computing is based on the fact that the process of solving a problem usually can be divided into smaller tasks, which may be carried out simultaneously with some coordination. A distributed computing system, therefore, is a computing system that uses two or more network connected computing devices to accomplish a common task. Such computing devices may be implemented as stand-alone computers, blade servers installed in a server chassis, compute nodes installed in a mid-plane of a parallel computer, or any other computing device.
Regardless of their implementation, at some level, computing devices are typically connected in the distributed computing system through circuit board traces and connectors or using data communication cables. When the computing devices are implemented as stand-alone computers, each computer is connected to the other computing devices using cables and none of the computers are connected together using circuit board traces and connectors. When the computing devices are implemented as blade servers installed in multiple server chassis, the blade servers are typically connected intra-chassis through circuit board traces and connectors, while inter-chassis connections occur using cables. Similarly, when computing devices are implemented as compute nodes installed in a mid-plane of a parallel computer, the compute nodes are typically connected intra-mid-plane through circuit board traces and connectors, while the inter-mid-plane connections occur using cables. These units that are connected by cables, whether each unit is a stand-alone computer, a blade server chassis, or a mid-plane of a parallel computer, are referred to as ‘cabled nodes.’ That is, cabled nodes of a distributed computing system are apparatus connected by cables for data communication. A cabled node, therefore, may be implemented using a stand-alone computer, a blade server chassis, or a mid-plane of a parallel computer.
As mentioned above, a distributed computing system is a computing system that uses two or more network connected computing devices to accomplish a common task. Typically, the common task of a distributed computing system is to run a distributed software application. Portions of the distributed software application run on each computing device in the distributed computing system simultaneously. The portion of the distributed software application being executed at any one moment in the distributed computing system may be identical across all the computing devices. However, different computing devices may also be executing different portions of the distributed software application at any one moment.
As each computing device processes a portion of the distributed application software, there are generally two ways that the computing devices may communicate: shared memory or message passing. Shared memory processing needs additional locking for the data, imposes the overhead of additional processor and bus cycles, and serializes some portions of the application.
Message passing processing uses high-speed data communications networks and message buffers to effect communication, but this communication adds transfer overhead on the data communications networks as well as additional memory needed for message buffers and latency in the data communications among computing devices. Designs of distributed computing systems use specially designed data communications links so that the communication overhead will be small, but it is the distributed software application that dictates the volume of the traffic.
Many data communications network topologies are used for message passing among computing devices in distributed computing systems. Typically, message passing requires that computing devices be organized in a network topology such as ‘torus’ or ‘rectangular mesh,’ for example, to effect point-to-point communication among computing devices. A torus network topology connects the computing devices in a multi-dimensional mesh with wrap-around links in each dimension. For example, a torus network topology may connect the computing devices in a three-dimensional mesh. In such a torus network topology, every computing device, therefore, is connected to its six neighbors, and each computing device is addressed by its x, y, and z coordinates in the mesh. A rectangular mesh network topology is similar to a torus network topology, but the connections in a rectangular mesh network topology do not wrap around in all dimensions. All computing devices can still communicate with each other, but performance is less than optimal since messages from computing devices near the edges have to traverse many intervening computing devices to reach a computing device on the other edge of the mesh. In distributed software applications that do not require extensive point-to-point data communications, the less than optimal performance of a rectangular mesh topology may be satisfactory in light of the increased complexity in the physical hardware typically required to implement a torus topology.
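The difference between the two topologies described above can be illustrated with a minimal Python sketch. The function names `neighbors` and `hops`, the 4x4x4 dimensions, and the dimension-by-dimension hop count are assumptions chosen for illustration only; they are not drawn from any particular system.

```python
def neighbors(coord, dims, wrap):
    """Return the direct neighbors of coord in a grid of shape dims.

    wrap=True models a torus (wrap-around links in every dimension);
    wrap=False models a rectangular mesh (no wrap-around links).
    """
    result = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            pos = coord[axis] + step
            if wrap:
                pos %= size          # follow the wrap-around link
            elif not 0 <= pos < size:
                continue             # mesh edge: no link in this direction
            neighbor = coord[:axis] + (pos,) + coord[axis + 1:]
            if neighbor != coord:
                result.add(neighbor)
    return result


def hops(a, b, dims, wrap):
    """Minimum number of links a message crosses between a and b."""
    total = 0
    for x, y, size in zip(a, b, dims):
        distance = abs(x - y)
        total += min(distance, size - distance) if wrap else distance
    return total


dims = (4, 4, 4)                      # hypothetical 4x4x4 arrangement
corner, far_corner = (0, 0, 0), (3, 3, 3)

print(len(neighbors(corner, dims, wrap=True)))    # 6 neighbors in the torus
print(len(neighbors(corner, dims, wrap=False)))   # 3 neighbors in the mesh
print(hops(corner, far_corner, dims, wrap=True))  # 3 hops in the torus
print(hops(corner, far_corner, dims, wrap=False)) # 9 hops in the mesh
```

In this sketch, a device at the corner of the mesh has only three neighbors and needs nine hops to reach the opposite corner, while the same device in the torus has six neighbors and needs three hops.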
Regardless of the network topology used for data communications in a distributed computing system, the cabling topology must match a user's desired network topology. Consider, for example, that a user selects various blade servers of a distributed computing system on which to operate the user's distributed software application. Further consider that the blade servers are installed in multiple blade server chassis. If a distributed software application requires a torus network topology for data communications among blade servers, then the cables between the multiple blade server chassis must also be configured in a torus topology to effect inter-chassis data communication suitable for the application. Because system architects rarely modify the arrangement of the physical cables between cabled nodes in a distributed computing system, users of the distributed computing system must carefully choose particular cabled nodes within a fixed cabling scheme that provide the proper cabling topology. That is, by choosing particular cabled nodes in the distributed computing system that participate in processing the user's distributed software application, the user may obtain the desired cabling topology, and therefore, the desired communications network topology.
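As a simplified illustration of why the choice of cabled nodes matters, the following Python sketch checks whether a chosen set of cabled nodes is cabled as a closed ring, that is, a torus in a single dimension. The function `forms_ring`, the chassis names, and the cable list are hypothetical, and the check itself is only one assumed way to test a selection; it is not presented as the method of any particular system.

```python
def forms_ring(chosen, cables):
    """Return True if the chosen cabled nodes form one closed ring.

    cables is a list of (node, node) pairs describing the fixed cabling.
    Only cables whose two ends are both in the chosen set are considered.
    Simplifying assumption: a ring means every chosen node has exactly
    two cabled peers and all chosen nodes are reachable from one another.
    """
    chosen = set(chosen)
    if len(chosen) < 3:
        return False
    adjacency = {node: set() for node in chosen}
    for a, b in cables:
        if a in chosen and b in chosen and a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    if any(len(peers) != 2 for peers in adjacency.values()):
        return False
    # Walk the cables from an arbitrary node to confirm a single ring.
    start = next(iter(chosen))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for peer in adjacency[node]:
            if peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return seen == chosen


# Hypothetical fixed cabling: four chassis cabled A-B-C-D and back to A.
cables = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

print(forms_ring(["A", "B", "C", "D"], cables))  # True: wrap-around present
print(forms_ring(["A", "B", "C"], cables))       # False: only a line A-B-C
```

Under these assumptions, selecting all four chassis yields a ring suitable for a torus, while selecting only three of them leaves a line with no wrap-around cable, which corresponds to a mesh rather than a torus.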