1. Technical Field
This invention generally relates to maintenance and support of a parallel computing system, and more specifically relates to an apparatus and method for a scalable property viewer on a massively parallel computer system.
2. Background Art
Efficient fault detection and recovery is important to decrease down time and repair costs for sophisticated computer systems. On parallel computer systems with a large number of compute nodes, a failure of a single component may cause a large portion of the computer, or the entire computer, to be taken off line for repair.
Massively parallel computer systems are one type of parallel computer system that has a large number of interconnected compute nodes. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. Each Blue Gene/L node consists of a single ASIC (application-specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets with 32 node boards in each rack.
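The packaging figures above imply a per-rack and per-board node count. The following is a minimal sketch of that arithmetic only; the constant names are illustrative, and the per-board figure is derived on the assumption that nodes are spread evenly across racks and node boards, which the text does not state explicitly.

```python
# Figures taken from the description above.
TOTAL_NODES = 65_536          # current maximum number of compute nodes
RACKS = 64                    # racks (cabinets) in the full computer
NODE_BOARDS_PER_RACK = 32     # node boards in each rack
CPUS_PER_NODE = 2             # CPUs on each node ASIC

# Derived values, assuming an even distribution (an assumption, not a stated fact).
nodes_per_rack = TOTAL_NODES // RACKS
nodes_per_board = nodes_per_rack // NODE_BOARDS_PER_RACK
total_cpus = TOTAL_NODES * CPUS_PER_NODE

print(nodes_per_rack, nodes_per_board, total_cpus)
```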
The Blue Gene/L supercomputer communicates over several communication networks. The 65,536 compute nodes are arranged into both a logical tree network and a 3-dimensional torus network. The logical tree network connects the compute nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its closest 6 neighbors in a section of the computer. The Blue Gene/L supercomputer is scalable such that a system can be operational without the full set of 64 racks. Each rack is logically 16×8×8 nodes connected together in the torus. A number of racks (up to 64) can be cabled together to complete the torus on a smaller scale in the Blue Gene/L system.
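The six-neighbor property of the torus network can be illustrated with a short sketch. It assumes nodes are addressed by (x, y, z) coordinates and that each dimension wraps around at its edge; the function name `torus_neighbors` and the coordinate scheme are hypothetical and are not taken from the Blue Gene/L software.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

def torus_neighbors(coord: Coord, dims: Coord) -> List[Coord]:
    """Return the 6 nearest neighbors of a node in a 3-D torus.

    Each dimension wraps around, so a node on the edge of the lattice
    is linked to the node on the opposite edge.
    """
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z),
        ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z),
        (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz),
        (x, y, (z - 1) % dz),
    ]

# Logical dimensions of one rack, as given in the text (16 x 8 x 8 nodes).
RACK_DIMS = (16, 8, 8)

# A corner node still has six distinct neighbors because of the wrap-around.
corner = torus_neighbors((0, 0, 0), RACK_DIMS)
print(corner)
```

The wrap-around is what distinguishes a torus from a plain mesh: in a mesh, the corner node above would have only three neighbors.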
In the prior art, the Blue Gene/L supercomputer incorporated a data collection mechanism in the service node that compiles information, such as node temperature, from all the nodes in the system. The information is provided to system administrators in tabular form and is used to monitor potential problems and troubleshoot system failures.
Thus, while the prior art provided a mechanism to view properties of the system, it did not provide a convenient and useful tool to view properties in a graphical form that is scalable as the size of the system changes. Without a way for system administrators to easily view and interpret properties of the full system, parallel computer administrators will continue to waste time and effort monitoring parallel computer systems.