Parallel processing computer systems have found application in a number of different computing scenarios, particularly those requiring high performance and fault tolerance. For instance, airlines rely on parallel processing to process customer information, forecast demand and decide what fares to charge. The medical community uses parallel processing supercomputers to analyze magnetic resonance images and to study models of bone implant systems. A parallel processing architecture generally allows several processors, each having its own memory, to work simultaneously. Parallel computing systems thus enable networked processing resources, or nodes, to cooperatively perform computing tasks.
The best candidates for parallel processing typically include projects that require many different computations. Unlike single processor computers that perform computations sequentially, parallel processing systems can perform several computations at once, drastically reducing the time it takes to complete a project. Overall performance is increased because multiple nodes can handle a larger number of tasks in parallel than could a single computer.
Another advantageous feature of some parallel processing systems is their scalable, or modular, nature. This modular characteristic allows system designers to add nodes to, or remove nodes from, a system according to the specific operating requirements of a user. Parallel processing systems may further utilize load balancing to distribute work fairly among nodes, preventing individual nodes from becoming overloaded and maximizing overall system performance. In this manner, a task that might otherwise take several days on a single processing machine can be completed in minutes.
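By way of illustration only, the load balancing described above may be sketched as a greedy assignment in which each task is given to whichever node currently carries the least work. The function name `assign_tasks`, the numeric task costs and the integer node identifiers are hypothetical and are assumed here purely for the example; an actual parallel processing system may balance load in an entirely different manner.

```python
import heapq


def assign_tasks(task_costs, node_count):
    """Greedy least-loaded assignment (illustrative sketch only).

    task_costs: list of estimated costs, indexed by task id (assumed input).
    node_count: number of nodes available to share the work.
    Returns (assignment, loads), both keyed by node id.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")

    # Heap entries are (current_load, node_id); the least-loaded node is on top.
    heap = [(0, node_id) for node_id in range(node_count)]
    heapq.heapify(heap)
    assignment = {node_id: [] for node_id in range(node_count)}

    # Placing the most expensive tasks first tends to give a more even spread.
    ordered = sorted(enumerate(task_costs), key=lambda item: -item[1])
    for task_id, cost in ordered:
        load, node_id = heapq.heappop(heap)
        assignment[node_id].append(task_id)
        heapq.heappush(heap, (load + cost, node_id))

    loads = {
        node_id: sum(task_costs[t] for t in tasks)
        for node_id, tasks in assignment.items()
    }
    return assignment, loads


if __name__ == "__main__":
    assignment, loads = assign_tasks([5, 3, 8, 2, 7, 4], node_count=3)
    print(assignment)
    print(loads)
```

In this sketch, adding or removing a node amounts to changing `node_count`, which loosely mirrors the modular character noted above.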
In addition to providing superior processing capabilities, parallel processing computers allow an improved level of redundancy, or fault tolerance. Should any one node in a parallel processing system fail, the operations previously performed by that node may be handled by other nodes in the system. Tasks may thus be accomplished irrespective of particular node failures that could otherwise cause a failure in non-parallel processing environments.
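As a rough sketch of the redundancy described above, the tasks of a failed node may be handed to the surviving nodes. The function name `reassign_on_failure`, the dictionary layout and the fewest-tasks-first policy are assumptions made solely for illustration and are not drawn from any particular system.

```python
def reassign_on_failure(assignment, failed_node):
    """Spread a failed node's tasks over the surviving nodes (illustrative only).

    assignment: dict mapping node id -> list of task ids (assumed layout).
    failed_node: id of the node that has stopped responding.
    Returns a new dict that no longer contains the failed node.
    """
    survivors = {
        node: list(tasks)
        for node, tasks in assignment.items()
        if node != failed_node
    }
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the work")

    for task in assignment.get(failed_node, []):
        # Give each orphaned task to the survivor holding the fewest tasks;
        # ties are broken by the lower node id.
        target = min(survivors, key=lambda node: (len(survivors[node]), node))
        survivors[target].append(task)
    return survivors


if __name__ == "__main__":
    before = {0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}
    print(reassign_on_failure(before, failed_node=2))
```

The sketch leaves the original assignment untouched and returns a new one, so that a caller could compare the state of the system before and after the failure.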
Despite the improved fault tolerance afforded by parallel computing systems, faulty nodes can still hinder aggregate performance. It consequently becomes necessary to eventually replace or otherwise repair underperforming nodes and/or their associated connections. For instance, it may be advantageous to check for faulty cables, software, processors, memory and interconnections as modular computing components are added to a parallel computing system. Connections along the outer connecting surfaces of node cells are particularly prone to damage, improper installation and/or improper routing. Because the cell surface connections are physically cabled (as opposed to the factory-constructed internal wiring of the cell), they are much more susceptible to cable damage, human error in cabling, and configuration issues that may result in a nonfunctional system.
The relatively large number of nodes used in some such systems, however, can complicate node maintenance. Ironically, the very redundancy that enables fault tolerance can make it harder to find faulty nodes along a node cell surface, or face. With so many nodes and alternative data paths, it may be difficult to pinpoint the address, or even the general surface, of a node cell or nodal connection requiring service.
As such, a significant need exists for a more effective way of determining and locating faulty nodes in a parallel processing environment.