Parallel processing computer systems have found application in a number of different computing scenarios, particularly those requiring high performance and fault tolerance. For instance, airlines rely on parallel processing to process customer information, forecast demand and decide what fares to charge. The medical community uses parallel processing supercomputers to analyze magnetic resonance images and to study models of bone implant systems. A parallel processing architecture generally allows several processors, each having its own memory, to work simultaneously. Parallel computing systems thus enable networked processing resources, or nodes, to cooperatively perform computing tasks.
The best candidates for parallel processing typically include projects that require many different computations. Unlike single processor computers that perform computations sequentially, parallel processing systems can perform several computations at once, drastically reducing the time it takes to complete a project. Overall performance is increased because multiple nodes can handle a larger number of tasks in parallel than could a single computer.
Other advantageous features of some parallel processing systems relate to their scalable, or modular, nature. This modular characteristic allows system designers to add nodes to, or remove nodes from, a system according to the specific operating requirements of a user. Parallel processing systems may further utilize load balancing to distribute work evenly among nodes, preventing individual nodes from becoming overloaded and maximizing overall system performance. In this manner, a task that might otherwise take several days on a single processing machine can be completed in minutes.
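By way of illustration only, the load balancing described above can be sketched as a greedy "least-loaded node" assignment. The function names `assign_tasks` and `node_loads`, the node identifiers, and the numeric task costs are hypothetical and are not drawn from any particular system; practical load balancers may use quite different policies.

```python
import heapq


def assign_tasks(task_costs, node_ids):
    """Greedy sketch: give each task to the node with the smallest current load.

    task_costs: list of numeric costs, one per task (hypothetical units).
    node_ids:   list of node identifiers (hypothetical names).
    Returns a dict mapping node id -> list of task indices.
    """
    # Heap entries are (current_load, node_id); ties break on node_id.
    heap = [(0, node_id) for node_id in node_ids]
    heapq.heapify(heap)
    assignment = {node_id: [] for node_id in node_ids}

    # Placing the most expensive tasks first tends to even out final loads.
    ordered = sorted(enumerate(task_costs), key=lambda pair: -pair[1])
    for task_index, cost in ordered:
        load, node_id = heapq.heappop(heap)
        assignment[node_id].append(task_index)
        heapq.heappush(heap, (load + cost, node_id))
    return assignment


def node_loads(assignment, task_costs):
    """Total cost carried by each node under the given assignment."""
    return {
        node_id: sum(task_costs[i] for i in tasks)
        for node_id, tasks in assignment.items()
    }
```

For six tasks of unequal cost spread over three nodes, this heuristic keeps the per-node totals close to one another, so that no single node carries a disproportionate share of the work.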
In addition to providing superior processing capabilities, parallel processing computers provide an improved level of redundancy, or fault tolerance. Should any one node in a parallel processing system fail, the operations previously performed by that node may be handled by other nodes in the system. Tasks may thus be accomplished irrespective of particular node failures that could otherwise cause a failure in non-parallel processing environments.
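As a minimal sketch of the redundancy described above, the work of a failed node may be redistributed among the surviving nodes. The function name `reassign_on_failure`, the round-robin policy, and the node and task labels are assumptions made for illustration; they do not describe any specific system.

```python
def reassign_on_failure(assignment, failed_node):
    """Spread a failed node's tasks round-robin over the surviving nodes.

    assignment:  dict mapping node id -> list of tasks (hypothetical labels).
    failed_node: id of the node that has stopped responding.
    Returns a new dict; the input assignment is left unmodified.
    """
    survivors = sorted(n for n in assignment if n != failed_node)
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the work")

    # Copy the survivors' existing task lists before adding orphaned tasks.
    new_assignment = {n: list(assignment[n]) for n in survivors}
    for offset, task in enumerate(assignment.get(failed_node, [])):
        target = survivors[offset % len(survivors)]
        new_assignment[target].append(task)
    return new_assignment
```

Under these assumptions, every task originally held by the failed node still appears in the resulting assignment, which is the sense in which the overall job can complete despite the failure.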
Despite the improved fault tolerance afforded by parallel computing systems, faulty nodes can hinder performance in the aggregate. It eventually becomes necessary to replace or otherwise fix underperforming nodes and/or associated connections. For instance, it may be advantageous to check for faulty cables, software, processors, memory and interconnections as modular computing components are added to a parallel computing system.
The relatively large number of nodes used in some such systems, however, can complicate node maintenance. Ironically, the very redundancy that enables fault tolerance can sometimes hinder the processes used to find faulty nodes. With so many nodes and alternative data paths, it may be difficult to pinpoint the address, or even the general region, of a node or nodal connection requiring service.
As such, a significant need exists for a more effective way of determining and locating faulty nodes in a parallel processing environment.