1. Field of the Invention
The present invention generally relates to a system and method for detecting a faulty object in a system including a plurality of objects in communication with each other in an n-dimensional architecture, and to a method of deploying computing infrastructure to implement the method. For example, the exemplary methods and systems according to the present invention can detect a faulty processor via geometrically-aware power-on self-tests, and/or detect and localize bad (e.g., faulty) processors and/or communication links in computing systems (e.g., parallel computing systems, telecommunication switching networks, etc.) which include a plurality of objects in an n-dimensional architecture, based on statistically significant differences and intersecting lines of communication. A first plane of objects in the n-dimensional architecture is probed, and at least one other plane of objects is probed, thereby identifying a single faulty object in the system.
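By way of illustration only, the plane-probing idea described above may be sketched as follows for a three-dimensional arrangement of nodes. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `plane_has_fault` and `locate_single_fault` are hypothetical, the probe is simulated from a known set of faulty coordinates rather than from a statistical test on real hardware, and exactly one faulty node is assumed.

```python
def plane_has_fault(faulty, axis, index):
    """Hypothetical probe (assumption): report True if any faulty node lies
    in the plane whose coordinate along `axis` equals `index`. A real probe
    would instead compare a measured statistic for the plane against the
    other planes."""
    return any(node[axis] == index for node in faulty)


def locate_single_fault(shape, probe):
    """Probe every plane along every axis and intersect the anomalous planes.

    Returns the coordinates of the single faulty node, or None if no plane
    along some axis reports a fault. Raises ValueError if more than one
    plane along an axis is anomalous, since this sketch assumes one fault.
    """
    coords = []
    for axis, size in enumerate(shape):
        bad = [i for i in range(size) if probe(axis, i)]
        if not bad:
            return None
        if len(bad) > 1:
            raise ValueError("more than one faulty plane along axis %d" % axis)
        coords.append(bad[0])
    return tuple(coords)


# Simulated 4 x 4 x 4 system with one faulty node (illustrative data only).
shape = (4, 4, 4)
faulty = {(2, 0, 3)}
found = locate_single_fault(shape, lambda axis, i: plane_has_fault(faulty, axis, i))
print(found)  # (2, 0, 3)
```

In this simulated 4 x 4 x 4 example, the sketch issues 12 plane probes (4 per axis) instead of testing each of the 64 nodes individually; the one anomalous plane found along each axis supplies one coordinate of the faulty node.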
2. Description of the Related Art
In computing systems which are made up of a plurality of processors, it is desirable to be able to detect and locate faulty objects (e.g., hardware), such as processors (e.g., faulty nodes) and/or communication links in computing systems (e.g., parallel computing systems) which include a plurality of objects in an n-dimensional architecture. When a bad node or connection is found, the options generally are to replace the faulty hardware, to employ one type of fault tolerance if data is being corrupted, to employ a second type of fault tolerance if nodes need to be routed around, etc.
The related art methods generally use localized tests to find the faulty nodes. However, the related art methods do not work well, particularly when the computing system becomes very large, for example, when the architecture of the computing system (e.g., a parallel computing system) is such that the number of processors is greatly increased (e.g., 65,000 or more processors).
The related art methods do not scale well, provide only rough approximations as to the location of the faulty object(s) (e.g., faulty nodes and/or communication links), and/or take a long time to run, etc.
The related art methods have not addressed or solved the aforementioned problems.