High performance computing has been developed for university research as well as for industry, in particular in technical fields such as the automotive industry, aeronautics, energy, climatology and the life sciences. Modelling and simulation make it possible in particular to reduce development costs and to bring to market more quickly products that are innovative, more reliable and that consume less energy. For researchers, high performance computing has become an indispensable means of investigation.
High performance computers are formed by combining calculation nodes that each group together several processors. The principle of these computers thus consists in multiplying the calculation nodes so as to execute a large number of calculations simultaneously. As shown in FIG. 1, these computers 1 known in the prior art typically comprise several calculation nodes 2, each comprising a network controller 3, which are connected together by an interconnection network 4. Moreover, a management network 5 interacts with the calculation nodes 2 in order to configure, monitor and administer them.
In high performance computers, the largest of which comprise a few tens of thousands of calculation nodes, one of the most difficult problems to resolve is identifying the calculation nodes that have intermittent malfunctions, which affect the calculations only indirectly (slower calculations, incorrect calculations, etc.). Furthermore, the number of breakdowns increases exponentially with the number of calculation nodes that the computer contains.
A breakdown of a calculation node can in particular be caused by the failure of a hardware component, such as a central processing unit (CPU), a memory or an electrical power supply, or by a defect (typically called a bug) in a software component implemented by the calculation node concerned or by infrastructure elements.
An existing solution, when it is suspected that the computer is slowed down, is to launch an analysis consisting in performing a dichotomous succession of calculations within the high performance computer in order to isolate the defective node: the set of suspect nodes is halved at each step according to whether the malfunction appears in the half under test. While a dichotomous succession of calculations is running, the computer cannot be used for any purpose other than diagnosis. Consequently, this analysis requires the exclusive use of the high performance computer and is therefore very expensive.
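By way of illustration only, the dichotomous analysis described above can be sketched as follows. This is a minimal sketch and not the procedure of any particular computer: it assumes a single defective node whose malfunction shows up on every test run, and the names `isolate_defective_node` and `is_degraded` are hypothetical. `is_degraded` stands for a test calculation launched on a subset of nodes that reports whether that subset behaves abnormally.

```python
from typing import Callable, List, Sequence, Tuple


def isolate_defective_node(
    nodes: Sequence[int],
    is_degraded: Callable[[List[int]], bool],
) -> Tuple[int, int]:
    """Return (suspect node, number of test runs performed).

    Assumption: exactly one node in `nodes` is defective and
    `is_degraded` detects it reliably whenever it is in the subset.
    """
    candidates = list(nodes)
    runs = 0
    while len(candidates) > 1:
        # Test the first half of the remaining candidates.
        half = candidates[: len(candidates) // 2]
        runs += 1
        if is_degraded(half):
            # The defect is in the tested half.
            candidates = half
        else:
            # The defect is in the untested half.
            candidates = candidates[len(half):]
    return candidates[0], runs


if __name__ == "__main__":
    FAULTY = 11  # hypothetical defective node used to simulate the test
    suspect, runs = isolate_defective_node(
        range(16), lambda subset: FAULTY in subset
    )
    print(suspect, runs)
```

Under these assumptions the number of test runs grows with the base-2 logarithm of the number of nodes, but each run occupies the nodes under test. An intermittent malfunction may also fail to appear during a given run, in which case the search can converge on a healthy node.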