1. Field
The present disclosure concerns the administration of complex computer systems and more particularly a method and a device for processing administration commands in a cluster.
2. Description of the Related Art
High performance computing (HPC) is being developed for university research and industry alike, in particular in technical fields such as aeronautics, energy, climatology and life sciences. Modeling and simulation make it possible in particular to reduce development costs and to bring to market more quickly innovative products that are more reliable and consume less energy. For research workers, high performance computing has become an indispensable means of investigation.
This computing is generally conducted on data processing systems called clusters. A cluster typically comprises a set of interconnected nodes. Certain nodes are used to perform computing tasks (compute nodes), others to store data (storage nodes) and one or more others to manage the cluster (administration nodes). Each node is for example a server implementing an operating system such as Linux (Linux is a trademark). The connection between the nodes is, for example, made using Ethernet or InfiniBand communication links (Ethernet and InfiniBand are trademarks). Each node generally comprises one or more microprocessors, local memories and a communication interface.
FIG. 1 is a diagrammatic illustration of an example of a fat-tree topology 100 for a cluster. The cluster comprises a set of nodes of general reference 105. The nodes belonging to the set 110 are compute nodes, whereas the nodes of the set 115 are service nodes (storage nodes and administration nodes). The compute nodes may be grouped together in sub-sets 120 referred to herein as “compute islets,” the set 115 being referred to herein as a “service islet.”
The nodes are linked together by switches, for example hierarchically. In the exemplary embodiment illustrated in FIG. 1, the nodes are connected to first level switches 125 which are themselves linked to second level switches 130 which in turn are linked to third level switches 135.
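By way of illustration only, the hierarchical linking described above may be sketched as follows. This is a minimal sketch and not part of the disclosure: the names Switch, Node and path_to_top, as well as the device names, are hypothetical, and each switch is given a single uplink for simplicity, whereas an actual fat-tree generally provides several.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Switch:
    """Hypothetical model of a switch at level 1, 2 or 3 (cf. 125, 130, 135)."""
    name: str
    level: int
    # Simplifying assumption: one uplink per switch (a real fat-tree has several).
    uplink: Optional["Switch"] = None


@dataclass
class Node:
    """Hypothetical model of a node attached to a first level switch."""
    name: str
    switch: Switch


def path_to_top(node: Node) -> List[str]:
    """Return the names of the switches crossed from the node up to the top level."""
    path: List[str] = []
    current: Optional[Switch] = node.switch
    while current is not None:
        path.append(current.name)
        current = current.uplink
    return path


# Illustrative three-level chain: node -> level 1 -> level 2 -> level 3.
top = Switch("sw3-0", level=3)
mid = Switch("sw2-0", level=2, uplink=top)
leaf = Switch("sw1-0", level=1, uplink=mid)
node = Node("compute-0", switch=leaf)

print(path_to_top(node))
```

Walking the uplinks from a node yields one switch per level, which mirrors the first, second and third level switches of FIG. 1.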
The nodes of a cluster as well as the other components such as the switches are often grouped together in racks, which may themselves be grouped together into islets. Furthermore, to ensure proper operation of the components contained in a rack, the rack generally comprises a cooling system, for example a cooling door (often called a cold door).
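Purely as an illustration, the grouping of components into racks and of racks into islets may be represented by a nested data structure. The sketch below is hypothetical: the class names Rack and Islet, the field has_cold_door, the helper components_of and the component names are assumptions introduced for the example and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rack:
    """Hypothetical rack holding named components (nodes, switches)."""
    name: str
    components: List[str] = field(default_factory=list)
    # Assumption: the rack's cooling system is reduced to a single flag.
    has_cold_door: bool = True


@dataclass
class Islet:
    """Hypothetical islet grouping several racks."""
    name: str
    racks: List[Rack] = field(default_factory=list)


def components_of(islet: Islet) -> List[str]:
    """List every component of every rack of the islet, in rack order."""
    return [c for rack in islet.racks for c in rack.components]


# Illustrative compute islet made of two racks.
rack_a = Rack("rack-a", ["compute-0", "compute-1", "sw1-0"])
rack_b = Rack("rack-b", ["compute-2", "sw1-1"])
islet = Islet("compute-islet-0", [rack_a, rack_b])

print(components_of(islet))
```

Such a structure makes it possible to address an operation to a single component, to a rack or to a whole islet by traversing the corresponding level of the hierarchy.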
The management of a cluster, in particular the starting, stopping or software updating of components in the cluster, is typically carried out from administration nodes using predetermined processes or directly by an operator. Certain operations, such as starting and stopping the whole of the cluster, islets or racks, may also be carried out manually, node by node or rack by rack.
It has been observed that although the problems linked to the management of clusters do not generally have a direct influence on the performance of a cluster, they may be critical. Thus, for example, if a cooling problem for a room housing racks is detected, it is often necessary to rapidly stop the cluster at least partially to avoid overheating of components which could in particular lead to the deterioration of hardware and/or data loss.
There is thus a need to improve the management of clusters, in particular to process administration commands.