Field
The present disclosure concerns the management of computer systems and more particularly a method and device for processing commands in a set of components of a computer system such as a cluster.
Description of the Related Art
High performance computing (HPC) is being developed for university research and industry alike, in particular in technical fields such as aeronautics, energy, climatology and life sciences. Modeling and simulation make it possible to reduce development costs and to accelerate the placing on the market of innovative products that are more reliable and consume less energy. For research workers, high performance computing has become an indispensable means of investigation.
This computing is generally conducted on data processing systems called clusters. A cluster typically comprises a set of interconnected nodes. Certain nodes are used to perform computing tasks (compute nodes), others to store data (storage nodes), and others to manage the cluster (administration nodes). Each node is, for example, a server implementing an operating system such as Linux™. The connection between the nodes is, for example, made using Ethernet™ or InfiniBand™ communication links. Each node generally comprises one or more microprocessors, local memories, and a communication interface.
FIG. 1 is a diagrammatic illustration of an example of a fat-tree topology 100 for a cluster. The topology comprises a set of nodes of general reference 105. The nodes belonging to the set 110 are compute nodes whereas the nodes of the set 115 are service nodes, e.g., storage nodes and administration nodes. The compute nodes may be grouped together in sub-sets 120 referred to herein as “compute islets,” while the service nodes of the set 115 may be grouped together as a “service islet.”
The nodes are linked together by switches, for example hierarchically. In the exemplary embodiment illustrated in FIG. 1, the nodes are connected to first level switches 125 which are themselves linked to second level switches 130 which in turn are linked to third level switches 135.
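By way of non-limiting illustration only, the hierarchical arrangement described above may be sketched as a small data structure. The sketch below is written in Python and is not taken from the disclosure: the class names `Switch` and `Node`, the functions `build_example`, `path_to_top` and `nodes_by_islet`, and all component names such as `sw1-0` or `compute-islet-0` are hypothetical and chosen solely for the example. It assumes a single uplink per switch, whereas an actual fat-tree generally provides several.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Switch:
    name: str
    level: int  # 1 = first level (attached to nodes), 3 = third level
    uplinks: list[Switch] = field(default_factory=list)


@dataclass
class Node:
    name: str
    role: str    # "compute", "storage" or "administration"
    islet: str   # hypothetical islet label
    switch: Switch  # first level switch the node is connected to


def build_example() -> list[Node]:
    """Build a tiny, hypothetical three-level hierarchy with four nodes."""
    third = Switch("sw3-0", level=3)
    second = Switch("sw2-0", level=2, uplinks=[third])
    first_a = Switch("sw1-0", level=1, uplinks=[second])
    first_b = Switch("sw1-1", level=1, uplinks=[second])
    return [
        Node("compute0", "compute", "compute-islet-0", first_a),
        Node("compute1", "compute", "compute-islet-0", first_a),
        Node("storage0", "storage", "service-islet", first_b),
        Node("admin0", "administration", "service-islet", first_b),
    ]


def path_to_top(node: Node) -> list[str]:
    """List the node and the switches met when following the first uplink upward."""
    path = [node.name]
    current = node.switch
    while current is not None:
        path.append(current.name)
        current = current.uplinks[0] if current.uplinks else None
    return path


def nodes_by_islet(nodes: list[Node]) -> dict[str, list[str]]:
    """Group node names by the islet they belong to."""
    groups: dict[str, list[str]] = {}
    for node in nodes:
        groups.setdefault(node.islet, []).append(node.name)
    return groups


if __name__ == "__main__":
    example = build_example()
    print(path_to_top(example[0]))
    print(nodes_by_islet(example))
```

Such a representation makes explicit that each node is reached through a chain of first, second and third level switches, and that nodes may be addressed collectively by islet.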
The nodes of a cluster as well as the other components of the cluster such as the switches are often grouped together in racks, which may themselves be grouped together into islets. Furthermore, to ensure proper operation of the computing components contained in a rack, the rack may generally comprise a cooling system, for example a cooling door (often called a cold door).
The management of a cluster, in particular the starting, stopping, or software updating of computer components in the cluster, is typically carried out from administration nodes using predetermined processes or directly by an operator. Certain operations, such as the starting and stopping of the whole cluster, of islets, or of racks, may also be carried out manually, node by node or rack by rack.
It has been observed that although the problems linked to the management of clusters do not generally have a direct influence on the performance of a cluster, they may be critical under certain circumstances. Thus, for example, if a cooling problem is detected in a room housing racks, it is often necessary to rapidly stop the cluster, at least partially, to avoid overheating of components, which could in particular lead to the deterioration of hardware and/or to data loss. However, the stopping or the starting of components of a cluster may be complex. A computer system may comprise a very large number of components of different types. Further, each type of component may have specific requirements linked to its stopping or starting.
There is thus a need to improve the management of clusters, in particular to process administration commands.