A supercomputer is a computer designed to achieve the highest performance possible with the technologies known at the time of its design, in particular in terms of computing speed. Supercomputers derive their superiority over conventional computers both from the technology of the components used and from their architecture.
Supercomputers thus reach speeds of several petaflops (one petaflops being 10^15 flops) and will soon reach the exaflops range (10^18 flops). The flops (for “FLoating point Operations Per Second”) is a commonly accepted unit for measuring the processing speed of a computer.
The architecture of a supercomputer may notably be pipelined or parallel, so that several tasks can be executed simultaneously. Regardless of the architecture chosen, supercomputers contain a very large number of pieces of equipment, each of which itself includes a large number of components (memories, microprocessors, etc.).
Typically, each piece of equipment may send an informational message to a monitoring system as soon as it, or one of its components, changes status. This type of message is commonly called an “event”. The monitoring system is responsible for collecting and processing all these events and for reacting accordingly.
For example, when a piece of equipment sends a large number of temperature alerts within a given time span, the monitoring service may have to decide to switch it off.
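The rule just described can be illustrated by a minimal Python sketch. It is not taken from any existing monitoring system: the `Event` fields, the `TemperatureAlertMonitor` class and the threshold values are hypothetical names and parameters chosen for illustration only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event format; a real system would define its own fields."""
    equipment_id: str   # identifier of the piece of equipment that sent the event
    kind: str           # e.g. "temperature_alert"
    timestamp: float    # emission time, in seconds


class TemperatureAlertMonitor:
    """Counts temperature alerts per piece of equipment over a sliding time window."""

    def __init__(self, max_alerts, window_seconds):
        self.max_alerts = max_alerts          # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self._recent = {}                     # equipment_id -> deque of alert timestamps

    def handle(self, event):
        """Return True if the equipment should be switched off.

        Events are assumed to arrive in chronological order.
        """
        if event.kind != "temperature_alert":
            return False
        times = self._recent.setdefault(event.equipment_id, deque())
        times.append(event.timestamp)
        # Discard alerts that have fallen out of the time window.
        while times and event.timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) >= self.max_alerts
```

With, say, `TemperatureAlertMonitor(max_alerts=3, window_seconds=60)`, three alerts from the same piece of equipment within one minute would trigger the switch-off decision, whereas the same three alerts spread over several minutes would not.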
However, alerts relating to a single problem may also come up from different pieces of equipment (or components) and must then be correlated at the monitoring system. This is the case, for example, if all the components and/or pieces of equipment located at the top of the cabinets emit temperature alerts, possibly because of a problem in a cooling circuit. It is then important to trigger an alarm for the managers of the supercomputer.
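As an illustration of this kind of correlation, the following Python sketch checks whether every piece of equipment known to be located at the top of the cabinets has emitted a temperature alert within the same time window. The `Event` fields, the function name `cooling_alarm_needed` and the notion of a fixed correlation window are assumptions made for the example, not a description of an existing correlation engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event format used only for this example."""
    equipment_id: str
    kind: str
    timestamp: float  # seconds


def cooling_alarm_needed(events, top_equipment, window_seconds):
    """Return True if all top-of-cabinet equipment alerted within one window.

    `top_equipment` is the (assumed known) set of identifiers of the pieces
    of equipment located at the top of the cabinets.
    """
    top = set(top_equipment)
    last_alert = {}  # equipment_id -> timestamp of its most recent alert
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind != "temperature_alert" or ev.equipment_id not in top:
            continue
        last_alert[ev.equipment_id] = ev.timestamp
        # Once every top equipment has alerted at least once, check that the
        # oldest of their most recent alerts still falls inside the window.
        if len(last_alert) == len(top):
            if ev.timestamp - min(last_alert.values()) <= window_seconds:
                return True
    return False
```

A single alarm raised on this condition replaces what would otherwise be handled as many unrelated per-equipment alerts.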
The monitoring service may also feed a database, updating it with the information thus collected and correlated. This database may then be used for more complex correlations, statistical calculations, etc.
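By way of illustration only, the sketch below stores events in a SQLite database through Python's standard `sqlite3` module and then runs a simple statistical query on them. The table layout and the function names (`open_event_store`, `record_events`, `alerts_per_equipment`) are hypothetical; the text above does not specify any particular database or schema.

```python
import sqlite3


def open_event_store(path=":memory:"):
    """Open (or create) a hypothetical event database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " equipment_id TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " timestamp REAL NOT NULL)"
    )
    return conn


def record_events(conn, events):
    """Insert (equipment_id, kind, timestamp) tuples in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO events (equipment_id, kind, timestamp) VALUES (?, ?, ?)",
            events,
        )


def alerts_per_equipment(conn, kind):
    """Return (equipment_id, count) pairs for one event kind, most frequent first."""
    rows = conn.execute(
        "SELECT equipment_id, COUNT(*) FROM events"
        " WHERE kind = ?"
        " GROUP BY equipment_id"
        " ORDER BY COUNT(*) DESC, equipment_id",
        (kind,),
    )
    return rows.fetchall()
```

A query such as `alerts_per_equipment(conn, "temperature_alert")` is one example of the statistical calculations that such a database makes possible after the events have been collected.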
In order to gain computing power, supercomputers are becoming increasingly complex.
Accordingly, the number of events that may be generated within the supercomputer also increases. This point is all the more critical since certain problems (for example, those affecting an area of a supercomputer, a cabinet, etc.) may generate cascading events on a large number of pieces of equipment and components simultaneously or within a very short period of time.
Present solutions are based on one or more correlation engines, but they are already reaching their limits. Certain monitoring systems deployed in the field show processing delays of several hours, which may expose the supercomputer to significant risk (for example, a major incident not reported to the managers in due time).
Research and studies aimed at improving the situation essentially deal with the correlation engine itself, or with the addition of complementary modules to make the processing chain more efficient. However, an architecture designed around a correlation engine is not suited to such scaling. As a result, only ad hoc, local adaptations in the field have been possible, in order to mitigate as far as possible the unsuitability of existing monitoring systems for supercomputers.