The field of the invention is that of networks or clusters of computers, formed from a number of computers working together. These clusters are used to execute software applications providing one or more services to users. Such an application can be single-process or multi-process, and can be executed on a single computer or distributed over a number of computers, for example in the form of a distributed application of the MPI (“Message Passing Interface”) type or of the shared memory type.
The invention applies particularly to managing the functioning, within the cluster, of such an application, termed the master or primary application, for example by another software application termed the intermediate application, such as an application of the “middleware” type. This functioning management may comprise, in particular, operations of replication, redistribution, reliabilization, or tracing or “debugging” of all or part of this application, within the primary node or in collaboration with other nodes termed secondary.
In order to analyse or reliabilize the functioning of such an application, or to make it more flexible or improve its performance, it is known to use methods of recording the events occurring in this application so that they can be replayed, i.e. re-executed or caused to be produced identically, at another time or on another node or computer. However, current methods of recording events as they occur are very time-consuming and tend to slow down an application too heavily when it is in normal use.
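The record and replay principle described above can be illustrated by a minimal sketch. It is given for illustration only and is not the method of the invention nor that of any particular prior-art system; the names `EventLog`, `call` and `application` are hypothetical. Each non-deterministic event (here, a random draw) is recorded as it occurs during a first execution, and the recorded results are then supplied in the same order during a second execution, so that the application reproduces the same behaviour.

```python
import random
from typing import Any, Callable, List


class EventLog:
    """Hypothetical log: records event results, then replays them in order."""

    def __init__(self) -> None:
        self.events: List[Any] = []
        self.replaying = False
        self._cursor = 0

    def start_replay(self) -> None:
        """Switch from recording to replaying, starting at the first event."""
        self.replaying = True
        self._cursor = 0

    def call(self, source: Callable[[], Any]) -> Any:
        """Record the result of `source`, or return the logged result on replay."""
        if self.replaying:
            value = self.events[self._cursor]
            self._cursor += 1
            return value
        value = source()
        self.events.append(value)
        return value


def application(log: EventLog) -> int:
    """Toy application whose result depends on three non-deterministic inputs."""
    total = 0
    for _ in range(3):
        total += log.call(lambda: random.randint(1, 1000))
    return total


log = EventLog()
first_run = application(log)    # events are recorded as they occur
log.start_replay()
second_run = application(log)   # the same events are replayed identically
```

In this sketch every event is logged, which corresponds to the costly approach criticized above: the recording work grows with the number of events, whatever their relevance to the replay.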
In addition, if an application used in operation has not been designed from the start to produce such a record, it is difficult and costly to add such functions to it later, and doing so carries a significant risk of errors.
Some methods are also used by debugging programs, which allow the operation of an application to be monitored from outside. More often than not, however, these methods act within the computer system which executes the application, for example by changing kernel modules in the system or adding new ones. Such system changes require specific system skills, and can induce heterogeneities between several computers of a network, which can be a source of errors and instabilities. These disadvantages generally restrict the use of the record and replay principle to tuning tasks or to isolated configurations, and are unacceptable for configurations that are both extensive and under heavy load in actual production use.
A method of recording and replay is described, for example, in the 2002 article entitled “Debugging shared memory parallel programs using record/replay” by Messrs. Ronsse, Christiaens and De Bosschere in the Belgian review Elsevier B.V. This article describes the use of a method for tracing the functioning of a multi-process application with the aim of debugging it. To reduce the loss of performance due to event recording, the article proposes using intrusive methods to detect certain situations which are sources of uncertainty in the relative progress of independent events affecting a single shared resource (“race conditions”), and limiting recording to these situations.
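The idea of limiting recording to the situations where the relative order of accesses to a shared resource is uncertain can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation described in the cited article: the class `OrderedLock` and the function `run` are hypothetical, and the only event recorded is the order in which tasks obtain a single shared lock. On replay, the lock grants access in the recorded order, which is sufficient for the shared list to be rebuilt identically.

```python
import threading
from typing import List, Optional


class OrderedLock:
    """Hypothetical lock that logs acquisition order, or enforces a logged one."""

    def __init__(self, replay_order: Optional[List[int]] = None) -> None:
        self._cond = threading.Condition()
        self._held = False
        self._replay = list(replay_order) if replay_order is not None else None
        self.order: List[int] = []  # acquisition order observed in this run

    def _must_wait(self, task_id: int) -> bool:
        if self._held:
            return True
        # On replay, a task also waits until the log says it is its turn.
        return (self._replay is not None
                and self._replay[len(self.order)] != task_id)

    def acquire(self, task_id: int) -> None:
        with self._cond:
            while self._must_wait(task_id):
                self._cond.wait()
            self._held = True
            self.order.append(task_id)  # the only information recorded

    def release(self) -> None:
        with self._cond:
            self._held = False
            self._cond.notify_all()


def run(lock: OrderedLock, tasks: int = 4, rounds: int = 3) -> List[int]:
    """Run several tasks that append their identifier to a shared list."""
    shared: List[int] = []

    def worker(task_id: int) -> None:
        for _ in range(rounds):
            lock.acquire(task_id)
            shared.append(task_id)  # access to the shared resource
            lock.release()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


recorder = OrderedLock()
recorded = run(recorder)                                   # record the order
replayed = run(OrderedLock(replay_order=recorder.order))   # enforce it again
```

The sketch also shows why such an approach is intrusive: the application itself has to be written, or modified, to pass through the instrumented lock.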
However, this solution remains limited to debugging applications, more often than not outside of networks in normal operation, and uses intrusive methods which can be complex to implement, carry risks of error, and can depend largely on the constitution of the application to be traced. In particular, while the master application is running, the logging operations represent a work load for the operational node, and the action of the intermediate application can thus cause a loss of performance.