The invention relates to monitoring processes in a computer system. More particularly, the invention relates to a method for monitoring a computer process, to a monitor for monitoring a computer process, to a configuration management system for a computer system and to a computer system incorporating a computer system process and/or a managed process.
The invention finds particular, but not exclusive, application to the monitoring of a process called a configuration management system (CMS) daemon (CMSD). A daemon provides a background service in a computer system.
A CMSD manages various system entities, or objects, which could be physical devices or could be software entities. In a particular example, a CMSD is connected to application programs via a UNIX socket through an application program interface (API) (UNIX is a registered trademark in the United States and other countries, exclusively licensed through X/Open Company, Ltd.). The behavior of the CMSD is specified using CMS definitions (CMSDEFs). A CMSDEF includes declarations for objects managed by the CMSD, state evaluations (statements for evaluating the states of objects), and transition code which is executed when a transition occurs between the states of an object.
The CMSDEFs can be thought of as being similar to a set of state machines for the objects to be managed, and the CMSD executes the state machines.
As mentioned, an example of a CMS operates as a daemon (i.e., it supplies a management service in the background). If the CMSD service becomes unavailable, then at least aspects of the operation of the computer system may be compromised. For example, in a particular example of a CMSD for use in a fault tolerant computer system, if the CMSD service becomes unavailable, then the fault tolerance can be compromised. Accordingly, it is necessary to monitor the CMSD to ensure that, should it die during operation, some corrective action will be taken.
In a prior example of a CMSD for a fault tolerant computer system operable under the UNIX operating system, a simple monitor was provided. This monitor was configured to search for the name of the CMSD process in a UNIX process table. If the process table no longer contained any process of that name, the monitor generated an error message. Of course, a process masquerading as CMSD, or a child of CMSD, or even a non-functioning CMSD that had not been flushed from the process table, would satisfy this monitor. Additionally, with the prior example of a CMSD monitor, the only facility offered was to indicate when the CMSD died, without any recovery being effected. As a result it was necessary for the operator to restart the CMSD manually.
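The weakness of a name-only check can be illustrated with a minimal sketch. The in-memory process table, the `ProcEntry` type and the state labels below are hypothetical stand-ins for a real UNIX process table and are not taken from the prior monitor itself.

```python
from dataclasses import dataclass


@dataclass
class ProcEntry:
    pid: int
    name: str
    state: str  # hypothetical labels: "running" or "defunct"


def name_based_check(table, name):
    """Prior-style check: satisfied by any entry that carries the name."""
    return any(entry.name == name for entry in table)


def pid_based_check(table, expected_pid):
    """Stricter check: the exact PID must be present and still running."""
    return any(
        entry.pid == expected_pid and entry.state == "running"
        for entry in table
    )
```

With a table holding only a defunct CMSD entry and an unrelated process that happens to use the same name, `name_based_check` still reports success, whereas a check tied to the expected PID does not.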
An approach to process monitoring could be based on a system which implements a parent-child approach to process creation, for example in the manner of a UNIX style operating system. With such an approach, a monitored process would be created by a process monitor in the form of a further process that always acts as the parent of the monitored process. This would give the monitor process direct access to information about the monitored process and would usually include it being informed about the death of the monitored process by the operating system. However, the reliance on a direct parent-child relationship puts constraints on the overall system. Also, the monitored process might fail in ways that would not be communicated to its parent by the operating system.
Accordingly, an aim of the invention is to provide for process monitoring with a higher degree of reliability than is available with prior approaches, while still providing for flexible operation and, where possible, automatic restarting of a monitored process that has failed.
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Combinations of features from the dependent claims may be combined with features of the independent claims as appropriate and not merely as explicitly set out in the claims.
In accordance with one aspect of the invention, there is provided a method by which a process monitor monitors a process in a computer system, where the monitored process is not a child of the process monitor. The process monitor uniquely determines the identity of a monitored process and verifies the correct operation of the monitored process. In the absence of verification of the correct operation of the monitored process, the monitored process is caused to initiate (to restart). On successful restarting of the monitored process, the monitored process is uniquely identified to the system.
In accordance with another aspect of the invention, there is provided a method of initiating a process to be monitored in a computer system. The method of initiating a process to be monitored comprises the spawning of a new process by, for example, an upgrade version of an existing process, and then the new process checking that it is operable. In response to a positive result to the tests, the monitored process uniquely identifies itself to the computer system and causes the existing monitored process to terminate, whereby the new process becomes the monitored process.
In accordance with a further aspect of the invention, there is provided a computer system comprising a process to be monitored, the process to be monitored being configured, on successful initiation (starting), uniquely to identify itself to the system, and a process monitor configured: uniquely to determine the identity of a monitored process; to verify correct operation of the monitored process; and, in the event of being unable to verify correct operation of a monitored process, to cause the monitored process to initiate (to restart).
It should be noted that where reference is made to initiating a process, this can relate to starting a new process or an upgrade version of a process, or restarting an existing process, as appropriate. The monitoring of processes in the manner of an embodiment of the invention means that there is no need to rely on a parent-child relationship. This enables ‘abdication’ by one process to an upgrade version of that process while still providing continuity and reliable monitoring.
In accordance with yet another aspect of the invention, there is provided a process monitor for such a computer system, the process monitor being configured uniquely to determine the identity of a monitored process, to verify correct operation of the monitored process, and, in the event of being unable to verify correct operation of a monitored process, to cause the monitored process to initiate (to restart).
In accordance with yet a further aspect of the invention, there is provided a process to be monitored, for example in the form of a configuration management system for such a computer system, which is configured, on being initiated (started) by a process monitor, to check that it is operable; and, if so, to provide an indication of this to the process monitor prior to detaching itself from the process monitor.
An embodiment of the invention thereby seeks to provide a solution to the limitations of the prior approaches by providing a process monitor that can monitor the health (or successful operation) of one or more monitored processes that are not children of the process monitor. The process monitor seeks uniquely to identify a monitored process. If successful, it then carries out checks on the process to ensure that the process is still operating. In the event that the process has died, the process monitor then restarts the monitored process. Checks are performed on the monitored process (for example the monitored process may perform self-tests) to ensure it can proceed, before indicating to the monitor process that it has successfully started. An embodiment of the invention enables monitoring of processes without relying on a parent-child relationship and permits new or upgrade versions of processes to be started and for control to be passed reliably from an old process to a new or upgrade process.
The step of determining the identity of a monitored process can involve accessing a pre-determined file or other location containing the process identification information, which is unique to the monitored process. Each monitored process can be arranged, on initiation, to write its process identification (PID) information to the file so that it is then available for the process monitor to access. If the process monitor is unable to access the file, or accesses the file and does not find a PID for a process which it expects to find there, the system has no information relating to that PID, and the process monitor will cause the monitored process to be initiated (started).
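The PID file exchange described above might be sketched as follows. This is an illustration only: the function names are hypothetical, and the write-to-temporary-file-then-rename step is an added assumption (so that a reader never sees a half-written file), not something the text prescribes.

```python
import os
import tempfile
from typing import Optional


def write_pid_file(path: str, pid: int) -> None:
    """Monitored-process side: publish the PID at the agreed location."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{pid}\n")
    os.replace(tmp_path, path)  # assumption: rename so readers see a whole file


def read_pid_file(path: str) -> Optional[int]:
    """Monitor side: return the recorded PID, or None if none is usable."""
    try:
        with open(path) as handle:
            text = handle.read().strip()
    except OSError:
        return None  # file missing or unreadable
    try:
        pid = int(text)
    except ValueError:
        return None  # file present but does not hold a PID
    return pid if pid > 0 else None
```

A `None` result corresponds to the case in which the monitor has no usable PID and therefore causes the monitored process to be started.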
The restarting of the monitored process is preferably effected in two steps.
The first step causes the monitored process to start up and to perform checks to ensure that it should be operable (i.e. able to execute or function successfully). This can involve, for example, verifying that it can correctly establish a database needed for carrying out its various functions. The monitored process can be very critical of its operability at this stage, so that it does not continue if there are potential faults.
If the monitored process is not able to execute successfully, it can be arranged to handshake with the process monitor, indicating that it could not execute. The process monitor can be arranged to issue an error message to the user indicating that some kind of manual intervention is necessary to fix the problem that causes the monitored process to fail. No further attempts are made to start the monitored process until the manual intervention has been completed.
Alternatively, if the monitored process was able to execute successfully, the second step in restarting the monitored process occurs. The monitored process writes its PID to the predetermined file and then handshakes with the process monitor indicating that it is able to execute successfully. The process monitor then proceeds to monitor the new monitored process.
This mechanism ensures that the monitor process will not thrash in order to try to get a faulty monitored process running. For example, if the monitored process is a CMSD and it is attempting to operate on erroneous CMSDEFs, then the initial monitored process (CMSD) would terminate with an error message to the process monitor. The process monitor would then be arranged to issue an error message to alert an operator to the fact that the monitored process would not run. Without the two step process, the monitoring process could thrash while trying repeatedly to start a CMSD that had failed immediately on start-up, for example as a result of a configuration problem. The two step process avoids the need for the process monitor to differentiate between a CMSD that failed randomly and one that would always fail. It should be noted that the invention finds particular, but not exclusive, application to the operation of a CMSD as the monitored process.
In accordance with another aspect of the invention, a similar approach to starting a new process is involved when one process spawns another, for example to take account of system changes. An example of this is where an upgrade version of a monitored process (an upgrade process) initiates a new CMSD to accommodate changes to the CMSDEFs. The old CMSD and the new CMSD perform checks to ensure that the new CMSD can run in a stable manner. Only when this has been confirmed does the new CMSD write its PID to the PID file and request the old CMSD to terminate.
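The monitor's side of the two step restart can be sketched as below. The class, method and enum names are hypothetical, and the start-up handshake is modelled simply as the return value of a caller-supplied `start_process` function; this is a sketch of the behavior described, not a definitive implementation.

```python
from enum import Enum


class StartResult(Enum):
    OK = 0                # start-up checks passed, PID has been recorded
    SELF_TEST_FAILED = 1  # start-up checks failed, process did not continue


class RestartPolicy:
    def __init__(self, start_process, alert):
        self.start_process = start_process  # callable returning a StartResult
        self.alert = alert                  # callable taking a message string
        self.restarts_suppressed = False

    def handle_death(self) -> bool:
        """Called when the monitored process is found dead."""
        if self.restarts_suppressed:
            return False  # wait for the operator rather than thrash
        if self.start_process() is StartResult.OK:
            return True   # resume monitoring the new process
        self.restarts_suppressed = True
        self.alert("monitored process failed its start-up checks; "
                   "manual intervention required")
        return False

    def manual_intervention_done(self) -> None:
        """Called once the operator has fixed the underlying problem."""
        self.restarts_suppressed = False
```

Because a failed start-up check suppresses further attempts, the monitor never needs to decide for itself whether a failure was random or would recur; the monitored process reports that through the handshake.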
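The ordering of the hand-over can be made explicit in a short sketch. The function and parameter names are hypothetical; the checks, the PID file write and the request to the old process are passed in as callables because the text does not specify how each is carried out.

```python
def hand_over(new_pid, checks, write_pid, request_old_exit):
    """New-process side: take over only after every check has passed."""
    if not all(check() for check in checks):
        return False        # old process stays in service; PID file untouched
    write_pid(new_pid)      # the monitor now follows the new process
    request_old_exit()      # only then is the old process asked to terminate
    return True
```

Writing the PID before requesting termination means the PID file never names a process that has already been told to exit.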
In a particular embodiment of the invention, a CMSD (the monitored process) writes its PID to a file known to the process monitor. The process monitor then reads this and uses it to access the CMSD process information at regular intervals from the processor file system. Should the process information indicate that a CMSD is no longer alive, an alarm is asserted and an attempt made to spawn a new CMSD. The CMSD always ‘backgrounds itself’ (i.e. it forks, then the parent exits), so the monitor will still not be the parent. In order to avoid system thrashing (i.e., continually restarting a CMSD that is unable to run because of its configuration or environment, for example), the newly started CMSD performs ‘self tests’ before “backgrounding” itself. By “backgrounding” itself is meant that it detaches itself from other processes so as to operate independently in the background. The success or otherwise of these tests is passed back to the monitor using the parent's exit code, and should the tests have failed, further restarts are suppressed. Once a successful CMSD is in place (either by external intervention, or because CMSD was successfully restarted by the monitor), the alarm is de-asserted and the monitoring continues. If the CMSD needs to be upgraded, a protocol exists to allow a new CMSD to take over from the old CMSD without interrupting the service. When this happens, the monitor must switch from monitoring the old CMSD to monitoring the new CMSD safely. This is achieved by the new CMSD writing its PID to the file (for the benefit of the monitor) only when it has successfully taken over the service, and immediately before it instructs the old CMSD to exit.
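A minimal sketch of this embodiment on a UNIX-like system follows. All function names and the two exit code values are assumptions made for illustration. The liveness check uses signal 0 as a simple substitute for reading process information from a process file system, and the forking parent is assumed to record the child's PID before it exits so that the PID is on file before success is reported.

```python
import os
import subprocess
import sys

EXIT_OK = 0                # assumed exit code: self tests passed
EXIT_SELF_TEST_FAILED = 1  # assumed exit code: self tests failed


def start_and_background(self_test, write_pid, serve):
    """Daemon side: self test first, then fork so that the parent can exit."""
    if not self_test():
        sys.exit(EXIT_SELF_TEST_FAILED)  # the monitor sees a failing exit code
    child = os.fork()
    if child > 0:
        write_pid(child)   # record the surviving process before reporting
        os._exit(EXIT_OK)  # parent exits; its exit code is the handshake
    os.setsid()            # child detaches from the session that started it
    serve()


def spawn_and_check(argv):
    """Monitor side: start the daemon and wait for the parent's exit code."""
    return subprocess.run(argv).returncode == EXIT_OK


def is_alive(pid):
    """Monitor side: ask the operating system whether the PID still exists."""
    try:
        os.kill(pid, 0)    # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True        # the process exists but belongs to another user
    return True
```

A monitor loop would call `is_alive` on the PID read from the file at regular intervals and fall back to `spawn_and_check` when it returns false. One limitation of the signal 0 substitute is that a process which has exited but has not yet been reaped still counts as present.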
The process monitor can interrogate the operating system to verify correct operation of the CMSD. As an alternative, the process monitor could test whether the CMSD is functioning by making service requests to the CMSD. Such an approach, while providing a higher degree of security than interrogating the operating system, would involve a higher overhead due to the extra processing by the CMSD.
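The alternative of testing the CMSD through a service request might look like the sketch below. The socket path, the `PING` request and the `PONG` reply are purely hypothetical, since the text does not define the request protocol; only the use of a UNIX socket is taken from the description.

```python
import socket


def probe_service(socket_path: str, timeout: float = 2.0) -> bool:
    """Return True if the daemon answers a request within the timeout."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
            conn.settimeout(timeout)
            conn.connect(socket_path)
            conn.sendall(b"PING\n")  # hypothetical request
            reply = conn.recv(64)
    except OSError:
        return False                 # no socket, refused, or timed out
    return reply.startswith(b"PONG")  # hypothetical reply
```

Each probe makes the daemon accept a connection and produce a reply, which is the extra processing overhead referred to above; in exchange, a daemon that is present in the process table but no longer serving requests would be detected.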
The process monitor and/or the monitored process can be in the form of computer programs comprising computer code, or instructions, defining the functionality of the process monitor and/or monitored process.
Accordingly, an aspect of the invention also provides a carrier medium carrying process means for controlling a process to be monitored for a computer, the process means being configured, on being initiated by a process monitor, to check that it is able to operate successfully, and, if so, to provide an indication of this to the process monitor prior to backgrounding itself.
An aspect of the invention also provides a carrier medium carrying process means for initiating a process to be monitored for a computer, the process means being configured, on being spawned by an existing monitored process, to check that it is able to function correctly, and, in response to a positive result to the tests, uniquely to identify itself to the system and to terminate the existing monitored process, whereby the new process becomes the monitored process.
An aspect of the invention further provides a carrier medium carrying process means configured to define a process monitor for a computer, the process monitor being configured uniquely to determine the identity of a monitored process, to verify correct operation of the monitored process, and, in the event of being unable to verify correct operation of a monitored process, to cause the monitored process to initiate.
The carrier medium can be any form of carrier medium for carrying computer program code, whether that be a magnetic, optical or any other form of data storage such as a tape, disk, solid state, or other form of storage providing random or read-only or any other form of access, or a transmission medium such as a telephone wire, radio waves, etc.