This invention relates generally to process monitoring software, and more particularly to providing a method and system for the fast detection of process outages in a server environment. Examples of processes monitored for outages include: application processes, operating system processes, and tool processes.
Computer servers are used extensively throughout industry for Internet servers, application servers, database servers, communication servers, and other mission critical enterprise system components. Computer servers configured for the aforementioned activities and applications require a high level of availability. A high level of availability requires the minimization of server downtime that results from abnormal or premature terminations of programs and processes. Server downtime leads to loss of sales, lower productivity, and unreliable communications. In order to minimize server downtime, process monitoring software is employed to detect abnormal and premature terminations of programs and processes.
An existing approach to maintaining a high level of availability of computer servers is the use of automation managers. An automation manager is a software component operating in the field of computing resource management, which may comprise the management of software resources as well as hardware resources. Information technology (IT) systems often employ a variety of automation managers to automate the handling of the system's vast quantity of resources. For example, a high-availability automation manager may be configured to check the availability of an application (e.g., a software resource), and trigger recovery actions when an error occurs. A performance manager may monitor the application's performance data and may dynamically adjust application usage based on performance objectives. A provisioning manager may dynamically change which systems are able to interact with software resources and/or which software resources are available for interaction. Resources, such as a software resource, may be on a stand-alone computing system, or on more complex systems, such as distributed computing environments, autonomic environments, clustered computing environments, virtual computing environments, etc.
Automation managers provide high availability and policy-based automation for applications and services across heterogeneous environments. An automation manager helps to meet high levels of availability and prevent service disruptions for critical applications. Having knowledge about dependencies between applications, automation managers initiate, execute and coordinate the starting, stopping, restarting in place, and fail over (restart on a different machine) of individual application components or entire composite applications to help reduce the frequency and duration of incidents that impact IT availability.
A basic prerequisite for automation is availability monitoring. The automation manager identifies the applications, programs or resources to be monitored by their process identifiers (pids). A pid is a unique positive integer that the operating system assigns to a process, and it is not reused until the process lifetime ends. Additionally, pids are utilized internally by processes to communicate.
The automation manager monitors processes via their pid. In the event a pid is no longer available, it is inferred that an associated application has terminated. For an application termination that is unexpected, such as a process failure, the automation manager reacts immediately to re-activate the application (restart in place or fail over to a standby node). The faster the automation manager detects the failure, the faster the application may be made available again. Fast detection is an important factor in achieving high application availability.
In order for an automation manager to manage an application, the automation manager requires, at a minimum, the following three pieces of information: a start command, a stop command, and a monitoring method. A standard monitoring method is to start the ‘to be monitored’ application process as a child process of the monitoring process and to put the monitoring process into a wait state. By starting the child process, the monitoring process becomes the parent of that application process. It is standard operating system behavior to inform the parent (which is the monitoring process) when the child terminates, by sending the parent a signal that wakes it up from its wait state. The monitoring process in turn notifies the automation manager. It is noted that this method works only for this parent/child process relationship. Other monitoring processes may be active that cannot be the parent of a ‘to be monitored’ application process. Such monitoring processes are ‘interested processes’ in the sense that they also want to be informed by the operating system when a ‘to be monitored’ process ends.
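By way of illustration only, the parent/child monitoring method described above may be sketched as follows. The function name monitor_child, the notify callback (standing in for the notification to the automation manager), and the short-lived stand-in application are assumptions introduced for this sketch and are not part of any particular automation manager.

```python
import subprocess
import sys


def monitor_child(command, notify):
    """Start `command` as a child process, wait for it to end, then notify."""
    child = subprocess.Popen(command)   # the monitoring process becomes the parent
    exit_code = child.wait()            # wait state; returns once the child terminates
    notify(child.pid, exit_code)        # hypothetical hook for the automation manager
    return child.pid, exit_code


# Hypothetical stand-in for the monitored application: a process that exits with code 3.
APP_COMMAND = [sys.executable, "-c", "import sys; sys.exit(3)"]

reported = []
monitor_child(APP_COMMAND, lambda pid, code: reported.append((pid, code)))
print(reported)
```

Because the wait is performed on the child handle, this approach is available only to the process that started the application, which is the limitation noted above.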
A more general standard monitoring method checks periodically whether a specific program is running by checking if the pid is active. An operating system (e.g. UNIX, Linux, Windows, etc.) maintains a list of all running/active processes.
Table 1 illustrates a typical UNIX/Linux process table. The process table has an application with process name “progA” that has been started by a user “jane” with an associated pid of 1716, and an application with process name “progB” that has been started by a user “jim” with an associated pid of 1725. On UNIX/Linux systems the process list may be shown via a ps (“process status”) command.
TABLE 1

USER    PID     CMD
jane    1716    progA
jim     1725    progB
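As a minimal sketch, a lookup of a pid by user name and process name in a listing shaped like Table 1 may be written as follows. The function name find_pid and the three-column layout are assumptions made for this example; the column layout of actual ps output depends on the options given to the ps command and on the platform.

```python
from typing import Optional

# Text laid out like Table 1 (USER, PID, CMD).
PS_OUTPUT = """\
USER    PID     CMD
jane    1716    progA
jim     1725    progB
"""


def find_pid(ps_text: str, user: str, process_name: str) -> Optional[int]:
    """Return the pid of the first row matching user and process name, else None."""
    lines = ps_text.strip().splitlines()
    for row in lines[1:]:                  # skip the header row
        fields = row.split(None, 2)        # USER, PID, CMD (CMD may contain arguments)
        if len(fields) != 3:
            continue                       # ignore rows that do not have three columns
        row_user, row_pid, row_cmd = fields
        if row_user == user and row_cmd.split()[0] == process_name:
            return int(row_pid)
    return None


print(find_pid(PS_OUTPUT, "jane", "progA"))  # 1716
```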
The automation manager may also use a program interface (system call) instead of the ps command for process monitoring. A program interface is usually faster than the ps command and provides the process table in binary format. Once a pid has been identified, the automation manager may subsequently check via a system call whether the process associated with the pid is still running.
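One possible form of such a check on POSIX systems, offered as a sketch rather than as the interface any given automation manager uses, is to send signal 0 with the kill system call. Signal 0 performs the existence and permission checks without delivering a signal. The function name pid_is_running is an assumption introduced for this example.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    if pid <= 0:
        # Zero and negative values address process groups in kill(), not one process.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)            # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False               # ESRCH: no process with this pid
    except PermissionError:
        return True                # EPERM: the process exists but belongs to another user
    return True


print(pid_is_running(os.getpid()))
```

A check of this kind reports only that some process holds the pid. It does not by itself confirm that the process is still the originally monitored application.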
FIG. 1 is a flow chart of an existing method for standard process monitoring with an automation manager. The automation manager has the following parameters: a process name, a user name, and a monitoring interval. For example, from Table 1, the process name is “progA”, the user name is “jane”, and the monitoring interval is 30 seconds. The monitoring method starts (block 100) with reading a process table from an operating system kernel (block 102).

A kernel is the essential center of a computer operating system. A kernel acts as a core or nucleus that provides basic services for all other parts of the operating system. Kernel is a term used most frequently in UNIX and Linux operating systems, though similar concepts exist in other operating systems under different terminology. Typically, a kernel (or any comparable center of an operating system) includes an interrupt handler that handles all requests or completed I/O operations that compete for the kernel's services, a scheduler that determines which programs share the kernel's processing time and in what order, and a supervisor that actually gives use of the computer to each process when it is scheduled. A kernel may also include a manager of the operating system's address spaces in memory or storage, sharing these among all components and other users of the kernel's services. A kernel's services are requested by other parts of the operating system or by application programs through a specified set of program interfaces sometimes known as system calls.
Continuing with FIG. 1, the monitoring method of the automation manager searches for a user/process name in the process table, and retrieves and remembers an associated process identifier pid of the process to be monitored (block 104). There are other existing means to obtain the pid that is associated with the application. For example, the application itself may store the associated pid in a file, which is read by the automation manager.
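As a minimal sketch of the file-based alternative, and assuming the application writes its pid as a single decimal number to a file, the automation manager may read the pid as follows. The function name read_pid_file and the file name progA.pid are hypothetical and chosen only for this example.

```python
import tempfile
from pathlib import Path


def read_pid_file(path) -> int:
    """Read a pid stored as a decimal number in a text file."""
    pid = int(Path(path).read_text().strip())   # raises ValueError if not a number
    if pid <= 0:
        raise ValueError("pid file does not contain a positive pid")
    return pid


# Demonstration with a temporary file standing in for the application's pid file.
with tempfile.TemporaryDirectory() as tmp:
    pid_file = Path(tmp) / "progA.pid"          # hypothetical file name
    pid_file.write_text("1716\n")
    demo_pid = read_pid_file(pid_file)

print(demo_pid)  # 1716
```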
The monitoring method determines (checks) if the pid is still active or running (decision block 108). In the event the pid is still running (decision block 108 is Yes), the monitoring method waits for the defined monitoring interval (block 106) and then checks again whether the pid is active (decision block 108). In the event the pid is no longer running (decision block 108 is No), the monitoring method notifies the automation manager that the application identified with the user/process name and associated pid is no longer available (block 110), and the monitoring method concludes (block 112).
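The polling loop of FIG. 1 may be sketched as follows, for illustration only. The function name monitor_pid and its parameters are assumptions introduced here: is_running stands for whatever pid check is used, notify stands for the notification to the automation manager, and sleep is injectable so that the loop can be exercised without real waiting. Placing the wait before each check follows the block numbering and is likewise an assumption about the ordering in the figure.

```python
import time


def monitor_pid(pid, is_running, notify, interval_seconds=30.0, sleep=time.sleep):
    """Poll until is_running(pid) is False, notify once, and return the check count."""
    checks = 0
    while True:
        sleep(interval_seconds)      # block 106: wait for the monitoring interval
        checks += 1
        if not is_running(pid):      # decision block 108: is the pid still active?
            notify(pid)              # block 110: report the outage
            return checks            # block 112: monitoring concludes


# Simulated run: the process is seen alive twice, then found gone on the third check.
states = iter([True, True, False])
waits, outages = [], []
total_checks = monitor_pid(1716, lambda pid: next(states), outages.append,
                           interval_seconds=30.0, sleep=waits.append)
print(total_checks, waits, outages)  # 3 [30.0, 30.0, 30.0] [1716]
```

With this scheme an outage may go unnoticed for up to one full monitoring interval, which is why the length of the interval bounds the detection time.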