1. Technical Field
The present invention relates to an improved data processing system and, in particular, to a method and system for improving data processing system reliability.
2. Description of Related Art
A computer includes both a physical machine, namely the hardware, and the instructions which cause the physical machine to operate, namely the software. Software includes both application and operating system processes. If a process simply performs tasks for a user, such as solving specific problems, it is referred to as application software. If a process controls the hardware of the computer and the execution of the application processes, it is called operating system software. System software includes the operating system, which is the process that controls the actual computer or central processing unit (CPU), and device drivers, which control input and output (I/O) devices such as printers and terminals.
A number of application processes are usually present, waiting to use the CPU. The operating system determines which process will run next, how much CPU time it will be allowed to use, and what other computer resources it will be allowed to access and use. Further, an application process may require a particular input or output device, in which case the application process must transfer its data to the operating system, which controls the device drivers.
However, these processes frequently fail. When such a failure occurs, either the task, in the case of an application process, or the computer system, in the case of the operating system, terminates operation. There is presently no mechanism by which one computer process monitors another process to detect when such a failure occurs. Restarting a failed process currently must be handled by a variety of processes, and there is no mechanism for providing an automatic restart capability to ensure that any process experiencing a software failure is automatically restarted. Furthermore, there is no mechanism which allows a process to be enabled or disabled during the normal operation of the operating system.
Therefore, it would be advantageous to have a method for mutual computer process monitoring and restart, in which a process within a set of processes monitors another process within the set. Several cooperating computer processes would thereby ensure robustness in the event that one of the processes terminates abnormally.
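By way of illustration only, the general idea of one process monitoring another and restarting it after an abnormal termination may be sketched as follows. This is a minimal sketch under stated assumptions and not the method of the present invention: the function name monitor_and_restart and the parameter max_restarts are hypothetical, a nonzero exit code is assumed to indicate abnormal termination, and only one direction of monitoring is shown.

```python
import subprocess
import sys


def monitor_and_restart(command, max_restarts=3):
    """Run `command` and restart it whenever it exits abnormally.

    Hypothetical helper for illustration. A nonzero exit code is
    assumed to mean abnormal termination. Returns the number of
    restarts that were performed.
    """
    restarts = 0
    while True:
        # Start the monitored process and block until it terminates.
        exit_code = subprocess.call(command)
        if exit_code == 0:
            # Normal termination: nothing to recover.
            return restarts
        if restarts >= max_restarts:
            # Stop so that a process which always fails cannot loop forever.
            return restarts
        # Abnormal termination detected: loop around to restart the process.
        restarts += 1


if __name__ == "__main__":
    # A stand-in monitored process that always terminates abnormally.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(monitor_and_restart(failing, max_restarts=2))
```

In a mutual arrangement, each cooperating process would run a loop of this kind against a peer, so that the monitoring process is itself monitored. The restart limit is a design choice of the sketch that bounds repeated restarts of a process that cannot recover.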