1. Field of the Invention
The invention relates to the field of process monitoring and fault detection in a Unix environment.
2. Description of the Related Art
Unix servers are used extensively throughout industry for web servers, application servers, database servers, and other mission critical enterprise system components. Abnormal terminations (abends) of programs and processes: 1) lead to lost sales or production; 2) leave internal users idle; and 3) require development resources to work on outages rather than on new development and maintenance. These outages, regardless of size, reduce the productivity of the organization and collectively have a large cost associated with them. Unix systems, such as the Solaris™ OS from Sun Microsystems of Santa Clara, Calif., have provided no method of automatically detecting when a program terminates abnormally and then locating and collecting key information about the program and the cause of termination.
Most operating systems include fault handling utilities for generating an error message and a simple core dump in the event of an abend or interruption in processes running under the OS. While a core dump is useful for troubleshooting and analysis after an abend, it may not capture all information desired, it does not organize the captured information in an intuitive fashion, and it does not provide robust analysis, notification, or archiving of captured information. In order to provide a more complete fault handling solution that integrates with the workflow of developers, maintainers, and administrators, many fault analysis products have evolved for various OSs. For example, CICS ABEND-AID FX™, available from Compuware Corp. of Farmington Hills, Mich., provides extensive capture and analysis of abending processes in IBM mainframe environments, such as OS/390 or z/OS with CICS, MVS, or DB2 support. These fault analysis products provide fault management within a limited computing environment.
Enterprise computer systems have become increasingly complex, with a large number of computer systems contributing functions, data, communications, and user interfaces. All of these systems include processes that may generate interruptions, such as abends, time-outs, and other errors. Managing these interruptions is one of the primary challenges of administering an enterprise computer system. Fault administration applications have been developed to gather information on process interruptions throughout monitored systems. The gathered information can then be archived, abstracted, reported, used to generate notifications, or used to initiate real-time troubleshooting. Fault administration applications may use platform independent messaging and data formats to gather fault information from disparate platforms. For example, the fault administration application may include an event router on a computer system running a Windows NT OS from Microsoft Corp. and receive interruption event messages from both NT and OS/390 computer systems. Fault Manager 2.5 from Compuware Corp. is a fault administration application that supports event aggregation from multiple computing environments. Fault administration applications provide fault management solutions for multi-platform enterprise computing environments.
In order to provide fault management functions across multiple computer systems, a component for monitoring processes within each system is typically provided. The process monitoring components are customized to the OS running on each system. For example, a Windows NT compatible monitoring component would be present on each Windows NT system and a z/OS compatible monitoring component would be present on each z/OS system. The monitoring components must be capable of identifying the abnormal termination of processes running on their respective systems. The monitoring components provide notification of an abending process to the fault management application. The monitoring components also assist in identifying the location of key information regarding terminating processes, so that the information can be collected for use by the fault management application.
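By way of illustration only, the identification of an abnormal termination can be sketched for the narrow case in which the monitor itself launches the process being watched. The sketch below is not the monitoring component described in this document. It assumes a Unix-like host and Python's standard subprocess module, which reports a negative return code when a child is terminated by a signal. The function names classify_exit and run_and_classify are hypothetical. The approach covers only child processes of the monitor and says nothing about arbitrary processes elsewhere on the system.

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Classify a child's return code as reported by subprocess on Unix.

    A negative value means the child was terminated by a signal, which is
    treated here as an abnormal termination (abend).
    """
    if returncode < 0:
        sig = -returncode
        try:
            name = signal.Signals(sig).name
        except ValueError:
            # Signal number not known to this platform's signal table.
            name = f"signal {sig}"
        return ("abend", name)
    if returncode != 0:
        return ("error", f"exit code {returncode}")
    return ("normal", "exit code 0")


def run_and_classify(argv):
    """Run a command as a child process, wait for it, and classify the result."""
    completed = subprocess.run(argv)
    return classify_exit(completed.returncode)


if __name__ == "__main__":
    # Hypothetical demonstration: a child that kills itself with SIGKILL.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_classify(child))
```

A real monitoring component would additionally forward the classification as a notification and locate information about the terminated process; those steps are outside this sketch.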
Unix operating systems lack well-documented APIs for identifying and monitoring processes and capturing their abend information. Some example Unix-based operating systems include: AIX, A/UX, BSD, Debian, FreeBSD, GNU, HP-UX, Linux, NetBSD, NEXTSTEP, OpenBSD, OPENSTEP, OSF, POSIX, RISCiX, Solaris, SunOS, System V, Ultrix, USG Unix, Version 7, and Xenix. A Unix environment is one or more computers running a Unix-based operating system and associated application programs and data sources. There is a need for a fault monitoring component that identifies abending processes and assists in reporting and capturing abend information from within a Unix environment. Such a fault monitoring component would allow fault management in Unix environments and through multi-platform fault administration applications.