An error is an unexpected condition, result, signal, or datum in a computer system or network. A fault is a defect that may produce an error. When an error occurs that is attributable to a fault, logging the error and fault information helps to diagnose the fault and effect recovery.
Conventional methods of logging system information include obtaining messages that provide information about the overall system, obtaining subsystem-specific kernel statistics, and polling subsystems to extract other subsystem-specific data. The aforementioned methods of logging system information may be coupled with a conventional notification scheme to allow data to be obtained when specific events occur. The system information obtained using the conventional methods is typically either processed by a system-specific diagnosis engine or read by a user to determine the fault associated with the error.
In general, the conventional methods that provide information about the overall system centrally log events that occur anywhere in the system. The conventional methods log events such as password changes, failed login attempts, connections from the internet, system reboots, etc. Information about the events is typically recorded in text format in a log file or series of log files, depending on how the utility performing the conventional message logging is configured. The individual events that are recorded are referred to as messages. In some cases, the utility performing the conventional message logging is configured to incorporate various types of information into the messages. For example, the messages may include information regarding the severity level of the event.
Further, the utility performing the conventional message logging may also be configured to indicate the general class of a program, commonly referred to as the “facility,” that is generating the message. The standard facility names include: “kern” indicating that the message originated from the operating system, “user” indicating that the message originated from a user process, “mail” indicating that the message originated from the mail system, “auth” indicating that the message originated from a user authentication process or system, “daemon” indicating that the message originated from a system daemon (e.g., ftpd, telnetd), “syslog” denoting internal syslog messages, etc.
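As an illustration of how a facility and a severity level may be carried together in a message, the following minimal sketch combines the two into a single numeric priority. It assumes the BSD syslog convention (priority = facility code x 8 + severity code); the function names encode_priority and decode_priority are hypothetical and are not part of any particular logging utility.

```python
# Facility and severity codes under the BSD syslog convention (assumed here).
# Only the facilities named in the text above are listed.
FACILITIES = {"kern": 0, "user": 1, "mail": 2, "daemon": 3, "auth": 4, "syslog": 5}
SEVERITIES = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
              "warning": 4, "notice": 5, "info": 6, "debug": 7}


def encode_priority(facility: str, severity: str) -> int:
    """Combine a facility name and a severity name into one priority value."""
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


def decode_priority(priority: int) -> tuple[str, str]:
    """Split a priority value back into (facility name, severity name)."""
    facility_names = {code: name for name, code in FACILITIES.items()}
    severity_names = {code: name for name, code in SEVERITIES.items()}
    return facility_names[priority // 8], severity_names[priority % 8]
```

Under these assumptions, a message from a system daemon at the "info" level would carry the priority produced by encode_priority("daemon", "info"), and decode_priority recovers both names from that value.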
The following is a code sample showing an example of a message obtained using conventional message logging.
Code Sample 1
Jun 19 19:13:45 ice in.telnetd[13550]: connect from alpha.eng.brandon.edu
      (1)       (2)    (3)      (4)                   (5)
Segment (1) corresponds to the date and time of the logged information, segment (2) corresponds to the system the message originated from (i.e., ice), segment (3) corresponds to the process the message originated from (i.e., in.telnetd), segment (4) corresponds to the identification number of the originating process, and segment (5) corresponds to the body of the message.
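The five segments described above can be recovered programmatically. The following is a minimal sketch, assuming the message follows the "date host process[pid]: body" layout of the sample; the pattern MESSAGE_RE, the function parse_message, and the field names are hypothetical and chosen only for illustration.

```python
import re

# Assumed layout: "<Mon> <day> <hh:mm:ss> <host> <process>[<pid>]: <body>".
MESSAGE_RE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"  # segment (1)
    r"(?P<host>\S+)\s"                                               # segment (2)
    r"(?P<process>[^\s\[:]+)"                                        # segment (3)
    r"(?:\[(?P<pid>\d+)\])?:\s*"                                     # segment (4)
    r"(?P<body>.*)$"                                                 # segment (5)
)


def parse_message(line: str) -> dict:
    """Split one logged message into its five segments."""
    match = MESSAGE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized message format: {line!r}")
    fields = match.groupdict()
    if fields["pid"] is not None:
        fields["pid"] = int(fields["pid"])  # process identification number
    return fields


sample = "Jun 19 19:13:45 ice in.telnetd[13550]: connect from alpha.eng.brandon.edu"
fields = parse_message(sample)
```

Messages that do not follow the assumed layout raise a ValueError rather than being partially parsed, which keeps malformed entries from being silently misattributed to a host or process.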
Utilities performing the conventional message logging may also include functionality to perform a certain action when a particular event occurs. The actions may include, for example, logging the message to a file, logging the message to a user's screen, logging the message to another system, and logging the message to a number of screens.
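The association between events and actions can be pictured as a small rule table. The sketch below is illustrative only: the Rule class, the dispatch function, and the use of in-memory lists in place of a log file and a user's screen are all assumptions, and matching is done on the facility name alone.

```python
from dataclasses import dataclass
from typing import Callable

Action = Callable[[str], None]


@dataclass
class Rule:
    facility: str   # facility name to match, or "*" to match any facility
    action: Action  # what to do with a matching message


def dispatch(facility: str, message: str, rules: list[Rule]) -> int:
    """Run the action of every matching rule; return how many actions ran."""
    ran = 0
    for rule in rules:
        if rule.facility in ("*", facility):
            rule.action(message)
            ran += 1
    return ran


log_file: list[str] = []  # stands in for a log file
console: list[str] = []   # stands in for a user's screen

rules = [
    Rule("daemon", log_file.append),  # daemon messages are written to the file
    Rule("*", console.append),        # every message is shown on the screen
]
```

A single message may trigger several actions, which mirrors the ability to log the same message to a file and to one or more screens.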
As noted above, information about system events may also be obtained using subsystem-specific kernel statistics. The utilities that obtain subsystem-specific kernel statistics typically examine the available kernel statistics on the system and report those statistics that are requested by the user. The user typically specifies which statistics she would like to see using a command line interface. An example of output obtained using a typical utility for obtaining subsystem-specific kernel statistics is shown below.
Output Sample
cpu_stat:0:cpu_stat0:intr 29682330
cpu_stat:1:cpu_stat1:intrblk 51
  (1)   (2)  (3)      (4)   (5)
Segment (1) corresponds to the module where the statistic was generated. Segment (2) corresponds to the instance of the module in the system; in this particular example there are two instances (0 and 1) of the cpu_stat module. Segment (3) corresponds to the name of the module instance (e.g., cpu_stat0) from which the statistic is obtained. Segment (4) corresponds to the particular kernel statistic within the module (e.g., intrblk). Segment (5) corresponds to the time, in fractional seconds since the system booted, that the particular process tracked by the system has been operating. While the above example provides the fractional time that a given process within a module has been operating, the utilities providing subsystem-specific kernel statistics typically include functionality to provide additional kernel statistics including user access, system calls, wait times, etc.
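A line of such output can be split into its five segments with very little code. The following sketch assumes each line has the form "module:instance:name:statistic" followed by whitespace and the reported value, as in the sample; the KernelStat type and the parse_stat function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class KernelStat(NamedTuple):
    module: str     # segment (1): module where the statistic was generated
    instance: int   # segment (2): instance of the module
    name: str       # segment (3): name of the module instance
    statistic: str  # segment (4): particular kernel statistic
    value: str      # segment (5): reported value, kept as text


def parse_stat(line: str) -> KernelStat:
    """Split one 'module:instance:name:statistic value' line into segments."""
    # Raises ValueError if the value or any of the four key parts is missing.
    key, value = line.split(None, 1)
    module, instance, name, statistic = key.split(":")
    return KernelStat(module, int(instance), name, statistic, value.strip())


stats = [
    parse_stat("cpu_stat:0:cpu_stat0:intr 29682330"),
    parse_stat("cpu_stat:1:cpu_stat1:intrblk 51"),
]
```

The value is kept as text because different statistics may report integers, fractional seconds, or strings, and the sketch makes no assumption about which applies.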
The aforementioned means of obtaining system information may be used with notification schemes to obtain system data when a particular event occurs. The notification schemes typically use traps to trigger the recording of an event.