The invention relates generally to the field of digital computer systems and more particularly to arrangements for logging event information that is generated by various components of a distributed digital computer system, including large-scale mass-storage subsystems, to assist in diagnosing malfunctions. In particular, the invention provides a common event log that stores event information that is independently generated by a plurality of components of a distributed computer system in the order in which the events occurred so that, in the event a malfunction occurs, the log information may be used to assist in diagnosing the cause of the malfunction.
A number of facilities are available to assist in analyzing and diagnosing causes of malfunctions in complex digital computer systems. For example, interface signal analyzers, such as SCSI (small computer system interface) analyzers, optical fiber analyzers and the like are used to record and analyze signals transmitted over interfaces connecting the various subsystems comprising a complex computer system. These signals may be helpful in diagnosing hardware problems. These types of devices are typically not permanent components of a digital computer system, but instead are among the tools used by field service personnel, who bring them to the computer system's site and connect them to the computer system while performing maintenance.
Interface signal analyzers, such as those described above, have only limited utility in diagnosing malfunctions which are internal to the various subsystems comprising a complex computer system or malfunctions which occur as a result of problems with software. To help diagnose these problems, subsystems often maintain event logs, in which they store certain information concerning their status at various predetermined points in time during their operations. By analyzing the information stored in the log, the detailed operations performed by the subsystems can be analyzed and compared to their expected operations, with malfunctions being diagnosed based on deviations of the actual contents of the log from the expected contents. The use of event logs to diagnose malfunctions can be very advantageous, since the event log information can be transmitted over telephone lines, for example, to a central field maintenance location for analysis, so that a diagnosis can be performed without the necessity of having field maintenance personnel actually at the sites of the computer systems being diagnosed.
The invention provides a new and improved arrangement for storing event information that is independently generated by a plurality of components of a computer system to assist in diagnosing the causes of malfunctions which may occur.
In brief summary, a distributed computer system includes a plurality of computer nodes, including conventional digital computer systems, mass storage subsystems, servers and the like, and a common event log. The common event log includes a plurality of storage locations for storing common event log entries. Each computer node performs processing operations in connection with a program, and generates, at selected points in its program, an event log entry including status information representing the status of the computer node at the point at which the log entry was generated. The computer nodes store the event log entries which they generate in the common event log contemporaneously with the generation thereof. As a result, the event log entries are stored in the common event log in the order in which the computer nodes reach the points in their respective programs.
The common event log includes a buffer comprising a plurality of storage locations, and the location at which an entry is to be stored is pointed to by a write pointer. In one embodiment, the various computer nodes are interconnected by a common bus. When a computer node is to store a new entry in the common event log, it retrieves the write pointer, increments it and stores it back in an atomic “read/modify/write” operation over the bus, and thereafter may use the write pointer which it retrieved to store the entry in the common event log.
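By way of illustration only, the write-pointer scheme described above may be sketched in software as follows. This is a minimal sketch under stated assumptions, not the disclosed embodiment: threads stand in for the computer nodes, a `threading.Lock` stands in for the atomic read/modify/write operation over the common bus, and the names `LogEntry`, `CommonEventLog`, `reserve_slot`, `append` and `demo`, the `node_id` field, and the wraparound of the write pointer at the end of the buffer are all hypothetical choices made for the example.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEntry:
    node_id: int   # hypothetical field: which node generated the entry
    status: str    # status information at the point the entry was generated


class CommonEventLog:
    """Buffer of storage locations shared by all nodes, indexed by a write pointer."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[LogEntry]] = [None] * size
        self.write_pointer = 0
        # Assumption: this lock emulates the bus-level atomic read/modify/write.
        self._bus_lock = threading.Lock()

    def reserve_slot(self) -> int:
        """Retrieve the write pointer, increment it and store it back, atomically."""
        with self._bus_lock:
            slot = self.write_pointer
            # Assumption: the pointer wraps at the end of the buffer.
            self.write_pointer = (slot + 1) % len(self.slots)
            return slot

    def append(self, entry: LogEntry) -> int:
        """Store an entry at the location given by the retrieved write pointer."""
        slot = self.reserve_slot()
        self.slots[slot] = entry  # performed after the atomic step, outside the lock
        return slot


def run_node(log: CommonEventLog, node_id: int, count: int) -> None:
    """Simulate one node reaching `count` logging points in its program."""
    for i in range(count):
        log.append(LogEntry(node_id, f"checkpoint {i}"))


def demo(nodes: int = 4, entries_per_node: int = 10, size: int = 64) -> CommonEventLog:
    """Run several simulated nodes concurrently against one common event log."""
    log = CommonEventLog(size)
    threads = [
        threading.Thread(target=run_node, args=(log, n, entries_per_node))
        for n in range(nodes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because each node obtains a distinct value of the write pointer, no two nodes store into the same location, and the entries of any one node appear in the buffer in the order in which that node generated them. In this sketch only the retrieval and update of the pointer is atomic; the store of the entry itself follows as a separate step, mirroring the "thereafter" in the description above.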