To provide high throughput of work, or to ensure nearly continuous availability, distributed computing systems are often utilized. A distributed computing system includes two or more computers or other processors which frequently operate somewhat autonomously and communicate with each other over a network or other communication path.
A component of a distributed system that has the capability of sharing resources is often referred to as a cluster, which has two or more nodes, each node having a processor or at least a processor resource and, typically, a separate operating system. One example of a distributed computing system utilizing one or more clusters is the IBM System Storage TS7680 ProtecTIER Deduplication Gateway, which is a virtual tape library that appears to applications as one automated tape library. The distributed computing system of the TS7680 also usually includes several controllers which communicate with the clusters over a network.
In a large cluster environment, it is often desirable for a system administrator to be able to view significant events of various nodes of the system from a central location, often referred to as a central management node. This can often be difficult to do, however. Normally, a significant event is represented by a log entry in a particular log file on the node of the distributed system where the event occurred. If all log entries in all log files on all the nodes in a distributed system were sent to the central management node, the result could be too much network traffic and too much data on the central management node. Conversely, if individual log files are maintained only on the nodes, the administrator may need to access many nodes to view all the pertinent logs when trying to resolve a problem.
Moreover, in some distributed computing systems, not all nodes may be accessible to any one administrator. For example, some nodes may be accessible only by authorized service personnel provided by the vendor of the components of that node. Many distributed computing systems utilize proprietary hardware and software which may differ from one component to the next. Thus, one system administrator may be called in to collect logs from one type of component, such as controllers, whereas a different system administrator may be called in to collect logs from a different type of component, such as clusters.
Still further, the manner of obtaining log reports may differ depending upon the type of node on which the log report resides, thereby making the administrator's task of gathering the log reports from various nodes more difficult. For example, the nodes of a distributed computing system often utilize different operating systems, such as AIX, UNIX and Linux, which may employ different types of log reporting. A system administrator may be more skilled in one type of operating system than in another. Hence, a system administrator may be more adept at obtaining log reports from components having one particular operating system than from components having another.
Also, the information contained in a log report may differ as a function of the system used to obtain the report. For example, log reports available to a system administrator may differ from log reports available to service personnel such that none of the reports may be complete.
There may be additional limitations on the ability of a system administrator to obtain a log report. For example, a local system administrator may encounter difficulties in obtaining a log report generated by a remotely located component of the distributed computing system.
A log subsystem on operating systems such as UNIX and Linux, called syslog, has a forwarding mechanism that allows log entries of certain categories to be sent to a central location. However, if not all log entries are forwarded, some event entries of interest may be missed. Conversely, if all log entries are forwarded, entries of interest may be difficult to locate amongst all the other log data.
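The trade-off in category-based forwarding can be illustrated with a minimal sketch. It does not call syslog itself; it only simulates selection by facility and severity, using the standard syslog severity names and numbering. The `LogEntry` type, the `select_for_forwarding` function, the rule sets, and the sample entries are all hypothetical and introduced here purely for illustration.

```python
from dataclasses import dataclass

# Standard syslog severity levels: a lower number is more severe.
SEVERITY = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
            "warning": 4, "notice": 5, "info": 6, "debug": 7}


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical log record; not a real syslog data structure."""
    node: str
    facility: str
    severity: str
    message: str


def select_for_forwarding(entries, rules):
    """Return the entries that would be sent to the central location.

    `rules` maps a facility name to the least severe level still forwarded.
    Entries whose facility has no rule are kept on the local node only.
    """
    forwarded = []
    for entry in entries:
        threshold = rules.get(entry.facility)
        if threshold is not None and SEVERITY[entry.severity] <= SEVERITY[threshold]:
            forwarded.append(entry)
    return forwarded


# Made-up sample entries from two nodes.
entries = [
    LogEntry("node1", "kern", "crit", "disk controller failure"),
    LogEntry("node1", "daemon", "info", "replication job finished"),
    LogEntry("node2", "auth", "warning", "repeated login failures"),
    LogEntry("node2", "daemon", "debug", "heartbeat ok"),
]

# Narrow rule set: only kernel errors or worse are forwarded.
narrow_rules = {"kern": "err"}
# Broad rule set: everything from every listed facility is forwarded.
broad_rules = {facility: "debug" for facility in ("kern", "daemon", "auth")}

narrow = select_for_forwarding(entries, narrow_rules)
broad = select_for_forwarding(entries, broad_rules)
```

Under the narrow rules, the login-failure warning on node2 never reaches the central location, even though an administrator might consider it significant. Under the broad rules it is forwarded, but so are routine entries such as the heartbeat message, among which the significant entries must then be located.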