Logs are used to understand what went wrong in a computing system when errors or other fault conditions occur. Typical logging behavior includes writing a log message to a local text file immediately after an event occurs in a component (e.g., an application). In general, the location of the text file is specific to each instance of a component. That is, components are often from different vendors and thus store log message text files in different locations in a file system. Consider an environment such as an application cluster. A cluster is a grid of multiple components running among multiple systems that work collectively as a single computing system. Accordingly, each of the multiple components generates log messages that are logged in isolated files as configured by that component. Thus, log messages for different components are kept independent of other logging activity in a cluster system.
Furthermore, consider that some clusters run on only a few hosts with components organized into hierarchical relationships. If one component fails, then other dependent components on the same host also fail. Thus, finding a source of the failure includes identifying a failed host and manually reading just a few log files. However, as a cluster system is scaled out to run on hundreds of hosts, manually seeking and reading log files becomes more difficult. Additionally, components may no longer be strictly hierarchical and instead may communicate and share states with peers. Thus, when one component fails on one host, peer components on other hosts may also fail since distributed components may propagate failures across hosts. As a result, one failure may cause cascading failures throughout the cluster system.
Consequently, if a cluster operates across one hundred hosts, and uses eight components on each host, then there may be at least eight hundred log files that characterize what is occurring in the cluster system. Accordingly, a cause of a failure may not be obvious since relevant log messages are buried in the multiplicity of log files.
Typically, log messages in a cluster system are collected through centralized logging for all components in the cluster system. However, centralized logging encounters a performance bottleneck because all logging activity is sent to a single host, which is responsible for capturing all log activity across the cluster and writing the logs to a disk. This bottleneck is exacerbated when the cluster system is in a failure state and generates a large volume of logging activity.
As a result, errors in a cluster system caused by one faulty component can create significant difficulties. For example, administrators have to spend enormous amounts of time finding “Patient Zero” (i.e., the failing component that created the error state in the cluster). A postmortem typically requires administrators to manually inspect log files for all relevant components and correlate them by time. Thus, diagnosing problems is a complex and time-consuming task. Collecting and interpreting so many logs manually is also prone to errors, for example, when the wrong logs are collected or relevant information is missing.
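To make the manual correlation step concrete, the following is a minimal sketch of interleaving per-component logs by time. It is illustrative only: the line format (an ISO-8601 timestamp, a space, then the message), the function names `parse_lines` and `merge_logs`, and the component labels are assumptions, not part of any particular logging system. It also assumes each individual log is already in chronological order and that all hosts share a synchronized clock.

```python
import heapq
from datetime import datetime
from typing import Dict, Iterable, Iterator, List, Tuple

Entry = Tuple[datetime, str, str]  # (timestamp, source label, message)


def parse_lines(source: str, lines: Iterable[str]) -> Iterator[Entry]:
    """Parse lines of the assumed form '<ISO-8601 timestamp> <message>'."""
    for line in lines:
        stamp, _, message = line.rstrip("\n").partition(" ")
        yield datetime.fromisoformat(stamp), source, message


def merge_logs(logs: Dict[str, Iterable[str]]) -> List[Entry]:
    """Interleave several chronologically ordered logs into one timeline."""
    streams = [parse_lines(name, lines) for name, lines in logs.items()]
    # heapq.merge lazily merges already-sorted inputs; tuples compare by
    # timestamp first, then by source label, then by message.
    return list(heapq.merge(*streams))


if __name__ == "__main__":
    # Hypothetical sample data for two components on two hosts.
    sample = {
        "host1/db": [
            "2024-01-01T10:00:05 connection lost",
            "2024-01-01T10:00:09 shutting down",
        ],
        "host2/web": [
            "2024-01-01T10:00:01 request timeout",
            "2024-01-01T10:00:07 upstream unavailable",
        ],
    }
    for when, source, message in merge_logs(sample):
        print(when.isoformat(), source, message)
```

Even with such a merge, the earliest entry in the combined timeline is only a candidate for the originating failure; an administrator still has to locate and collect every relevant file first, which is the step that scales poorly.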
Additionally, cluster administrators typically file bug reports to vendors by bundling many logs together with very little accompanying information. The vendor is then tasked with finding “Patient Zero” with even less context about the failed cluster system.
The time spent gathering logs, communicating the logs to vendors as bug reports, and repeating the process when vendors ask for more or missing information makes the turn-around time for understanding the difficulty and offering a solution very slow. This difficulty is only growing worse with the advent of cloud technology. For example, clusters may run in a cloud environment, which is elastic and where hosts serving the cluster are transient. That is, clouds are dynamic environments where the number of hosts may change at any given time. Accordingly, cluster administrators also need to know what systems were running at any given point in time, which further complicates logging activities.
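As a rough illustration of the bookkeeping that transient hosts impose, the sketch below records when each host joined and left a cluster and answers which hosts were running at a given instant. The `HostLease` record, its field names, and the `hosts_running_at` function are hypothetical and introduced only for this example; they do not describe any specific cloud platform.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class HostLease:
    """Hypothetical record of one host's membership in the cluster."""
    host: str
    joined: datetime
    left: Optional[datetime] = None  # None means the host is still running


def hosts_running_at(leases: Iterable[HostLease], when: datetime) -> List[str]:
    """Return the sorted names of hosts whose lease covers the instant `when`."""
    return sorted(
        lease.host
        for lease in leases
        # A lease covers `when` if the host had joined and had not yet left.
        if lease.joined <= when and (lease.left is None or when < lease.left)
    )


if __name__ == "__main__":
    # Hypothetical membership history for three transient hosts.
    history = [
        HostLease("host-a", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),
        HostLease("host-b", datetime(2024, 1, 1, 10, 0)),
        HostLease("host-c", datetime(2024, 1, 1, 12, 0)),
    ]
    print(hosts_running_at(history, datetime(2024, 1, 1, 10, 30)))
```

Without a record of this kind, logs from a host that has already been released may be unavailable by the time a failure is investigated.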