Outages of production IT services can cause substantial revenue or business loss for enterprises. Human error has been identified as one of the major factors behind system outages and network downtime, and repairing such mistakes has been found to be highly time-consuming. The characteristics of these mistakes are usually environment-specific, but the most common ones include software misconfiguration and improper deployment of new or upgraded software.
Even though the role of human error in system outages has been widely noted, system administrator activities are seldom properly logged and monitored. This aggravates the problem: the root cause usually goes undetected, and both the duration and the frequency of outages can increase as a result.
Tools have been developed that either track problem symptoms (e.g., a network port going down, the death of a critical process, or excessive resource usage) or track all system administrator activities (e.g., shell history). However, none of these tools correlates the two sets of information to determine what was caused by whom. Tools that track user activities, such as a shell history file, audit trails, and terminal typescripting using the “script” command, produce either too much information without any hint of a potential outage scenario (e.g., terminal typescripting, audit trails) or too little information (e.g., a shell history file) to be useful in a meaningful way.
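To make the missing correlation concrete, the following is a minimal Python sketch that pairs each problem symptom with the administrator actions recorded shortly before it. It is an illustration only, not a method described in this document: the function name `correlate`, the tuple layouts, and the ten-minute window are all assumptions, and closeness in time is merely a heuristic that suggests candidates rather than proving cause.

```python
from datetime import datetime, timedelta


def correlate(symptoms, actions, window=timedelta(minutes=10)):
    """Pair each symptom with admin actions seen within `window` before it.

    symptoms: list of (timestamp, description)     -- hypothetical layout
    actions:  list of (timestamp, user, command)   -- hypothetical layout
    """
    result = []
    for s_time, s_desc in symptoms:
        candidates = [
            (a_time, user, cmd)
            for a_time, user, cmd in actions
            # Keep only actions at or before the symptom, inside the window.
            if timedelta(0) <= s_time - a_time <= window
        ]
        result.append((s_desc, candidates))
    return result


# Hypothetical sample data, invented purely for illustration.
symptoms = [(datetime(2024, 1, 1, 10, 5), "network port down")]
actions = [
    (datetime(2024, 1, 1, 9, 0), "bob", "vi /etc/hosts"),
    (datetime(2024, 1, 1, 10, 0), "alice", "ifconfig eth0 down"),
    (datetime(2024, 1, 1, 10, 6), "carol", "ls"),
]
report = correlate(symptoms, actions)
```

With this sample data, only the action by "alice" falls within the ten minutes preceding the symptom; the earlier action by "bob" and the later one by "carol" are excluded.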
On the other hand, products that offer change management solutions can track configuration changes in the system by scanning the entire file system to create a baseline. Configuration changes are then identified by comparing the baseline with the report of the next system scan. One drawback of such solutions is that it is unclear how often a snapshot should be taken and compared with the baseline, since many configuration changes may have been applied between two snapshots. Hence, it is difficult to pinpoint the exact configuration change that might have led to an outage. Moreover, because these tools report only the configuration changes and not the user action or process that made them, a full diagnosis may not be possible.
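The baseline-and-compare approach, and its limitation, can be illustrated with a short Python sketch. This is a simplified stand-in under stated assumptions, not the implementation of any particular change management product: the helper names `scan`, `diff_snapshots`, and `demo` are hypothetical, and a snapshot is assumed to be a mapping from file path to a SHA-256 digest of the file contents.

```python
import hashlib
import os
import tempfile


def scan(root):
    """Return {relative_path: sha256_hex} for every regular file under root."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            snapshot[os.path.relpath(path, root)] = digest
    return snapshot


def diff_snapshots(baseline, current):
    """Report paths added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "modified": sorted(
            p for p in set(baseline) & set(current) if baseline[p] != current[p]
        ),
    }


def demo():
    """Apply two edits to one file between scans and diff the snapshots."""
    with tempfile.TemporaryDirectory() as root:
        conf = os.path.join(root, "app.conf")  # hypothetical config file
        with open(conf, "w") as fh:
            fh.write("port=80\n")
        baseline = scan(root)
        # Two separate configuration changes occur before the next scan.
        for value in ("port=8080\n", "port=9090\n"):
            with open(conf, "w") as fh:
                fh.write(value)
        return diff_snapshots(baseline, scan(root))
```

In `demo`, the file is changed twice between the two scans, yet the comparison reports it only once as modified. The intermediate change is lost, and nothing in the result identifies the user or process responsible, which mirrors the drawbacks described above.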