With businesses increasingly reliant on IT, enterprise IT systems need to maintain high levels of availability and performance. To achieve this, the health of IT systems is continuously monitored. Abnormal behaviors of components, such as failures, anomalies, SLA violations, and outages, are detected and alerts are generated. These alerts are then analyzed by a team of service desk personnel or resolvers, and appropriate actions are taken to resolve the underlying issues.
The present approach to generating and analyzing alerts is highly manual, ad hoc, and intuition-driven. It is also reactive. Alerts are configured by observing a single component in isolation and lack a system-wide view. Such configurations are often incorrect, leading either to too many false alerts or to many legitimate problems being missed. Furthermore, enterprise IT systems keep evolving due to changes in business and infrastructure. Manual alert configurations fail to adapt to these changes and therefore become stale and often obsolete.
Managing batch systems is also challenging because of their inherent scale and complexity. A typical batch system consists of several business processes and batch jobs connected through complex interdependencies. Furthermore, outages and delays in batch jobs can lead to heavy financial losses. Hence, it is imperative to monitor batch systems correctly and to ensure that all potential anomalies are captured and reported in a timely manner. Herein, the terms "batch jobs" and "jobs" are used interchangeably throughout the description. In an example scenario, a batch system is configured to generate a variety of alerts. Some of the most common alerts are abnormally high job run times (MAXRUNALARM), abnormally low job run times (MINRUNALARM), delayed start of a job, delayed end of a job, job failures, and the like. The large scale and complexity of batch systems result in an increase in noisy and redundant alerts. This makes the problem of generating the right alerts at the right time highly relevant in today's batch systems.
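For illustration only, the alert types listed above can be sketched as simple threshold checks on a single job run. This is a minimal sketch under stated assumptions, not the configuration of any particular batch scheduler: the names `JobRun`, `AlertConfig`, and `generate_alerts`, the threshold fields, and the labels `JOBFAILURE`, `DELAYEDSTART`, and `DELAYEDEND` are hypothetical; only `MAXRUNALARM` and `MINRUNALARM` appear in the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRun:
    # One execution of a batch job; times are minutes from a common reference point.
    name: str
    start: float
    end: float
    failed: bool = False


@dataclass
class AlertConfig:
    # Hypothetical, manually configured per-job thresholds (illustrative values only).
    min_run: float       # run times below this raise MINRUNALARM
    max_run: float       # run times above this raise MAXRUNALARM
    latest_start: float  # starts after this count as delayed
    latest_end: float    # ends after this count as delayed


def generate_alerts(run: JobRun, cfg: AlertConfig) -> List[str]:
    """Return the alerts raised by one job run, checked in isolation."""
    alerts = []
    run_time = run.end - run.start
    if run.failed:
        alerts.append("JOBFAILURE")    # assumed label for a job failure
    if run_time > cfg.max_run:
        alerts.append("MAXRUNALARM")   # abnormally high run time
    if run_time < cfg.min_run:
        alerts.append("MINRUNALARM")   # abnormally low run time
    if run.start > cfg.latest_start:
        alerts.append("DELAYEDSTART")  # assumed label for a delayed start
    if run.end > cfg.latest_end:
        alerts.append("DELAYEDEND")    # assumed label for a delayed end
    return alerts
```

Because each check inspects one job against fixed thresholds, a single upstream delay may raise separate alerts for every dependent job, which is one way the noisy and redundant alerts described above can arise.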