Customers are provided with numerous types of online services. However, maintaining a high level of reliability and uptime for these services is difficult. The services routinely experience unplanned downtime and interruptions that impact accessibility for customers. When such outages occur, on-call engineers need to be alerted.
Unfortunately, current alerting mechanisms are deficient. FIG. 1 illustrates a block diagram of example communications 100 associated with a conventional alerting mechanism. As illustrated, at step 110, the on-call engineer answers a telephone call. In response, at step 120, the alerting mechanism states “Hello this is Bob calling with an important message from ABC online services.” At step 130, the alerting mechanism prompts the on-call engineer to “please unlock the keypad when necessary.” At step 140, the alerting mechanism identifies information about the alert. Specifically, the alerting mechanism states “We have received an Alert ID: 94cd242a-92m5-2842-782482499a. The Alert was raised on Dec. 5, 2016. OnlineServicePingProbe Probe targeting w2.example01.expr01gw112 last failed at ‘12/5/2016 7:15:53 PM’ with result name OnlineServicePingProbProbe/example01/OSRR01DG112/OSRR01DF112-db116′.” At step 150, the alerting mechanism prompts the on-call engineer to “please enter 1# to acknowledge this message.” At step 160, the on-call engineer responds “1#.” However, an on-call engineer who wishes to learn more about the error impacting the service has no recourse but to exit the alerting session and manually perform service-specific investigation and remediation steps.
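For illustration only, the conventional session of FIG. 1 can be modeled as a fixed, linear script whose sole accepted input is the acknowledgment code. The following is a minimal sketch under stated assumptions: the class name `ConventionalAlertCall`, its fields, and the transcript format are hypothetical and are not taken from any actual alerting product. Only the step numbers and the spoken prompts come from FIG. 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ACK_CODE = "1#"  # the only keypad input the conventional session acts on


@dataclass
class ConventionalAlertCall:
    """Hypothetical model of the linear call flow of FIG. 1 (steps 110-160)."""

    alert_details: str
    # Each entry is (step number, speaker, utterance or action).
    transcript: List[Tuple[int, str, str]] = field(default_factory=list)

    def run(self, keypad_input: str) -> bool:
        log = self.transcript.append
        log((110, "engineer", "answers call"))
        log((120, "system", "Hello this is Bob calling with an important "
                            "message from ABC online services."))
        log((130, "system", "please unlock the keypad when necessary"))
        log((140, "system", self.alert_details))
        log((150, "system", "please enter 1# to acknowledge this message"))
        log((160, "engineer", keypad_input))
        # No branch exists for requesting details or remediation;
        # the session can only record whether the alert was acknowledged.
        return keypad_input == ACK_CODE


# Placeholder text stands in for the long alert message read at step 140.
call = ConventionalAlertCall(alert_details="<alert ID, date, probe, result name>")
acknowledged = call.run("1#")
```

In this sketch, `run` returns only a Boolean acknowledgment. Nothing in the flow lets the on-call engineer query the alert or trigger a remediation action, which mirrors the limitation of the conventional mechanism.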
As illustrated above, the existing alerting mechanism merely informs the on-call engineer about an outage. Unfortunately, the existing alerting mechanism generates excessively long notification messages that include voluminous sequences of numbers and letters, which frequently results in the on-call engineer summarily dismissing the notification before receiving the substance of the alert. While a significant portion of the content conveyed in the alert may be of some importance to the computing devices, it provides the on-call engineer with limited information regarding the outage.
After receiving the alert from the alerting mechanism, the on-call engineer must then diagnose and remediate the issue outside of the alerting workflow. Frequently, this requires the on-call engineer to decipher the alert and investigate details about the issue in order to identify the relevant computing device and potential remediation actions.