Field of the Embodiments
The present invention relates generally to computer networks and, more specifically, to collaborative incident management for networked computing systems.
Description of the Related Art
Modern data centers and other computing environments can range in size from a few computing systems to thousands of computing systems. Each computing system in such a computing environment is generally configured to process data, service requests from remote clients, and perform other computational tasks. During operation, a particular computing environment may experience degraded performance, failure, or other issues with respect to one or more aspects of the computing environment, referred to herein as “incidents.” Incidents may result from problems internal to the computing environment, such as failure of one or more components, corruption in one or more database entries, or increased traffic on a communications network that connects one or more computing systems within the computing environment. Alternatively, or in addition, incidents may result from attacks by malicious users operating outside the computing environment. In one example, one or more application programs executing within the computing environment could exhibit increased response time. In another example, particular services, such as email communications, could fail entirely, thereby preventing users of the computing environment from transmitting or receiving email messages.
When an incident occurs, a system administrator or other responsible person, referred to herein as the “incident manager,” is alerted that an incident has occurred and that resolution of the incident is needed. The incident manager forms a response team by contacting the individuals needed to resolve the incident. The incident manager contacts these individuals via various means of communication, such as telephone calls, email communications, text messages, and pager messages. The incident manager opens various communications channels, data sources, and application programs that are relevant to resolving the incident, such as a chat message application program, an online meeting application program, and a software bug tracking application program. The incident manager may set up an email alias to enable the individuals working to resolve the incident to communicate with each other. Further, the incident manager sends one or more communications to key stakeholders, such as the company president and key managers, to advise the key stakeholders about the incident and what is being done to resolve the incident. As the incident evolves, the incident manager may add individuals to the response team as more issues are identified and may remove individuals from the response team as issues are resolved. The incident manager remains in contact with the individuals working on the incident to assign tasks, receive status updates, and coordinate response team efforts until the incident is resolved. The incident manager also remains in contact with the key stakeholders to provide summary status updates until the incident is resolved.
One potential drawback with the approach described above is that the incident manager and response team members need to continually monitor and update multiple communications channels, data sources, software application programs, and email messages in order to determine the current state of each task related to resolving the incident. If the incident manager cannot determine the status of a particular task, the incident manager may need to text or call the task owner to receive the latest status. Monitoring and updating multiple disparate communications channels leads to inefficient team communications and reduces the amount of time devoted to actually resolving the incident.
An additional potential drawback with the approach described above is that the incident manager generally needs to manually send status updates to key stakeholders on a regular basis until the incident is resolved. If the incident manager fails to provide regular status updates, then one or more key stakeholders may need to contact the incident manager to determine the current status of the incident. As a result, the incident manager generally needs to devote some amount of time to generating and transmitting status updates on a regular basis, thereby leading to additional inefficiencies and delay in resolving the incident.
As the foregoing illustrates, what is needed in the art are more effective ways for response teams to collaborate to resolve incidents in networked computing environments.