The current state of technology remediation is that, when a computer process, computer hardware, or software breaks, people gather resources and execute fail-safes and contingency plans to recover the broken technology (i.e., the broken computer process, computer hardware, or software). Workarounds and typical break-fix activities are the mainstays of technology remediation and make up the best practices for recovering technological services when something goes awry. The aim of these recovery plans is to address three metrics commonly used to indicate the efficacy of a technology remediation system: mean time to detect (MTTD), mean time to repair (MTTR), and mean time between failures (MTBF). An effective technology remediation system implements processes that reduce MTTD and MTTR while increasing MTBF.
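As an illustration of the three metrics named above, the following minimal sketch computes them from a list of incident records. The `Incident` record, its field names, and the exact definitions used (MTTD as failure-to-detection time, MTTR as detection-to-repair time, MTBF as the uptime between one repair and the next failure) are assumptions made for this example; organizations define these intervals in slightly different ways.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; times are hours since an arbitrary epoch."""
    failed_at: float
    detected_at: float
    repaired_at: float


def mttd(incidents):
    """Mean time to detect: average delay between failure and detection."""
    return mean(i.detected_at - i.failed_at for i in incidents)


def mttr(incidents):
    """Mean time to repair: average delay between detection and repair (assumed definition)."""
    return mean(i.repaired_at - i.detected_at for i in incidents)


def mtbf(incidents):
    """Mean time between failures: average uptime from one repair to the next failure.

    Requires at least two incidents; statistics.mean raises StatisticsError otherwise.
    """
    ordered = sorted(incidents, key=lambda i: i.failed_at)
    gaps = [nxt.failed_at - prev.repaired_at for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)
```

Under these assumed definitions, a remediation process improves when the first two values fall and the third rises.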
Several commercial offerings, such as Zabbix, allow a computer system “break-fix” to be paired with a “Response.” These commercial offerings, however, tend to require specific break events to trigger a single response. The evolution of technology services (e.g., computer systems that implement services and applications) means that technological environments, technologies, and their frameworks are becoming increasingly complex. Moreover, the identification of any single “root causing break event” may be complicated by cloud-based services such as Amazon Web Services (AWS), Microsoft Azure, Oracle Cloud, Apache Hadoop, or Google Cloud Platform, by cross connections with physical hardware-based networks, and by the many development frameworks and different coding languages that make up even simple applications. Presently, determining the root-cause source of a technology problem is substantially an all-human, experience-driven process, and humans are slow providers of “production system support” and “incident triage.”
As a result, different types of production system support methodologies have been implemented to compensate for the human shortfalls. Across technology organizations, production system support functions use manually generated and manually maintained document libraries, called Runbooks, to identify a problem via integrated monitoring and to deploy a fix. These production system support functions are siloed to the specific applications that have such documentation.
For example, one production system support process may be termed “Fix on the go.” In a fix-on-the-go process, engineers rotate on a weekly or monthly basis to support issues 24 hours a day, 7 days a week. In response to detection of an application-specific issue, a support team member pages one of the engineers in the “on call” group. The on-call engineer accesses, via a graphical user interface, an incident bridge that lists issues, attempts to understand the issue, and implements a fix using an emergency change process. This is a slow, labor-intensive process and does not help reduce MTTR.
Another production support process utilizes document-based operational runbooks (where a “document” can be an actual document, an online file, a help system, or some other reference source) in which a development team or support team documents the steps to fix known or recurring issues. Document-based operational runbooks save some time but are not a significant improvement, because an engineer still needs to understand the procedure during an issue and carry out the documented steps. There is always a chance of human error in either understanding the procedure or implementing the steps. Related production support processes that automate the runbook (e.g., by keeping the scripts on a server or in a repository) offer slight improvement, but these processes still rely on a human to find the cause and trigger the fix from the corresponding runbook.
Some automated systems rely heavily on operator expertise to correctly identify the problem, determine its solution, and deploy that solution as quickly as possible. When expertise, understanding of the broken system, and/or the ability to execute the fix are lacking, the brokenness escalates throughout a computing system and begins to impact upstream and downstream systems as well. This chain of upstream and downstream systems is called “interdependency.”
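The escalation described above can be pictured as a walk over a dependency graph. The sketch below is illustrative only: the system names and the `DOWNSTREAM` map are hypothetical, and the breadth-first traversal is one simple way to list every system that could be affected downstream of a break. Upstream impact could be listed the same way by walking a reversed map.

```python
from collections import deque

# Hypothetical map: each system -> the systems that directly consume its output.
DOWNSTREAM = {
    "database": ["orders-api"],
    "orders-api": ["web-frontend", "billing"],
    "billing": ["reporting"],
    "web-frontend": [],
    "reporting": [],
}


def impacted_systems(broken, downstream):
    """Return every system reachable downstream of the broken one (breadth-first)."""
    seen = set()
    queue = deque([broken])
    while queue:
        current = queue.popleft()
        for nxt in downstream.get(current, []):
            # Skip systems already recorded and guard against cycles back to the origin.
            if nxt not in seen and nxt != broken:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For example, `impacted_systems("orders-api", DOWNSTREAM)` returns the set containing `web-frontend`, `billing`, and `reporting`, which shows how a single break can reach systems the on-call engineer may not be watching.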
Time is of the essence in nearly all remediation instances, but without proper resources, technology systems are subjected to lengthy, investigative triage. The fix is typically applied first within the silo of the impacted system, which places the interdependent systems at risk of ongoing impact and delays restoration of service. This siloed focus on a single break event complicates root-cause identification in the interdependent system chain and can lead to false positives, in which a single failure is fixed but the root cause remains unaddressed in a systemically broken set of technology services.
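The false-positive risk described above can be made concrete with a small sketch. It assumes a hypothetical `DEPENDS_ON` map and applies a deliberately simple heuristic that is not drawn from any particular product: among the systems currently failing, only those with no failing upstream dependency are treated as root-cause candidates.

```python
# Hypothetical map: each system -> the upstream systems it depends on.
DEPENDS_ON = {
    "web-frontend": ["orders-api"],
    "billing": ["orders-api"],
    "orders-api": ["database"],
    "database": [],
}


def root_cause_candidates(failing, depends_on):
    """Return failing systems whose upstream dependencies are all healthy.

    A failing system with a failing upstream dependency is assumed to be a
    symptom rather than a cause, so it is excluded from the candidates.
    """
    failing = set(failing)
    return {
        system
        for system in failing
        if not any(up in failing for up in depends_on.get(system, []))
    }
```

With `web-frontend`, `orders-api`, and `database` all failing, the only candidate is `database`. A fix applied solely within the `web-frontend` silo would leave that candidate unaddressed.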
The evolution of cloud-based services further complicates technology remediation, as a common solution is to continually create new instances (where an instance can be the cloud equivalent of any application or network service entity) and destroy the older instances of less robust services or applications “just in case,” before an issue arises.
Interdependent systems further complicate break-fix stability. As system complexity increases, the definitions of “working” and “broken” become blurred, as does the visibility of past break-fix events and how they correlate to present events. The interdependency of systems further reduces an engineer's ability to completely understand the effects that a fix to one system has on another system when the applied fix affects multiple processes of the other system.
It would be beneficial if a system or process were available that identified the optimal “fix” for a process break while accounting for interdependencies between systems and processes.