The current state of technology remediation is that, when a computer process, computer hardware, or software breaks, people gather resources and execute failsafes and contingency plans to recover the broken technology (i.e., the broken computer process, computer hardware, or software). Workarounds and typical break-fix activities are the mainstays of technology remediation and make up the best practices for recovering technological services when something goes awry. The aim of these recovery plans is to address three metrics commonly used to indicate the efficacy of a technology remediation system: mean time to detect (MTTD), mean time to repair (MTTR), and mean time between failures (MTBF). An effective technology remediation system implements processes that reduce MTTD and MTTR while increasing MTBF.
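As an illustration of the three metrics named above, the following minimal sketch computes them from a list of incident records. The `Incident` record, its field names, and the sample timestamps are hypothetical and chosen only for illustration. The sketch also assumes one common convention: MTTR is measured from detection to resolution, and MTBF is measured as the operating time between the resolution of one failure and the onset of the next. Other conventions exist.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; times are hours since observation began."""
    occurred: float  # when the failure actually began
    detected: float  # when the failure was noticed
    resolved: float  # when service was restored


def mttd(incidents):
    """Mean time to detect: average delay between onset and detection."""
    return mean(i.detected - i.occurred for i in incidents)


def mttr(incidents):
    """Mean time to repair: average delay between detection and resolution
    (an assumed convention; some definitions start the clock at onset)."""
    return mean(i.resolved - i.detected for i in incidents)


def mtbf(incidents):
    """Mean time between failures: average operating time between the
    resolution of one incident and the onset of the next (assumed convention).
    Requires at least two incidents."""
    ordered = sorted(incidents, key=lambda i: i.occurred)
    gaps = [nxt.occurred - cur.resolved for cur, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


# Illustrative data only, not measurements from any real system.
incidents = [
    Incident(occurred=10.0, detected=10.5, resolved=12.0),
    Incident(occurred=50.0, detected=51.0, resolved=52.0),
    Incident(occurred=100.0, detected=100.5, resolved=103.0),
]

print(f"MTTD: {mttd(incidents):.2f} h")
print(f"MTTR: {mttr(incidents):.2f} h")
print(f"MTBF: {mtbf(incidents):.2f} h")
```

Under these assumptions, faster detection lowers MTTD, faster repair lowers MTTR, and longer stretches of uninterrupted operation raise MTBF, which matches the direction of improvement described above.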
Several commercial offerings, such as Zabbix, allow a computer system “break-fix” to be paired with a “Response.” These commercial offerings, however, tend to require specific break events to trigger a single response. The evolution of technology services (e.g., computer systems that implement services and applications) means that technological environments, technologies, and their frameworks are becoming increasingly complex. Moreover, the identification of any single “root causing break event” may be obscured by cloud-based services such as Amazon Web Services (AWS), Microsoft Azure, Oracle Cloud, Apache Hadoop, or Google Cloud Platform, by cross connections with physical hardware-based networks, and by the many development frameworks and different coding languages that make up even simple applications. Presently, determining the root-cause source of a technology problem is substantially an all-human, experience-driven process, and humans are slow providers of “production system support” and “incident triage.”
While multiple chaos testing systems are coming into the market, these systems typically inject only the outcome of a disruption, for example, a spike in CPU utilization to 100% or a cutoff of network traffic. While these testing systems provide interesting tests that highlight whether a system supposedly designed to be resilient is truly resilient, they are impractical representations of what happens with operational technology products and computing systems. For example, a drop in network traffic can be made to occur, but the “why” and the “how” of its occurrence, as produced by the presently available chaos testing products, are not realistic representations of an actual system.
It would be beneficial if a system or process were available that enabled network architecture optimization by identifying interdependencies and by utilizing scoring techniques to further identify the effects of system degradation and/or resiliency.