Software engineering has seen recent advances in large-scale, distributed software systems. Even when every individual service in a distributed system is functioning properly, the interactions between those services can cause unpredictable outcomes. These unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make such distributed systems inherently chaotic. Software developers therefore seek to identify the weaknesses of a distributed system before they manifest as system-wide, aberrant behaviors that degrade the performance of the system. Systemic weaknesses can include improper fallback settings when a service is unavailable, retry storms caused by improperly tuned timeouts, outages when an upstream or downstream service receives too much traffic, and cascading failures that originate from a single point of failure. Chaos engineering is the discipline of analyzing and improving the reliability of such distributed systems by deliberately introducing “chaos” and observing the behavior of the system during controlled experiments. The chaos can represent real-world events such as severed network connections, dropped packets, crashed servers, hard drive malfunctions, and spikes in network traffic, to name a few examples.
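As a minimal sketch of what a controlled experiment of this kind might look like, the following Python example simulates a single service call, injects failures at a configurable rate, and records how often the caller is served, how often it falls back, and how many calls are made in total once retries are counted. All names here (`flaky_service`, `call_with_retries`, `run_experiment`, `failure_rate`, `max_retries`) are hypothetical and chosen for illustration; a real experiment would inject faults into a running system, not into an in-process function.

```python
import random
from dataclasses import dataclass


class InjectedFault(Exception):
    """Stands in for a real failure such as a dropped packet (illustrative)."""


@dataclass
class ExperimentResult:
    requests: int   # logical requests issued by the client
    served: int     # requests answered by the service
    fallbacks: int  # requests answered by the fallback path
    attempts: int   # total calls made, including retries


def flaky_service(rng: random.Random, failure_rate: float) -> str:
    # Chaos hook (assumption): fail a configurable fraction of calls.
    if rng.random() < failure_rate:
        raise InjectedFault("simulated dropped connection")
    return "ok"


def call_with_retries(rng: random.Random, failure_rate: float, max_retries: int):
    """Return (response, attempts); use the fallback once retries are exhausted."""
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        try:
            return flaky_service(rng, failure_rate), attempts
        except InjectedFault:
            continue
    return "fallback", attempts


def run_experiment(requests=1000, failure_rate=0.3, max_retries=2, seed=0):
    # A fixed seed keeps the experiment repeatable between runs.
    rng = random.Random(seed)
    result = ExperimentResult(requests, 0, 0, 0)
    for _ in range(requests):
        response, attempts = call_with_retries(rng, failure_rate, max_retries)
        result.attempts += attempts
        if response == "ok":
            result.served += 1
        else:
            result.fallbacks += 1
    return result


if __name__ == "__main__":
    print(run_experiment())
```

Comparing `attempts` with `requests` gives a rough view of retry amplification: when every call fails, each request costs `max_retries + 1` calls, which is the pattern that, at scale, can turn into a retry storm.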