In complex systems, such as storage arrays and other distributed systems, there may be a set of events (e.g., fault or error conditions) that occur infrequently during normal operation. For example, race conditions may infrequently occur between multiple threads or processes. Such rare situations may take years of testing and quality assurance to identify, recreate, and validate a fix for. Moreover, isolating offending code may be ineffective to recreate/fix problems that only occur as a result of multi-process, multi-host, or multi-system interaction.
One approach for reaching infrequent error conditions is running storage systems for a long time until the rare scenario is eventually reached while introducing the external factors that lead to it (e.g., removing disks from a storage array).