As the value and use of information continue to increase, individuals and businesses seek additional ways to process and store information. One option is an Information Handling System (IHS). An IHS generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes. Because technology and information handling needs and requirements may vary between different applications, IHSs may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in IHSs allow for IHSs to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, global communications, etc. In addition, IHSs may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
Generally speaking, an IHS performs operations by executing software code. Software is written by a developer and often includes a portion dedicated to handling error cases that occur during execution. In that regard, the inventors hereof have determined that, in most software products, the portion of the code that handles errors is itself the least tested.
When error paths are not rigorously tested (or are not tested at all), the result is uncertainty for customers and a stream of field issues. Moreover, increasing product complexity means that software applications have also become more complex. For example, what used to be one running process that controlled most hardware functionality has now exploded into many independent processes, communicating with one another over various Inter-Process Communication (IPC) channels.
The inventors have recognized that many error paths need to be tested for, and recovered from, in environments that involve a number of independent processes. These include, for example: testing recovery when one process of a group of processes hangs or crashes, determining what happens if an IPC channel drops or delays messages, determining how the embedded application handles Operating System (OS) errors (e.g., running out of memory or persistent storage), etc. Accordingly, the inventors have identified a need for systems and methods to provide automated system-level failure and recovery that ensure, among other things, that any configurable selected set of failure conditions is tested.