In some operating systems, “drivers” are software modules that can be inserted into the operating system kernel to support specific hardware, to extend the operating system, or both. Generally, drivers run in a fully trusted mode, so any failure in these components can cause machine services to fail or the entire system to crash. Thus, any successful effort to make drivers more resilient or fault tolerant usually increases system reliability and, consequently, customer satisfaction.
One of the barriers to greater driver resilience is that a driver typically has to respond to many “events” generated by the operating system, and responding may require the driver to initiate operations that can fail. For example, these events may be file handle creation, device insertion, power being turned off, statistics gathering, and so forth. Most of the time, the exact action that a driver should take in response to an internal failure is poorly defined. This is partly because the operating system is not always designed to handle every conceivable set of failures, partly because external documentation does not cover every situation, and partly because certain failures require a large amount of judgment on the part of the driver designer. Furthermore, drivers are often constructed internally as large “state machines” in which the response to an event depends largely on which events have occurred in the past. After a failure occurs, the driver often has to immediately turn around and handle new events, even though the failure probably implies that new events are likely to fail as well.
Currently, some operating systems offer no efficient way to cause a driver to reset its internal state. This implies that the driver's internal state machine must be designed to process failure events at every possible moment. This vastly increases the complexity of the driver, often doubling or tripling its size in terms of the amount of code necessary to support such failure events. This increase in size also implies an exponential growth in the amount of time necessary to test and debug the driver, as most of the code paths within the driver may never be executed unless there is some failure in the operating system, the hardware or the driver itself. Since failures are not the normal case, mistakes in error handling code paths often go undetected.
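As a rough illustration of why failure handling inflates a driver's state machine, the following minimal sketch models a toy driver in Python. It is not a real kernel interface: the names `ToyDriver`, `State`, `Event` and the `allocate` callback are hypothetical, and `allocate` merely stands in for any operation that can fail. Every event branch needs its own failure path, and events that arrive after a failure must still be answered.

```python
from enum import Enum, auto


class State(Enum):
    STOPPED = auto()
    STARTED = auto()
    FAILED = auto()


class Event(Enum):
    DEVICE_INSERTED = auto()
    HANDLE_CREATED = auto()
    POWER_OFF = auto()


class ToyDriver:
    """Hypothetical driver model for illustration; not a real kernel API."""

    def __init__(self, allocate):
        # `allocate` is an assumed callback standing in for any operation
        # that can fail (memory, registry handles, and so forth).
        self.allocate = allocate
        self.state = State.STOPPED
        self.open_handles = 0

    def handle(self, event):
        """Process one event; return True on success, False otherwise."""
        if self.state is State.FAILED:
            # Events keep arriving after a failure and must still be answered.
            return False
        if event is Event.DEVICE_INSERTED and self.state is State.STOPPED:
            if not self.allocate():
                self.state = State.FAILED  # failure path for this branch
                return False
            self.state = State.STARTED
            return True
        if event is Event.HANDLE_CREATED and self.state is State.STARTED:
            if not self.allocate():
                self.state = State.FAILED  # a second, separate failure path
                return False
            self.open_handles += 1
            return True
        if event is Event.POWER_OFF:
            self.open_handles = 0
            self.state = State.STOPPED
            return True
        return False  # event is not valid in the current state
```

Even with three events and three states, roughly half of the branches in this sketch exist only to deal with failure, and none of them runs unless `allocate` is made to fail on purpose.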
Driver designers may cope with failure events in one of several ways, including:
1) They ignore some or all errors. This leads to, for instance, machines that crash as soon as some application requests all available memory.
2) They attempt to request all of the resources that they could possibly need (e.g., memory, registry handles, and so forth) at start-up time, minimizing the number of possible failure paths at run time. This may lead to wasted memory and other frustrating situations, such as the need to reboot a machine to upgrade a driver.
3) They spend years “hardening” their drivers, handling every error path. Today, this is generally only performed by designers who are trying to run operating systems in data center environments, as they are typically the only ones who can afford the cost of such hardening.
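The second approach above can be sketched as a pool that reserves everything at start-up. This is a minimal sketch under stated assumptions: the class `PreallocatedPool`, its methods, and the buffer sizes are hypothetical and chosen only for illustration.

```python
class PreallocatedPool:
    """Hypothetical resource pool; names and sizes are illustrative only."""

    def __init__(self, count, size):
        # All buffers are reserved up front, so start-up is the only place
        # where an allocation can fail.
        self._free = [bytearray(size) for _ in range(count)]
        # Memory held for the lifetime of the pool, whether used or not.
        self.reserved_bytes = count * size

    def acquire(self):
        # Run-time path performs no allocation. The one remaining failure
        # case is pool exhaustion, signalled by returning None.
        return self._free.pop() if self._free else None

    def release(self, buf):
        # Return a buffer to the pool for reuse.
        self._free.append(buf)
```

The trade-off described in the list is visible here: `reserved_bytes` is held even when no buffer is in use, and the run-time failure paths are reduced rather than eliminated, since an exhausted pool still has to be handled by the caller.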