Computer systems have become increasingly complex, and the applications to which they are applied have become more varied and widespread. While once confined to commercial organizations, manufacturing companies, and financial institutions, computer systems are now found in most small businesses and households. Indeed, in the United States it is not uncommon for a household to have several computer systems and other computational devices. Companies are now as likely to use their computers to communicate with other business entities as they are to use them within their own organizations. Business-to-Business (B2B) and Business-to-Consumer (B2C) applications are commonplace, and the latest enterprise-level systems are designed to serve anywhere from millions to hundreds of millions of potential users.
The more complex a computer application is, the more likely it is to generate, utilize, and share huge amounts of data. The net result is that computer hardware, software, and data-storage offerings have grown to keep pace with technological needs. Today, a sophisticated enterprise-level computer system may include hundreds of processors running a variety of operating systems and application servers, with many network links to the outside world and a considerable amount of fault-tolerant disk storage (for example, a Redundant Array of Inexpensive Disks, or RAID).
However, while the increased use and complexity of computer systems has provided great benefits, these systems are not without challenges. Foremost among them is the fact that computer systems, even the most expensive and well-designed enterprise-class systems, can sometimes fail. These failures may be hardware-based, such as the failure of a disk drive or a memory chip. Failures can also be software-based: for example, a buggy application that hangs after running out of memory, or an entire computer that crashes because a faulty device driver mismanaged its in-memory data structures. In many instances, failures arise from a combination of hardware and software problems. A Gartner survey estimated that software failures cause approximately 40% of outages in large-scale, well-managed commercial systems, both for high-end transaction-processing servers and for systems in general. When software-induced failures and outages do occur, their effects are compounded by the fact that a large percentage of the software bugs that manifest themselves in production systems have no known fix available at the time of failure. According to one source (Wood: "Predicting Client/Server Availability", IEEE Computer, 28(4):41-48, 1995), such unknown-remedy bugs may account for as much as 80% of all software failures.
Given sufficient time, a software application can indeed mature and become more reliable and less failure-prone. This is how, for example, the U.S. public switched telephone network is able to provide its legendary high availability. It is estimated that only 14% of switched telephone network outages between 1992 and 1994 were caused by software failures, the third most common cause after human error (49%) and hardware failures (19%) (Kuhn: "Sources of Failure in the Public Switched Telephone Network", IEEE Computer, 30(4):31-36, April 1997). These statistics might suggest that thorough design review and extensive testing could single-handedly improve the dependability of software systems. However, this is rarely the case; indeed, there appears to be a significant limit to how free of bugs a software program can truly be. Researchers and engineers have improved programming languages, built powerful development and testing tools, designed metrics for estimating and predicting bug content, and assembled careful development and quality-assurance processes. In spite of all these advances, many deployed software applications are still far from perfect. According to a U.S. National Institute of Standards and Technology survey, it is estimated that two-thirds of the software bugs that manifest in deployed systems could not have been readily caught by better testing processes.