In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users. At the same time, the cost of computing resources has consistently declined, so that information which was too expensive to gather, store and process a few years ago is now economically feasible to manipulate via computer. The reduced cost of information processing drives increasing productivity in a snowballing effect, because product designs, manufacturing processes, resource scheduling, administrative chores, and many other tasks are made more efficient.
With respect to the widespread use of digital data technology, two observations may be made. First, digital data processing systems have become, and continue to become, increasingly complex. This complexity applies not only to individual digital devices but also to their interconnection: as is well known, digital devices are commonly connected to other digital devices in networks, so that a digital data processing system may be viewed as a single device or as a collection of devices communicating via one or more networks. Second, users, from schoolchildren to multi-national corporations, are increasingly dependent on the digital data processing systems they use.
Given users' dependence on data processing systems, there is a hope and expectation, which translates to a marketplace demand, for more reliable digital data systems. From the standpoint of the user, this demand is focused on the dependability of the system to perform some set of functions necessary to the user, i.e., to perform one or more services for the user. In general, as long as the service continues to be performed, the user is not greatly concerned about the details of irregularities occurring within the digital data system. Nor is the user greatly mollified by the news that the system is operating properly, if in fact the service is not being performed as expected.
Of course, one piece of the complex puzzle of reliability is the reliability of individual hardware components of a digital data system. A great deal of effort has been directed to the design of more reliable data processing hardware components and component assemblies, and it must be conceded that great progress has been made in this field. Additionally, effort has been directed to the detection of actual or impending failures of components, and the replacement or substitution of function thereof with minimal disruption to the operation of a larger data processing system of which the component is a part.
For any given data processing component, reliability can be further improved by redundancy, i.e., providing multiple components of the same type which perform the same function, and which are configured so that in the event any single component fails to perform its intended function, the remaining component or components can act in its place.
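As a rough illustration of why redundancy helps, the following sketch computes the probability that at least one of n identical components is working. It rests on a simplifying assumption not made in the text above, namely that components fail independently and each works with probability r. The function name and the sample values are hypothetical.

```python
def redundant_reliability(r: float, n: int) -> float:
    """Probability that at least one of n components works.

    Assumes (for illustration only) that the n components fail
    independently and that each works with probability r.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The group fails only if every one of the n components fails.
    return 1.0 - (1.0 - r) ** n


if __name__ == "__main__":
    # Hypothetical per-component reliability of 0.99.
    for n in (1, 2, 3):
        print(n, redundant_reliability(0.99, n))
```

Under these assumptions each added component shrinks the chance of total failure by a further factor of (1 - r). Real components may share failure causes, in which case the benefit is smaller than this sketch suggests.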
However, it is difficult to ascertain and guarantee the reliability of large and complex data processing systems or networks of systems to perform some service which a user may expect. Although the reliability of some individual hardware components may be known or assured, the very complexity of the system may make it difficult to identify the weakest link in the set of components needed to provide the service. Furthermore, while hardware components have greatly improved and redundancy may provide even further hardware reliability, the service will often be dependent on critical paths in software which is common to all computer processors or systems providing the service. Defects in the software are notoriously difficult to predict, and merely providing redundant hardware components will not necessarily prevent a service interruption caused by such a defect.
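The effect of a common software dependency can be sketched numerically. The example below is a simplified model under stated assumptions, not a method described in the text: it treats the service as requiring both a group of n redundant hardware units (each working independently with probability hw_r) and a single shared software path (working with probability sw_r, independent of the hardware). All names and values are hypothetical.

```python
def service_reliability(hw_r: float, n: int, sw_r: float) -> float:
    """Probability that the service is available in a toy model.

    Assumptions (illustrative only): the service needs at least one of
    n independent hardware units AND one software path shared by all
    of them; software and hardware failures are independent.
    """
    for p in (hw_r, sw_r):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    hardware_group = 1.0 - (1.0 - hw_r) ** n  # at least one unit works
    return sw_r * hardware_group              # shared software in series


if __name__ == "__main__":
    # Hypothetical values: hardware 0.99 per unit, shared software 0.995.
    for n in (1, 2, 4, 8):
        print(n, service_reliability(0.99, n, 0.995))
```

In this model the result can never exceed sw_r, however many hardware units are added, which is one way to see why redundant hardware alone does not remove the risk posed by a defect in common software.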
A need exists, not necessarily recognized, for improved methods and systems for evaluating risk of service degradation where a service is provided by data processing resources, and particularly by a complex set of hardware and software resources coupled by one or more networks.