This invention relates generally to computer system reliability studies, and more particularly to the monitoring and reporting of failure characteristics of software components of a computer, including individual servers, in a computer network.
A modern computer system is typically a complicated combination of software and hardware that has many different components for performing various functions and supporting various features. The optimal performance of a computer system frequently requires continuous monitoring, and correction of the problems identified through such monitoring, to provide reliable operation. The need for reliability evaluation is present not only in operating an existing computer system but also in developing computer software and hardware products. For instance, during the development of an operating system, such as the “WINDOWS NT®” operating system by “MICROSOFT®” Corporation, various components of the operating system are constantly being tested by subjecting them to strenuous operating conditions and observing whether they can withstand the heavy usage without failure. Such a performance study, often termed “stress testing,” helps the software developers to identify the weak spots or defects in the components of the operating system and provides valuable information as to the causes of failure.
Many modern products are a combination of software and hardware. Testing such products is more difficult since the software should be tested against other software and hardware products, which may be unavailable if they are in development by third parties. Furthermore, the degree of reliability that can be assured for a software/hardware unit is sensitive to prior knowledge of the likely uses to which the software/hardware unit is put.
It is common experience that computer software and hardware frequently fail even after extensive stress testing by the manufacturer to ensure reliability. An important reason is that the set of all possible combinations of inputs, outputs and internal operations, in effect the universe of states of a computer or a network, is too large to be exhaustively tested. It is customary to stress test each component by overloading it, that is, by performing a particular task repetitively. Examples of such tasks for software include read and write operations, mathematical calculations and the like. The often unfulfilled expectation is that a system comprising different stress-tested components will continue to be reliable.
In the modern marketplace it is impractical for a single entity to supply all of the needs of customers. Thus, third-party products that inter-operate with a product to be tested are an unavoidable complication in the testing procedure. However, the cost of providing after sales service to make various combinations of products operational is not distributed amongst different manufacturers in proportion to their contribution to system failure. In particular, a supplier of an operating system is more likely to field support calls than the suppliers of software that uses the operating system because the average consumer is unlikely to accurately identify the true cause of a failure.
In addition, there is an underlying expectation that the operating system manufacturer should ensure some level of reliability. The substantial cost of testing and providing after sales service, including responding to complaints due to defective third party supplied software, has to be budgeted in the cost of manufacturing and marketing an operating system or related software. Consequently, it is not uncommon to encounter certification requirements placed by the operating system manufacturer for permitting claims of compatible products by other manufacturers.
Software developers of operating systems seek to include features and functions that they believe will make the corresponding hardware both more useful and easier to use, including the creation of additional software that uses the operating system. Not surprisingly, operating systems can be quite complex as they often include a variety of features. Examples of operating systems include the “WINDOWS CE®” operating system for hand held devices and the “WINDOWS 98” operating system.
The market for operating systems may be conveniently divided in accordance with the complexity required from the operating system and the underlying hardware. Use of commonly available hardware results in lower costs due to increased competition between hardware manufacturers. Thus, a desirable low cost operating system should allow use of widely available hardware to better compete in the relevant market. Dividing the market by the complexity provided in the operating system reduces the expected support costs for testing and after sales service, since simpler operating systems are likely to incur lower costs. Competitive concerns require that such cost savings be passed on to the consumer when possible. However, it should be noted that testing cheap or simple software or software/hardware hybrids is not intended to be a limitation.
One exemplary market segment is provided by small enterprises' need for servers. Typically, a small business cannot afford to hire system analysts or incur the costs charged for over-the-telephone or online troubleshooting or for backup servers. On the other hand, the server requirements for a small enterprise are quite modest as a rule, being limited to serving a small number of machines and to executing routine file sharing and printer sharing services combined with limited Internet access. At the same time, it is not desirable for a small enterprise manager to opt for a system configuration that results in dependence on a single manufacturer.
Pricing a general purpose operating system, then, requires inclusion of costs for testing and supporting a complex system, although many of the features may have little utility for a small enterprise. Furthermore, the presence of additional features inevitably compromises the product, since the reliability desired by a small enterprise, namely continuous operation measured in years, is difficult to provide in complex operating systems with a multitude of functions. These considerations apply equally well to other software products.
Thus, there is a need to supply such market segments with software products that are priced to reflect their actual cost of support and development while allowing the consumer extensive choices. In the case of software, it is often possible to manufacture products that more than meet the needs of a market segment by including extensive functionality. Furthermore, to the extent that cost savings are possible due to the nature of the market segment definition, competition requires that such savings be passed on to the consumer. Such savings can be realized by better testing regimes that reflect the actual likelihood of failure for the particular product, including product configurations with limited functionality.
Existing reporting tools for reporting the results of system performance studies, however, do not satisfactorily meet these testing needs. In addition, the almost complete product is tested via one or more β-releases. Such testing relies on experienced software users putting the product through its paces and reporting back results to the manufacturer. Consumer feedback is yet another source of data for improving the product in subsequent releases or piecemeal fixes. However, the latter is an expensive strategy, both for the manufacturer and the consumer, and it can earn the manufacturer the wrath of irate consumers. Thus, a better method and system are needed to test the reliability of software and hybrid software/hardware products that are designed to interoperate with third party products. Furthermore, it is desirable to accurately estimate the long term operation of the product to enable better pricing and marketing decisions.
In view of the foregoing, the present invention provides a uniform, easily extensible, reliability reporting framework that includes a plurality of reporting clients that concentrate on tracking and reporting reliability data. This framework provides testing that goes beyond traditional stress testing by estimating actual product reliability over a long period of time, and not necessarily just under heavy use. Furthermore, measuring the different kinds of failures expected over extensive periods of simulated operation allows more accurate pricing, more accurate estimates of support costs for interoperating third party products, and a better assessment of their compatibility.
The testing method and system include operating a software product of interest in different conditions chosen to reflect real world conditions. The frequency of operation is proportionally increased to compress the operations expected over a long period of time into a shorter and more reasonable testing period. This reduces the cost of testing while providing data that is meaningful in estimating prolonged operation of the product. Preferably, an accelerated life test (“ALT”) controller coordinates the testing.
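The proportional increase in operating frequency can be illustrated with a short calculation. The following is a minimal sketch only, not the controller described in this specification; the function name `accelerated_rate` and its parameters are hypothetical and chosen purely for illustration.

```python
def accelerated_rate(ops_per_day, field_days, test_days):
    """Return (accelerated operations per day, compression factor).

    Hypothetical helper: scales the field operating frequency so that the
    operations expected over `field_days` fit into `test_days`.
    """
    if ops_per_day < 0 or field_days <= 0 or test_days <= 0:
        raise ValueError("rates and durations must be positive")
    # Compression factor: how many field days each test day must represent.
    factor = field_days / test_days
    # Proportionally increased frequency of operation during the test.
    return ops_per_day * factor, factor


# Example: a task performed 100 times per day in the field, with one year
# of operation compressed into a five-day test.
rate, factor = accelerated_rate(ops_per_day=100, field_days=365, test_days=5)
```

Under these assumed figures, each test day stands in for 73 days of field operation, so the task would be run 7,300 times per test day.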
Furthermore, the testing includes random scheduling of tasks and sleep periods to better sample the state space. This results in superior identification of possible failures, including catastrophic failures, compared to mere stress testing since it samples failure states that are not detected by traditional stress testing or periodic testing. Finally, the use of pseudo-random numbers allows for testing uncorrelated activities while allowing easy reconstruction of a failure for debugging and improving the product being tested.
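The role of pseudo-random numbers in reconstructing a failure can be sketched as follows. This is an illustrative example under stated assumptions, not the scheduling logic of this specification: the function `build_schedule`, the task names, and the uniform sleep distribution are all hypothetical. The point shown is only that a recorded seed regenerates the same sequence of tasks and sleep periods.

```python
import random


def build_schedule(seed, tasks, steps, max_sleep_s=5.0):
    """Return a list of (task, sleep_seconds) pairs drawn pseudo-randomly.

    Hypothetical helper: a private generator seeded with `seed` makes the
    sequence repeatable, so a failing run can be replayed for debugging.
    """
    rng = random.Random(seed)  # isolated generator; global state untouched
    schedule = []
    for _ in range(steps):
        task = rng.choice(tasks)                 # which task to run next
        sleep_s = rng.uniform(0.0, max_sleep_s)  # idle period before it
        schedule.append((task, sleep_s))
    return schedule


# Assumed task names, for illustration only.
TASKS = ["file_read", "file_write", "print_job", "web_request"]

original = build_schedule(seed=1234, tasks=TASKS, steps=10)
replayed = build_schedule(seed=1234, tasks=TASKS, steps=10)
# `replayed` is identical to `original` because the seed is the same.
```

Logging the seed alongside each test run is therefore sufficient, in this sketch, to regenerate the exact sequence that preceded a failure.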
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments, which proceeds with reference to the accompanying figures.