The present invention relates in general to the field of computers and other data processing systems including hardware, software and processes. More specifically, the present invention relates to a method and system for performing load tests on data processing systems.
The use of computers and the networks that support them has grown substantially in recent years, creating the need for larger, more resilient hardware and software systems to accommodate increased numbers of users and volumes of information. One approach to handling increased user loads and processing volumes is to spread users across a system composed of a number of subsystems. Meeting design goals for overall system response time, availability and reliability requires testing these subsystems at the same load levels they would be subjected to in their operational environment. A common load testing approach is to create a realistically large number of virtual users whose behavior mimics that of human users. These virtual users then enact predetermined test cases and procedures that mirror the interaction their human counterparts will have with the system once it is placed in operation.
In general, load testing approaches include establishing a number of operational profiles that target a number of subsystems with ‘n’ virtual users and ‘p’ other parameters. Typically, these operational profiles are applied gradually and uniformly against the target subsystems until full load levels have been reached. If the system fails before full load levels are reached, corrections are made and the load test is run again, repeating the process until the system operates as desired. During the load test, properly functioning subsystems may absorb the share of the workload belonging to one or more degraded subsystems, masking the sub-optimal performance of those subsystems and unnecessarily extending the time it takes for a degraded subsystem to eventually fail. When this happens, not only is time lost before the next test run can be made, but insufficient test data is produced, making it more difficult to determine and resolve the cause of the subsystem's failure.
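The gradual, uniform application of load described above can be illustrated with a minimal sketch. The function name `ramp_schedule`, its parameters, and the fixed number of equal ramp steps are hypothetical and chosen only for illustration; actual load testing tools may ramp load differently.

```python
def ramp_schedule(total_users, num_subsystems, steps):
    """Return one allocation per ramp step.

    Each allocation lists how many virtual users target each subsystem.
    Assumption: load grows in equal increments and is split as evenly
    as possible across subsystems (any remainder goes to the first ones).
    """
    if total_users < 0 or num_subsystems <= 0 or steps <= 0:
        raise ValueError("total_users must be >= 0; num_subsystems and steps must be > 0")

    schedule = []
    for step in range(1, steps + 1):
        # Number of virtual users active at this step of the ramp.
        active = total_users * step // steps
        base, extra = divmod(active, num_subsystems)
        schedule.append([base + (1 if i < extra else 0) for i in range(num_subsystems)])
    return schedule


if __name__ == "__main__":
    for step, allocation in enumerate(ramp_schedule(100, 4, 4), start=1):
        print(f"step {step}: {allocation}")
```

In this sketch the final step always carries the full load, with every subsystem receiving an equal share when the user count divides evenly.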
Identifying which subsystems are performing properly and which are not can be time-consuming, since a load test can continue for days before a degraded subsystem fails visibly enough to be identified as a problem. For example, a long-term reliability test may be scheduled to run under load for a predetermined time, e.g., seven days. During the test run, some or all of the virtual users implemented for the test run may terminate due to a subsystem's gradual failure, which is masked because healthy subsystems are absorbing its share of the workload. Since the virtual users have terminated and their associated operational profiles and code paths are very long, the tester can gain only partial visibility into the cause of the subsystem failure. If, however, the failing subsystem had been required to continue carrying its share of the workload rather than having it offloaded, it would have failed sooner and more relevant diagnostic information would have been available for determining the cause of the failure.
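The masking effect can be illustrated with a toy simulation. Everything in this sketch is an assumption made for illustration: the function name `hours_until_failure`, the linear hourly decay of the degraded subsystem's capacity, and the rule that a failure becomes visible only when healthy subsystems can no longer absorb the shortfall or the degraded subsystem's capacity reaches zero. It is not a model of any particular system.

```python
def hours_until_failure(share, healthy_capacity, degraded_capacity,
                        decay_per_hour, healthy_count, offload,
                        max_hours=168):
    """Return the hour at which the degraded subsystem visibly fails.

    Returns None if no failure is visible within max_hours
    (168 hours corresponds to a seven-day run).
    """
    capacity = degraded_capacity
    for hour in range(1, max_hours + 1):
        # Hypothetical linear degradation of the ailing subsystem.
        capacity = max(0, capacity - decay_per_hour)
        shortfall = max(0, share - capacity)
        if shortfall == 0:
            continue  # subsystem still carries its full share
        if offload:
            # Healthy subsystems absorb the shortfall with their spare capacity.
            spare = healthy_count * (healthy_capacity - share)
            if shortfall <= spare and capacity > 0:
                continue  # degradation is masked this hour
        return hour
    return None


if __name__ == "__main__":
    common = dict(share=100, healthy_capacity=150, degraded_capacity=120,
                  decay_per_hour=1, healthy_count=3)
    print("with offloading:   ", hours_until_failure(offload=True, **common))
    print("without offloading:", hours_until_failure(offload=False, **common))
```

With these hypothetical parameters, the subsystem that must keep its own share fails visibly at hour 21, while the same degradation stays masked until hour 120 when healthy subsystems absorb the shortfall.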
In many performance testing procedures, testers rectify performance and reliability problems as they are identified and then re-execute the test run to expose the next problem. This incremental approach can be time-consuming and expensive. If an ailing subsystem does not fail until the 48th, 72nd or 96th hour of a long-term test run, the problem is compounded, because significant time is added to each test run interval.