The invention relates generally to monitoring complex systems in which either the performance of some subsystems is hidden from external monitoring or the transaction flows across these subsystems are unknown. More specifically, the invention relates to systems and methods that monitor and estimate complex system load and determine transaction capacities.
A complex system is defined as a large number of integrated subsystems or applications that are coupled together and used to solve complex business problems such as the provisioning of services. One system architecture approach to investigating the flow of the functions performed by a complex system is to collect performance data for those functions from a Workflow Manager (WFM), develop a response time bottleneck model, and use the data to estimate the model parameters. The fitted model can then be used to predict current and future capacity and response times for the functions. Once the problem functions have been identified, a root cause analysis may be performed on the underlying subsystems.
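As an illustration of the data preparation step described above, the following is a minimal sketch that reduces per-transaction timing records to (load, mean response time) points for each function. The record layout `(function, hour, response_seconds)`, the hourly bucketing, and the name `summarize` are assumptions made for this example; they are not taken from any particular WFM.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Reduce (function, hour, response_seconds) records to, per function,
    a list of (transactions_in_hour, mean_response_seconds) points.

    The record layout is hypothetical; a real WFM export may differ.
    """
    # Collect the response times observed for each function in each hour.
    buckets = defaultdict(list)
    for function, hour, seconds in records:
        buckets[(function, hour)].append(seconds)

    # Load is taken to be the number of transactions completed in the hour.
    points = defaultdict(list)
    for (function, hour), samples in sorted(buckets.items()):
        points[function].append((len(samples), mean(samples)))
    return dict(points)
```

Points of this form are the kind of input a response time model could be fitted to, one function at a time.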
For complex systems having many supporting subsystems and applications, it is common to monitor the hardware health of all the supporting subsystems, such as CPU consumption, memory usage, and other parameters. However, the WFM response times and subsystem software bottlenecks are not typically monitored.
Identifying poor subsystem performance from response times may be complicated by a focus on running the complex system efficiently. Running hardware components such as CPUs at high utilization may reduce the number of hardware components required, but to end users, running efficiently is not the same as running acceptably. Response times may exceed stated targets, so it is possible to have high efficiency but low acceptability. End users bear the indirect costs of poor performance, and the direct cost is increased staffing.
There is a need to balance the tradeoff between additional system costs and staffing costs, and a corresponding need to monitor response times for acceptability with respect to load. Both experience and analysis indicate that, for complex functions traversing many subsystems, the average response time curve may be thought of as flat, or constant, until the load reaches a specific level, at which point response times increase sharply. In such cases, there may be no early warning, such as a slow degradation in response times, to indicate a need to monitor and model response times.
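The flat-then-rising response time curve described above can be sketched as a simple piecewise-linear model. The example below is an illustration under stated assumptions, not the model of the invention: the linear form after the knee, the grid search over observed loads, and the names `hockey_stick` and `fit_knee` are all hypothetical choices made for this sketch.

```python
from statistics import mean


def hockey_stick(load, base, knee, slope):
    """Assumed curve: constant `base` up to `knee`, then rising linearly."""
    return base + slope * max(0.0, load - knee)


def fit_knee(loads, times):
    """Estimate (base, knee, slope) from observed (load, response time) pairs.

    Each observed load is tried as the knee. For a candidate knee, `base` is
    the mean response time at or below it, and `slope` is the least-squares
    slope of the excess response time above it. The candidate with the
    smallest squared error is returned.
    """
    best = None
    for knee in sorted(set(loads)):
        flat = [t for l, t in zip(loads, times) if l <= knee]
        base = mean(flat)
        rising = [(l - knee, t - base) for l, t in zip(loads, times) if l > knee]
        denominator = sum(x * x for x, _ in rising)
        slope = sum(x * y for x, y in rising) / denominator if denominator else 0.0
        error = sum(
            (t - hockey_stick(l, base, knee, slope)) ** 2
            for l, t in zip(loads, times)
        )
        if best is None or error < best[0]:
            best = (error, base, knee, slope)
    return best[1:]
```

With a fitted knee, the distance between the current load and the knee gives a rough measure of remaining capacity, which is the kind of early warning that a flat response time curve does not provide on its own.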
What is desired are systems and methods that analyze complex systems and identify potential problems before they become actual problems.