Performance metrics, also known as Key Performance Indicators (KPIs), are used to measure the health of computer systems. These performance metrics can include measurements such as CPU usage, memory usage, free space or response time. The performance metrics may be collected from a large number of devices within a computer system and then used to measure the overall health of the system.
Due to the complexity of modern computer systems, it may be necessary to monitor large numbers of performance metrics, ranging from relatively high-level metrics, such as transaction response time, throughput and availability, to low-level metrics, such as the amount of physical memory in use on each computer on a network, the amount of available disk space, or the number of threads executing on each processor of each computer. Metrics relating to the operation of database systems and application servers, physical hardware, network performance, etc., may all need to be monitored across networks that may include many computers (each executing numerous processes) so that problems can be detected (preferably before they arise).
It will therefore be appreciated that a computer system may have a seemingly endless range of performance metrics, and the performance metrics are not necessarily measured on a common scale or over a common range.
Due to the complexity and potential number of performance metrics involved, it can be useful to call attention only to metrics that indicate there may be abnormalities in system operation, so that an operator or the system does not become overwhelmed. Unfortunately, the sheer number of performance metrics, their complexity and their interrelations mean that existing analysis techniques are not adequate.
For example, an existing approach comprises a user manually specifying which performance metrics are related to a particular or important performance metric. This can be time-consuming, expensive and error-prone. Furthermore, with system or network topology frequently changing, such information can quickly become obsolete.
It is also known to employ the Granger causality algorithm to detect whether one performance metric has a causal influence over another, based on the values of the two performance metrics. This, however, may not capture all dependencies or relationships between performance metrics. By way of example, a first performance metric called System Health may be at a constant 100%, and a second performance metric called Network Traffic may fluctuate up and down. During a system anomaly, both of these values may become anomalous, but the Granger causality score would be low because, most of the time, Network Traffic did not have any impact on System Health.
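By way of illustration only, the pairwise test referred to above may be sketched as follows. The sketch assumes the common regression form of the Granger test, in which the past values of one metric are checked for whether they improve a linear prediction of another metric beyond what that metric's own past values achieve. The function name `granger_f_stat`, the `lags` parameter and the synthetic series are hypothetical and are not taken from any particular monitoring system.

```python
import numpy as np


def granger_f_stat(x, y, lags=2):
    """Return an F-statistic for the hypothesis that x Granger-causes y.

    Illustrative sketch: compares a model of y built from its own past
    values against one that also includes the past values of x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    target = y[lags:]

    # Each column holds the series delayed by k samples (k = 1..lags).
    y_lags = np.column_stack([y[lags - k:n - k] for k in range(1, lags + 1)])
    x_lags = np.column_stack([x[lags - k:n - k] for k in range(1, lags + 1)])
    ones = np.ones((n - lags, 1))

    restricted = np.hstack([ones, y_lags])            # past of y only
    unrestricted = np.hstack([ones, y_lags, x_lags])  # past of y and of x

    def rss(design):
        # Residual sum of squares of an ordinary least-squares fit.
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r = rss(restricted)
    rss_u = rss(unrestricted)
    dof = (n - lags) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Synthetic data (assumed for illustration): `follower` tracks `driver`
# with a one-sample delay, while `unrelated` is independent noise.
rng = np.random.default_rng(0)
driver = rng.normal(size=500)
follower = np.empty(500)
follower[0] = 0.0
follower[1:] = 0.9 * driver[:-1] + 0.1 * rng.normal(size=499)
unrelated = rng.normal(size=500)

f_causal = granger_f_stat(driver, follower)
f_unrelated = granger_f_stat(unrelated, follower)
print(f"driver -> follower:    F = {f_causal:.1f}")
print(f"unrelated -> follower: F = {f_unrelated:.1f}")
```

Because the statistic is computed over the whole history of both metrics, a relationship that appears only during a brief anomaly contributes little to it, which is consistent with the limitation noted above.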