A distributed data processing system typically includes a combination of hardware and software resources. The hardware resources may include a processor, a data storage unit, an input/output device, a network router, a network link, and so on. The software or ‘logical’ resources may include any computer program or program component, or a service provided by a hardware or software resource.
Monitoring of distributed systems is necessary for many purposes, including resource management, workload management (including load balancing and admission control), management of Quality of Service (QoS) and Service Level Agreements (SLAs), metering and accounting of system usage, fault detection and recovery, and consistency management.
Monitoring of a distributed system typically comprises three steps: measurement of metrics and/or determination of the current state of a resource, collection of this data, and reporting of the collected data, either as it is or in some processed form, to appropriate consumers. Based on measurement techniques, two different types of monitoring metrics can be differentiated: externally measurable metrics and internally measurable metrics. Certain types of parameters can be measured by measurement components external to the system, whereas resource-specific internal parameters can only be measured internally or, in some cases, by the underlying computing layer such as an operating system.
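The three steps described above can be illustrated with a minimal sketch. It is not taken from any particular monitoring system: the names `Sample`, `measure`, `collect` and `report`, the use of plain callables as measurement probes, and the choice of a mean as the "processed" form are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One reading of one metric (hypothetical record type)."""
    metric: str
    value: float


def measure(probes: Dict[str, Callable[[], float]]) -> List[Sample]:
    """Step 1: take one reading from every probe.

    Each probe is an assumed zero-argument callable standing in for an
    internal or external measurement component.
    """
    return [Sample(name, probe()) for name, probe in probes.items()]


def collect(store: Dict[str, List[float]], samples: List[Sample]) -> None:
    """Step 2: append each reading to a per-metric history."""
    for sample in samples:
        store.setdefault(sample.metric, []).append(sample.value)


def report(store: Dict[str, List[float]], processed: bool = True) -> Dict[str, object]:
    """Step 3: hand data to a consumer, either raw or reduced to a mean."""
    if processed:
        return {name: mean(values) for name, values in store.items()}
    return {name: list(values) for name, values in store.items()}
```

A caller would invoke `measure` and `collect` once per measurement period and call `report` whenever a consumer requests data. Which reduction is appropriate in the processed case depends on the consumer.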
Externally measurable parameters are generally used for determining the state of resources such as their availability, measuring performance such as throughput or response time, measuring usage of external resources such as network bandwidth, and for evaluation of QoS parameters. Internally measured parameters are used for determining resource utilization such as the number of threads used from a total number of available threads, identification of faults, and determination of resource usage at a given granularity level (per customer, request or process). The faults detected by internal measurement/monitoring may not be directly visible from the behaviour of the resource or system or from the values of external parameters. However, such faults may lead to reduced performance without a complete resource or system failure.
Factors such as the granularity of measurement and the period between measurements are associated with each metric. The granularity of measurement may be per node, per container (containing one or more resource instances), per instance of the resource, per customer, or per request. The interval between periodic measurements of a parameter can be uniform along a time axis or non-uniform. The type of metric and factors such as granularity and period may determine where and how a metric should be measured: by the resource internally, by a separate computing layer, or by an external measurement entity. Collection and reporting of monitoring data may depend on the granularity and period of measurement. Monitoring entities may process the collected data to generate monitoring data in the form required by the consumers.
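One way to picture how granularity and period attach to a metric is the sketch below. The `Granularity` enumeration mirrors the levels listed above. Everything else is a hypothetical modelling choice rather than an established interface, including `MetricSpec`, the representation of a period as a tuple of time offsets in seconds, and the assumption that each reading is a mapping keyed by granularity level.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Iterable, Mapping, Tuple


class Granularity(Enum):
    NODE = "node"
    CONTAINER = "container"
    INSTANCE = "instance"
    CUSTOMER = "customer"
    REQUEST = "request"


@dataclass(frozen=True)
class MetricSpec:
    """A metric together with its granularity and measurement schedule."""
    name: str
    granularity: Granularity
    # Offsets in seconds from the start of monitoring. Evenly spaced
    # offsets give a uniform period; any other spacing is non-uniform.
    schedule: Tuple[float, ...]


def uniform_schedule(period: float, count: int) -> Tuple[float, ...]:
    """Build `count` measurement times spaced `period` seconds apart."""
    return tuple(i * period for i in range(count))


def aggregate(spec: MetricSpec, records: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Sum readings grouped by the key that the spec's granularity names.

    Each record is assumed to carry a "value" entry plus one entry per
    granularity level, for example {"node": "n1", "customer": "a", "value": 2}.
    """
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        key = str(record[spec.granularity.value])
        totals[key] += float(record["value"])
    return dict(totals)
```

The same raw readings can then be reported per customer to one consumer and per node to another simply by changing the `granularity` field of the specification.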
There is a need for systems and methods that enable monitoring of both internally and externally measurable parameters. For example, there is a need for autonomic systems which can measure internal parameters for self-diagnosis and self-healing. In some cases self-diagnosis or self-healing may be impossible, and so there is also a need to support reporting of such parameters to external managers.
In some systems, internal parameters may be essential for metering and accounting of resource usage. Therefore, monitoring of such metrics is important for grid computing and autonomic computing, in addition to other computing paradigms that perform accounting functions based on resource usage. Apart from metering and accounting, internal parameters are very useful in optimizing QoS objective functions, in resource management, in workload management, in studying system behaviours, and in correlating internal resource usage with externally measurable parameter values. For example, in order to reduce the response time experienced by a customer, the resource manager might have to increase the number of threads available to a component. The manager can make this decision only if it knows the internal load of the component in terms of thread usage.
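The thread example can be made concrete with a small sketch of the decision a resource manager might take. The function name `recommend_pool_size`, the 90% utilization threshold, the step size and the upper bound are all illustrative assumptions, not values prescribed here. The point is that the decision combines an externally measurable parameter (response time) with an internally measured one (threads in use).

```python
def recommend_pool_size(
    threads_in_use: int,
    pool_size: int,
    response_time_ms: float,
    target_ms: float,
    high_utilization: float = 0.9,  # assumed threshold
    step: int = 4,                  # assumed growth increment
    max_pool_size: int = 64,        # assumed upper bound
) -> int:
    """Suggest a thread pool size for a component.

    The pool grows only when the externally measured response time misses
    its target AND the internally measured thread utilization is high.
    Without the internal metric, the manager could not tell whether more
    threads would help.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    utilization = threads_in_use / pool_size
    if response_time_ms > target_ms and utilization >= high_utilization:
        return min(pool_size + step, max_pool_size)
    return pool_size
```

If response time is poor but thread utilization is low, the function leaves the pool unchanged, since the bottleneck is presumably elsewhere.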
Similarly, there is a need for measurement and reporting of internal parameters at the desired granularity level and desired period between measurements, in order to measure the resource usage of a component and to account for and bill the customer for the usage, to derive usage statistics, and to deliver such usage statistics to resource managers and SLA or QoS managers.
Many existing systems do not have sufficient flexibility to enable monitoring of service-dependent and internal metrics at granularities and periods according to the requirements of different consumers.