This invention generally relates to systems of computer devices, and more specifically to adaptive monitoring of such systems.
Systems of computer devices are becoming increasingly heterogeneous and complex. At the same time, such systems are becoming increasingly service-driven, and users expect constant, reliable service availability from them.
For example, networks today may include a large variety of different access networks and core networks and may be required to offer many services simultaneously. In addition, these networks may need to exhibit much more dynamic behavior than in the past in order to adapt, substantially in real time, to end user needs for best quality of experience (QoE) and to operator needs for optimal resource management at reasonable operating expenditure (OPEX).
These factors complicate network management and raise the requirements and expectations placed on what network operators offer (user-centric, end-to-end, always-best connectivity). In particular, they call for network management systems that are complex, distributed and, to a large extent, adaptive to changes in the network. This, among other reasons, drives development towards policy-based network management, which deploys expert knowledge in the network regarding services, interactions between services, user preferences and strategic business views, so that the network can make decisions on how to manage these services in a dynamic, heterogeneous multi-service environment.
In any distributed self-managed network, for example one driven by policies, the devices of the network exhibit individual behavior in order to fulfill service and/or user requirements. This individual behavior affects the network as a whole. Therefore, it becomes crucial to be able to observe the behavior of the network for purposes such as forecasting and detection of undesired behavior and malfunctioning devices. To monitor the behavior of the network, the management system needs to monitor events relevant to the network as well as the status of the network.
In order to be useful, the management system may infer both what the network is doing and how it is doing it (events relevant to the network), as well as how this impacts the status of the network. Ideally, the management system may extrapolate what might happen in the network based on knowledge about what has happened in the network in the past. For this purpose, so-called Key Performance Indicators (KPIs) and Key Quality Indicators (KQIs) are used, which describe how network operators evaluate the efficiency and effectiveness of their use of existing network resources. These indicators can be based on a single performance parameter, such as the number of missed calls on a network device or in a network. The indicators can also be based on complex equations involving multiple network parameters.
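By way of illustration only, the difference between an indicator based on a single parameter and one based on several parameters can be sketched in Python as follows. The counter names, the success-ratio formula and the sample values are assumptions made for this example and do not correspond to any particular operator's KPI definitions.

```python
from dataclasses import dataclass


@dataclass
class DeviceCounters:
    """Hypothetical per-device counters; the field names are illustrative."""
    attempted_calls: int
    missed_calls: int
    dropped_calls: int


def missed_call_count(c: DeviceCounters) -> int:
    # Indicator based on a single performance parameter.
    return c.missed_calls


def call_success_ratio(c: DeviceCounters) -> float:
    # Indicator combining several parameters (an assumed formula).
    if c.attempted_calls == 0:
        # Assumed convention: no attempts means nothing has failed.
        return 1.0
    failed = c.missed_calls + c.dropped_calls
    return (c.attempted_calls - failed) / c.attempted_calls


# Made-up sample values for one device.
counters = DeviceCounters(attempted_calls=200, missed_calls=6, dropped_calls=4)
single_kpi = missed_call_count(counters)
composite_kpi = call_success_ratio(counters)
```

With these made-up counters, the single-parameter indicator is simply the missed-call count, while the composite indicator folds missed and dropped calls into one ratio of successful to attempted calls.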
Other types of systems of computers or computing devices, such as systems of hosts, networked devices, virtual machines, and other devices, may be monitored in order to achieve or fulfill service or user requirements, or to manage or improve the operation or efficiency of the system.
Monitoring large data-centers, for example, is critical for performance management and troubleshooting, and requires monitoring tens to hundreds of thousands of physical servers and network elements. With virtualization, where a physical machine can host one or more virtual machines that each need to be individually monitored, the monitoring requirements further increase to monitoring millions of elements over time. The rate of monitoring (i.e., the number of samples measured per unit time) is a critical factor in troubleshooting performance problems; however, there is a natural trade-off between the amount of monitoring and troubleshooting accuracy. The higher the rate of monitoring, the higher the accuracy, but also the higher the monitoring overhead. Thus, most monitoring systems seek to achieve a balance between monitoring accuracy and overhead.
In the case of data-center networks, troubleshooting performance problems is particularly hard. This is because the time-scale over which events happen can be on the order of milliseconds to a few seconds, while most monitoring systems measure performance at intervals of a few minutes on average. Thus, many events may not get captured; for example, short bursts of data flows can happen between virtual machines (VMs), which can cause packet losses on the internal network due to network congestion or a short-lived spike in CPU utilization by a VM. Such performance problems directly affect short-term application performance, but are hard to detect with coarse-grained monitoring. This creates a need for fine-grained monitoring of data-center elements, but the monitoring overhead can be prohibitively high.
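As a rough illustration of why coarse-grained monitoring can miss such short-lived events, the following Python sketch averages the same series of measurements over two window sizes. The utilization values, the three-second burst and the five-minute window are made-up numbers chosen for the example, and the helper name window_averages is hypothetical.

```python
def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


# Hypothetical link utilization in percent, one sample per second for 5 minutes.
utilization = [10.0] * 300
utilization[120:123] = [100.0, 100.0, 100.0]  # a 3-second burst

fine = window_averages(utilization, 1)      # per-second view
coarse = window_averages(utilization, 300)  # one value per 5 minutes

fine_peak = max(fine)      # the burst is fully visible at 100.0
coarse_value = coarse[0]   # about 10.9; the burst is averaged away
```

The per-second view needs 300 samples to expose the burst, whereas the single five-minute average stays close to the baseline. This is the accuracy versus overhead trade-off in miniature.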
Existing solutions range from lightweight techniques, such as collecting packet counters at the interfaces using SNMP and flow-level counters using tools like NetFlow, to detailed application-level logs and fine-grained packet logs. While the effectiveness of the former techniques depends on the time granularity of logging, the latter techniques are expensive to run continuously. Adaptive monitoring techniques that vary the monitoring rate over time also exist; however, these are mainly adapted from large-scale wired networks and are distributed techniques for adaptive monitoring of network elements using local information.
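For illustration, one generic way in which a monitoring rate might be varied using only local information is sketched below in Python. This is an assumed, simplified rule (halve the sampling interval when two consecutive readings differ by more than a threshold, otherwise double it) and is not taken from any specific existing technique. The function name next_interval and all default values are hypothetical.

```python
def next_interval(current, previous_value, latest_value,
                  threshold=10.0, min_interval=1.0, max_interval=300.0):
    """Return the next sampling interval in seconds.

    Uses only locally observed values: sample faster when the metric is
    changing quickly, and back off when it is stable.
    """
    if abs(latest_value - previous_value) > threshold:
        # Large local change: halve the interval, but not below the minimum.
        return max(min_interval, current / 2)
    # Stable reading: double the interval, but not above the maximum.
    return min(max_interval, current * 2)


# A reading that jumps from 10 to 50 shortens a 60-second interval to 30.
faster = next_interval(60, 10, 50)
# A nearly unchanged reading lengthens it to 120.
slower = next_interval(60, 10, 12)
```

Because the rule depends only on the element's own recent readings, each element can run it independently, which is what makes such schemes distributed.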