Network Function Virtualization (NFV) is widely considered one of the key enabling technologies for 5G networks. A main motivation behind NFV is to enable operators and service providers to provision and manage resources and services in an efficient and agile manner, with reduced capital expenditures (CAPEX) and operating expenditures (OPEX), shorter roll-out times for new services, and increased return on investment (ROI).
An NFV system consists of Virtualized Network Functions (VNFs) that are deployed on servers, which can be referred to as compute nodes, located inside a datacenter. A Cloud Management System (CMS) is an integral part of such an NFV Infrastructure (NFVI) and is responsible for the Management and Orchestration (MANO) of NFVI resources, such as compute nodes, CPUs, network resources, memory, storage, VNFs, etc. For effective MANO decisions, a CMS relies on a reliable and robust monitoring system that monitors the utilization of the NFVI resources and the VNF Key Performance Indicators (KPIs) and keeps the CMS updated by regularly providing monitored data and KPIs, e.g., percentage utilization of specific resource units and/or aggregate resource utilization values of all the VMs in a physical machine, load experienced by individual VMs, etc. The CMS regularly analyzes the monitored data and derives appropriate Lifecycle Management (LCM) decisions. A CMS may have an administrative domain that includes multiple physically separate datacenters. Under such circumstances, the CMS may manage and orchestrate services that span those datacenters. To do so, the CMS relies on monitored data received from each of the datacenters within the administrative domain.
As part of the MANO operations, a CMS will apply relevant LCM actions to the individual VNFs and/or their underlying resources in order to ensure their operational/functional integrity. LCM actions may include scaling in/out/up/down, migration/placement, update/upgrade, deletion, etc. of individual VNFs and/or their respective resources. For example, a VNF instance may be scaled whenever the load on the VNF increases beyond a specific threshold, or a VNF may be migrated to another Physical Machine (PM) (e.g., a server) whenever there are not enough resources available to satisfy the functional and/or operational scope of the VNF. Arriving at the correct LCM decisions is itself a considerable challenge owing to the variety of VNFs that need to be managed inside an NFVI. VNFs also vary in complexity: a more complex VNF may embody a complete system, for example a Virtualized Evolved Packet Core (vEPC) system that is formed of multiple VNF Components (VNFCs) interlinked over standard and proprietary Virtual Links (VLs). An example of such a complex VNF is illustrated in FIG. 1. In FIG. 1, a Virtual Mobility Management Entity (vMME) VNF and a Virtual Serving Gateway/Packet Data Network (PDN) gateway (vS/P-GW) VNF are each composed of different VNFCs interlinked over VLs. Each VNF is allocated a set of resources, and a CMS performs LCM. The LCM tasks include both Service Orchestration (SO) and Resource Orchestration (RO) of the VNFs and their respective underlying resources, both virtualized and physical.
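The two threshold-style examples above (scale on high load, migrate when the PM runs short of resources) can be sketched as a simple decision rule. This is only an illustrative sketch: the `VnfStatus` fields, the `decide_lcm_action` function, the 0.8 default threshold, and the choice to check for migration before scaling are all assumptions made for illustration and are not taken from any particular CMS.

```python
from dataclasses import dataclass


@dataclass
class VnfStatus:
    """Hypothetical snapshot of one VNF, as a monitoring system might report it."""
    name: str
    load: float             # fraction of the VNF's allocated capacity in use (0.0 to 1.0)
    extra_cpus_needed: int  # additional CPUs the VNF needs to stay within its operational scope
    host_free_cpus: int     # CPUs still unallocated on the hosting physical machine (PM)


def decide_lcm_action(status: VnfStatus, load_threshold: float = 0.8) -> str:
    """Return an illustrative LCM action: 'migrate', 'scale_out', or 'none'."""
    # Migrate when the hosting PM cannot supply the resources the VNF needs.
    # (Checking migration first is an assumption of this sketch.)
    if status.host_free_cpus < status.extra_cpus_needed:
        return "migrate"
    # Scale when the load exceeds the configured threshold.
    if status.load > load_threshold:
        return "scale_out"
    return "none"


if __name__ == "__main__":
    print(decide_lcm_action(VnfStatus("vMME", load=0.9, extra_cpus_needed=0, host_free_cpus=4)))
    print(decide_lcm_action(VnfStatus("vS/P-GW", load=0.5, extra_cpus_needed=8, host_free_cpus=2)))
```

A rule this simple looks only at the managed VNF and its host, which is exactly why it can produce decisions that disturb other VNFs or NSs.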
The complexity of the MANO operations performed by a CMS increases further because it must manage not only VNFs but also Network Services (NSs) that are formed by chaining relevant VNFs, e.g., firewalls, video optimizers, schedulers, virtualized EPCs, etc. If LCM decisions are not taken with care and deliberation, the LCM actions on one or more resources (e.g., infrastructure resources, VNFs, etc.) may have an inadvertent adverse impact on other resource elements that rely on the services of the managed resource element and/or share resources with it. For example, a migration decision on a VNF belonging to a particular active NS may not only have an adverse impact on the overall Quality of Service (QoS) and/or Quality of Experience (QoE) of the NS itself, but may also, through resource contention, inadvertently impact the QoS and/or QoE of other VNFs/NSs that share resources with the migrated VNF. The QoS and/or QoE degradation of one NS may also impact the QoS and/or QoE of other NSs that rely on the services offered by that particular NS. The CMS will then perform a second iteration of LCM actions to recover from this degraded service situation, and it is quite likely that the CMS will run multiple iterations before a stable and optimal situation is achieved. However, it is highly undesirable to run multiple iterations of LCM/orchestration decisions within a short span of time, as this results in continuous service interruption and thereby degrades the overall QoS/QoE. In other words, the CMS will have a poor Quality of Decision (QoD).
The QoD can be measured in terms of two mutually dependent criteria. First, the QoD can be measured in terms of how resource efficient the management action is. The resource efficiency can be measured in terms of whether both the long-term and short-term resource requirements of the managed VNF will be fulfilled in the selected compute node, and how non-intrusive the management action has been for other VNFs that are already provisioned in the selected compute node. A management action is non-intrusive to the extent that it does not affect the performance, in terms of resource availability, of other VNFs in the compute node(s) involved in the management action. Second, the QoD can be measured in terms of the number of times the management action has to be executed before the most suitable compute node to which to migrate/scale the managed VNF is determined.
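As a rough illustration of the two criteria, the sketch below checks whether a candidate compute node can cover a VNF's short-term and long-term demand out of its free capacity (so that resident VNFs are not affected), and counts how many management actions are executed before such a node is found. The text does not define a QoD formula; the function names, the single scalar "capacity" unit, and the sequential trial order are assumptions of this sketch.

```python
from typing import List, Optional, Sequence, Tuple


def placement_is_resource_efficient(
    capacity: float,
    resident_demand: Sequence[float],
    short_term: float,
    long_term: float,
) -> bool:
    """True if the node's free capacity covers both the short-term and the
    long-term demand of the managed VNF, so that no capacity has to be taken
    away from the VNFs already provisioned on the node."""
    free = capacity - sum(resident_demand)
    return free >= max(short_term, long_term)


def actions_until_suitable(
    candidates: List[Tuple[float, Sequence[float]]],
    short_term: float,
    long_term: float,
) -> Optional[int]:
    """Number of management actions executed until the first suitable node is
    reached when candidates are tried in order; None if no candidate fits."""
    for attempt, (capacity, resident_demand) in enumerate(candidates, start=1):
        if placement_is_resource_efficient(capacity, resident_demand, short_term, long_term):
            return attempt
    return None


if __name__ == "__main__":
    # Each candidate is (node capacity, demands of VNFs already on the node).
    nodes = [(16.0, [8.0, 6.0]), (32.0, [10.0, 4.0])]
    print(actions_until_suitable(nodes, short_term=4.0, long_term=6.0))
```

Under these assumptions, a lower attempt count together with a placement that passes the efficiency check would correspond to a better QoD.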
The QoD of the CMS in turn depends on both the quality and the quantity of the information that it receives from the monitoring system. The quality depends on the variety of KPIs that are reported to the CMS, while the quantity depends on the frequency of the KPI updates that the CMS retrieves. Information provided by a monitoring system may include a variety of KPIs, e.g., percentage utilization of specific resource units and/or aggregate resource utilization values of all the VNFs in a physical machine, load experienced by individual VNFs, other QoS parameters, etc. The CMS may then analyze the received data in order to determine the state of an NS and take appropriate LCM actions, for example, whenever it senses high load/utilization events. The volume of data that a CMS relies on to perform management and orchestration of services and resources can be massive. A datacenter may host hundreds of thousands of servers, and each server may host tens of hundreds of VMs and thus host and manage thousands of virtualized services. The data volume can increase further in the case of multiple datacenters.
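To give a feel for how the variety of KPIs and the update frequency drive the monitoring volume, the following back-of-the-envelope sketch multiplies the quantities out. All input figures (100,000 servers, 50 VMs per server, 10 KPIs per VM, 64 bytes per sample, one update every 10 seconds) are assumed values chosen only for illustration and are not taken from the text.

```python
def monitoring_rate_bytes_per_second(
    servers: int,
    vms_per_server: int,
    kpis_per_vm: int,
    bytes_per_sample: int,
    period_seconds: float,
) -> float:
    """Average monitoring traffic towards the CMS when every VM reports
    every KPI once per reporting period."""
    samples_per_period = servers * vms_per_server * kpis_per_vm
    return samples_per_period * bytes_per_sample / period_seconds


if __name__ == "__main__":
    # Assumed, illustrative figures only.
    rate = monitoring_rate_bytes_per_second(
        servers=100_000,
        vms_per_server=50,
        kpis_per_vm=10,
        bytes_per_sample=64,
        period_seconds=10.0,
    )
    print(f"{rate / 1e6:.0f} MB/s")  # 320 MB/s
```

The rate scales linearly with the number of KPIs (quality) and inversely with the reporting period (quantity), so improving either dimension of the monitoring information directly increases the load.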
As a result of the sheer volume of data that the CMS must monitor in order to perform MANO operations, a very high load is placed on the network resources over which such data is delivered to the CMS. Furthermore, processing such data places a very high processing load on the CMS. The CMS can also incur processing delays that make it slow to react to undesirable events, e.g., the resource requirements of a VNF exceeding the resources available to the VNF.
Prior art CMSs have used three main modes for acquiring data. The first is a periodic mode, in which monitored data is delivered periodically and the period and type of data are specified. The second is a pull mode, in which monitored data is provided only when it is solicited by the CMS. The third is a push mode, in which monitored data is sent only when a specific event is triggered, for example, when a CPU load or a network load on a VM exceeds a specific threshold.
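The three acquisition modes described above can be contrasted in a small sketch. The `MonitoringAgent` class, its method names, the tick-based notion of time, and the KPI names are hypothetical and chosen only for illustration; they do not correspond to the interface of any specific monitoring system.

```python
from typing import Callable, Dict, List, Optional

Sample = Dict[str, float]


class MonitoringAgent:
    """Hypothetical per-VM agent exposing the three data acquisition modes."""

    def __init__(self, read_kpis: Callable[[], Sample]) -> None:
        self._read_kpis = read_kpis

    def periodic(self, tick: int, period: int, kpi_names: List[str]) -> Optional[Sample]:
        """Periodic mode: deliver the specified KPIs once every `period` ticks."""
        if tick % period != 0:
            return None
        sample = self._read_kpis()
        return {name: sample[name] for name in kpi_names}

    def pull(self) -> Sample:
        """Pull mode: deliver data only when the CMS solicits it."""
        return self._read_kpis()

    def push(self, thresholds: Dict[str, float]) -> Optional[Sample]:
        """Push mode: deliver data only when a KPI exceeds its threshold."""
        sample = self._read_kpis()
        exceeded = {
            name: value
            for name, value in sample.items()
            if name in thresholds and value > thresholds[name]
        }
        return exceeded or None


if __name__ == "__main__":
    agent = MonitoringAgent(lambda: {"cpu_load": 0.9, "net_load": 0.2})
    print(agent.periodic(tick=10, period=5, kpi_names=["cpu_load"]))
    print(agent.pull())
    print(agent.push({"cpu_load": 0.8, "net_load": 0.8}))
```

In this sketch the periodic mode generates traffic whether or not anything has changed, the pull mode generates traffic only on request, and the push mode stays silent until a threshold is crossed.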
While these methods have been extensively explored in the literature, they present significant limitations. Periodic delivery of data is regarded as the standard approach for monitoring resource status. However, in very large datacenters it can considerably increase the burden and complexity of the monitoring process. Conversely, a pull mode can avoid this large overhead but needs a proper design in order to provide QoE/QoS guarantees. Finally, a push mode can be tuned to recover the system when it is close to an alert state, but it may prevent an optimal allocation/distribution of VNFs/VMs among the available servers.