The present disclosure generally relates to managing system level activities in a multi-tenant cloud environment, and more specifically to techniques for protecting system level updates based on virtual machine (VM) priority in a multi-tenant cloud environment.
In many virtualized cloud environments, physical systems are generally consolidated and shared to provide virtual machines to multiple tenants (e.g., individuals, organizations, companies, or other enterprises). These cloud environments can be administered and/or managed by system management software, such as a hardware management console (HMC), or by other cloud manager software. In addition, multiple administrators may be responsible for maintaining such cloud environments. Having multiple administrators, however, can create complex challenges for managing such cloud environments. For example, with multiple administrators, it can be difficult to manage the cloud environments without impacting the virtual machines while still providing the level of service and performance expected by the cloud tenants.
With the rapid growth in system technology, many servers today are equipped with multiple sockets (nodes) and non-uniform memory access (NUMA) type architectures. While this is advantageous for large scale clustered applications and cloud environments, such architectures can introduce performance challenges. For example, if VMs are not properly bound to resources with the right affinity (e.g., a VM's processors and memory reside on different nodes), overall system performance can suffer. Administrators, therefore, typically perform periodic platform optimization across the system to maintain resource affinity and obtain the performance benefits associated with NUMA architectures. Administrators can also use several other techniques to ensure that the proper level of service is provided to cloud tenants. Examples of such techniques include, but are not limited to, workload balancing, partition migration, and partition hibernation.
However, while these system level operations can, in certain situations, be used to balance and optimize resource use in the system and boost system performance, they can also expose the systems in the cloud environment, and the users of those systems, to risks at the firmware level. For example, during these system level operations, the firmware can crash and/or experience other problems related to system operation. This can affect the entire system and, in some cases, cause the entire system to go down. As a result, the timing of such operations should be taken into consideration in order to minimize the impact to users. Accordingly, it may be desirable to provide techniques for controlling these system level operations, both at the system management level and at the tenant (or VM) level.