1. Technical Field
This disclosure generally relates to computer systems, and more specifically relates to managing a cloud computing environment using streaming state data.
2. Background Art
The combination of hardware and software on a particular computer system defines a computing environment. Different hardware platforms and different operating systems thus provide different computing environments. In recent years, engineers have recognized that it is possible to provide different computing environments on the same physical computer system by logically partitioning the computer system resources into different computing environments known as virtual machines. The System X computer system developed by IBM is an example of a computer system that supports logical partitioning into multiple virtual machines. If multiple virtual machines on a System X computer system are desired, partition manager code (referred to as a “hypervisor” in IBM terminology) is installed that allows defining different virtual machines on the same platform. Once the partition manager is installed, virtual machines may be created that define different computing environments. The partition manager manages the logical partitions to ensure that they can share needed resources in the computer system while maintaining the separate computing environments defined by the virtual machines.
Virtual machines are used extensively in cloud-based computing solutions. As demand for cloud solutions increases, open source software for building clouds, such as OpenStack, has become a building block for creating a reliable and flexible cloud platform.
As cloud environments continue to grow in scale, management of the cloud environment becomes more complex and problematic. When a problem occurs in one virtual machine (VM) or in one localized section of the cloud environment, determining the cause of the problem can be complex and labor-intensive. For example, if there are 70 compute nodes in a cloud environment and the system encounters an issue whose root cause is not trivial, the administrator may have to examine possibly all 70 compute nodes to collect diagnostic data. In most cases the actual root cause may have occurred at some point in time before any observable problem was detected. In many cases a root cause is actually a combination of two or more factors that static logging methods may be unable to correlate, so administrators would need to mine this information manually. Aggressive static logging also degrades performance and consumes storage if the system tries to log everything and save that quantity of information to disk.
In global cloud environments where administrators may reside in different locations, there may be a knowledge gap: administrators in one time zone may be aware of aspects of the current state of the cloud landscape that are not apparent or known to administrators in another time zone. For example, an administrator in the United States may want to shut down a system based in China but may not be sure which IP addresses are in use or which users are active, information that would be known to the China-based administrators. There is currently no solution that allows an administrator to determine the current state of the cloud without taking many manual steps to interrogate the state of many individual VMs and host computer systems.