Groups of processors are sometimes organized in accordance with a topology such as a spoke-and-wheel topology, or a star or mesh topology, or as an M-by-N array. Groups of processors might be interconnected by a backplane or network of some sort such that any single processor node can communicate with at least some other processor node in the group. In some cases, groups of interconnected processors are logically and/or physically organized into a “cluster”, and the processors within the group of processors share a common storage facility.
For many reasons, the deployer of a cluster (e.g., a site manager, an IT manager, a CIO, a vendor, etc.) would want to assess the “health” of a cluster at any moment in time. Based on the health of the cluster (or an observed degradation thereof), the deployer might take remedial action. Cluster-wide health monitors often operate by installing an agent onto each processing node, collecting observations over time, retrieving (e.g., by a central node) the observations taken at the processor nodes, and assembling health reports (e.g., by a central node) based on the retrieved observations.
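Strictly as an illustration, the agent-based flow described above (install an agent per node, collect observations, retrieve them at a central node, assemble a report) might be sketched as follows. The class names `Agent` and `CentralCollector`, the node identifiers, and the use of an integer CPU-load percentage as the observation are all hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Agent:
    """Hypothetical per-node agent that buffers local observations."""
    node_id: str
    observations: list = field(default_factory=list)

    def observe(self, cpu_load_pct: int) -> None:
        # Record one observation taken at this node.
        self.observations.append(cpu_load_pct)


class CentralCollector:
    """Hypothetical central node that retrieves observations and builds a report."""

    def __init__(self, agents):
        self.agents = list(agents)

    def health_report(self) -> dict:
        # Retrieve each agent's observations and summarize them per node.
        # Nodes with no observations are omitted from the report.
        return {
            agent.node_id: mean(agent.observations)
            for agent in self.agents
            if agent.observations
        }


# Two nodes, each with an installed agent (assumed names).
agents = [Agent("node-a"), Agent("node-b")]
agents[0].observe(20)
agents[0].observe(40)
agents[1].observe(90)

collector = CentralCollector(agents)
report = collector.health_report()
print(report)
```

In this sketch the report is assembled in exactly one place, the collector, and each observation describes a single node in isolation.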
Unfortunately, legacy monitoring fails to make observations or otherwise take into account the health of a processor or group of processors within a group of processors that share a common storage facility. Situations such as one node blocking another node when both nodes access the same common storage facility go undetected and unreported.
Moreover, legacy techniques fail to achieve the necessary degree of resilience in the system so as to provide health reports in the face of faults or other events (e.g., interruption of service, or temporary or permanent node failure). For example, if the aforementioned central node goes down, then the entire facility to provide health reports also goes down.
The advent and rapid adoption of virtualization using virtual machines (VMs) and/or virtualizing executable containers (e.g., Docker containers) brings to the fore many new possibilities for a health monitoring system to advance to a much greater degree of observation and reporting. At the same time, the rapid adoption and use of virtualization techniques bring about an explosion of observations. Strictly as an example, if a cluster has 1024 nodes, legacy techniques would collect observations at 1024 nodes. However, in an environment where each processor hosts several, or dozens, or scores or more virtual machines, and in situations where there are inter-processor communications or effects that are being observed, the number of collected observations grows super-linearly. New highly resilient techniques are needed to deal with inter-processor communications and/or inter-VM communications.
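To make the growth concrete, the following sketch compares the count of per-node observations with the count of pairwise observations. It assumes a simple model that is not stated in the text above: one observation per ordered pair of distinct endpoints, giving n(n-1) observations for n endpoints. The figure of 20 virtual machines per node is likewise a hypothetical value chosen only for the arithmetic.

```python
def pairwise_observations(endpoints: int) -> int:
    """Ordered pairs of distinct endpoints, assuming every pair is observed once."""
    return endpoints * (endpoints - 1)


NODES = 1024        # cluster size used in the example above
VMS_PER_NODE = 20   # hypothetical: a "score" of VMs on each node

# Legacy case: one observation stream per node, so growth is linear.
per_node = NODES

# Inter-node effects: every node observed against every other node.
inter_node = pairwise_observations(NODES)

# Inter-VM effects: every VM observed against every other VM in the cluster.
inter_vm = pairwise_observations(NODES * VMS_PER_NODE)

print(f"per-node:   {per_node:,}")
print(f"inter-node: {inter_node:,}")
print(f"inter-VM:   {inter_vm:,}")
```

Under this assumed model the count is quadratic in the number of endpoints, which is one way the "super-linear" growth can arise. A real deployment might observe only a subset of pairs.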
A “virtual machine” or a “VM” refers to a specific software-based implementation of a machine in a virtualization environment in which the hardware resources of a real computer (e.g., CPU, memory, etc.) are virtualized or transformed into virtualized resources that support fully functional virtual machines, each of which can be configured to run its own operating system. Using virtualized resources, an instance of a virtual machine can include all or some components of an operating system as well as any applications, browsers, plug-ins, etc., any of which can use the underlying physical resources just like a real computer.
Virtual machines work by inserting a thin layer of software directly onto the computer hardware or onto a host operating system. This layer of software contains a virtual machine monitor or “hypervisor” that allocates hardware resources dynamically and transparently. Multiple operating systems can run concurrently on a single physical computer and share hardware resources with each other. By encapsulating an entire machine, including CPU, memory, operating system, and network devices, a virtual machine is completely compatible with most standard operating systems, applications, and device drivers. Many modern implementations support concurrent running of several operating systems and/or containers and/or applications on a single computer, with each of the several operating systems and/or containers and/or applications having access to the resources it needs when it needs them.
Virtualization allows a deployer to run multiple virtualizing entities on a single physical machine, with each virtualizing entity sharing the resources of that one physical computer across multiple environments. Different virtualizing entities can run different operating systems and multiple applications on the same physical computer.
Such virtualization makes it easier to manage a large set of processing nodes (e.g., arrays) that may be delivered with multiple processors on a board, or in a blade, or in a unit, or rack or chassis, etc. Monitoring the health of such a set of processors as well as their respective peripherals has been attempted by using health-monitoring agents that take measurements and/or make observations at each node.
Unfortunately, rather than taking advantage of the flexibilities offered by virtualization, legacy techniques rely on centralized services provided at a central node. Such a central node can fail due to certain events (or be taken out of service due to certain events), which events can result in missed observations over the set of nodes. Missed observations in turn can precipitate a domino effect whereby early warnings and/or alerts might be missed, resulting in a lack of or late remediation, further resulting in degradation of performance or, in some cases, a complete loss of function of one or more nodes. Certain legacy deployments involving off-site centralized management facilities can fail due to failure of the off-site centralized management facility itself and/or failure in the communication fabric between the centralized management facility and nodes under management. Legacy deployments fail to provide any techniques for self-reconfiguration in the event of a failure or loss of utility of a computing component, whether it be a hardware component or a software component. Moreover, legacy techniques fail to account for inter-node effects or other cluster-wide effects that emerge due to any of (1) aspects pertaining to inter-node I/O (input/output or IO), and (2) aspects pertaining to node-to-shared-storage I/O.
What is needed is a technique or techniques to improve over legacy and/or over other considered approaches. Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.