Cloud computing refers generally to an information technology paradigm that enables access to shared pools of configurable system resources and higher-level services that can be rapidly provisioned with little management effort, oftentimes using distributed computing systems and oftentimes over the Internet. Cloud computing generally relies on the sharing of resources to achieve coherence and the ability to scale, potentially dynamically, to address requests from requesting computing systems and the like.
Cloud computing has become more and more ubiquitous, especially as microservice architectures have gained in popularity. Microservice architecture may be thought of as being a variant of the service-oriented architecture (SOA) architectural style, which structures an application as a collection of loosely-coupled services. In microservice architectures, services typically are fine-grained and the protocols typically are lightweight. Potential benefits of decomposing an application into different smaller services include the ability to improve modularity, and the ability to make applications easier to understand, develop, and test. Microservice-based architectures can enable continuous delivery and deployment so that applications can emerge over time, e.g., through independent and modular development.
Current failover systems are mostly implemented in hardware systems and/or third-party software. Although such systems and/or software approaches are of great benefit in a variety of scenarios, they generally do not work well with today's cloud computing infrastructures and microservice architectures, where failover mechanisms oftentimes need to be tailored to the demands of specific web applications, services, or the like. That is, current failover mechanisms generally have no knowledge of the data or use cases of the corresponding services. Yet module-based standalone services are much more applicable to modern usage in distributed computing environments. As a result, well-known failover mechanisms generally only look at the availability of specific services or application back-ends (e.g., databases, search engines, etc.) at a very abstract level.
One well-known type of failover mechanism involves a simple “aliveness” check, e.g., a ping, which determines whether a specific node of a cluster is reachable. A common strategy to provide for failover with this approach is the fail-fast strategy. This strategy directly triggers a failover mechanism as soon as a specific node is determined to be no longer reachable. In this approach, if there is downtime corresponding to even a few seconds or a few minutes, the overhead of switching to another service instance is accepted and incurred, which in some instances can even lead to longer downtimes or delays, e.g., where there are time-intensive startup procedures for complex services or replication instances, etc. Another common strategy is the “on fail, try next available” strategy. In contrast to the fail-fast strategy, which simply triggers a failover as soon as a node is not reachable anymore, this strategy first tries the next best reachable node before triggering a failover. Yet another strategy is the “on fail, try all” strategy, which tries to reach all available nodes in a cluster. These latter two strategies, however, are not particularly well tailored to cloud computing environments and still depend on simple “aliveness” checks and thus operate at a very abstract level.
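The three conventional strategies described above can be contrasted in a minimal Python sketch. The function names, the node list, and the probe callable are hypothetical and stand in for whatever “aliveness” check (e.g., a ping) a given system uses. The sketch also assumes that “next best” simply means the next node in a fixed order.

```python
from typing import Callable, Optional, Sequence

# Hypothetical aliveness probe: returns True if the node answers (e.g., a ping).
Probe = Callable[[str], bool]


def fail_fast(primary: str, is_alive: Probe) -> bool:
    """Return True if a failover should be triggered immediately."""
    return not is_alive(primary)


def try_next_available(
    nodes: Sequence[str], current: int, is_alive: Probe
) -> Optional[str]:
    """On failure of nodes[current], try only the next node in order.

    Returns the node to use, or None if a failover must be triggered.
    """
    if is_alive(nodes[current]):
        return nodes[current]
    candidate = nodes[(current + 1) % len(nodes)]
    return candidate if is_alive(candidate) else None


def try_all(
    nodes: Sequence[str], current: int, is_alive: Probe
) -> Optional[str]:
    """Probe every node in the cluster, starting with the current one.

    Returns the first reachable node, or None if none is reachable.
    """
    count = len(nodes)
    for offset in range(count):
        candidate = nodes[(current + offset) % count]
        if is_alive(candidate):
            return candidate
    return None
```

In all three functions the only input is reachability, which illustrates why such strategies operate at a very abstract level: nothing about the data or use case of the affected service enters the decision.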
It will be appreciated that in providing a more sophisticated failover mechanism, it would be desirable to provide or collect appropriate information and send it to a service that is configured to process the provided information and take decisions based on patterns, machine-learning algorithms, hard-coded rules, and/or the like, that specify how to handle specific situations such as, for example, how to handle the network traffic associated with a specific web service or application in the event of a failure. The information could include, for example, parameters of the underlying infrastructure or cloud computing environment, status events from the involved services, failure detection within the infrastructure, errors arising within a running application, etc.
Additionally, it will be appreciated that it would be desirable to provide an enhanced solution that can decide on how and when to trigger certain failover mechanisms, e.g., based on the computational results described above. The technical solution could in some instances consider a second stage detection mechanism that makes it possible to verify whether the provided information is correct, if something else has led to the failover situation, etc. In this regard, a so-configured failover service therefore may include a component that checks on a lower level whether a failover is needed (e.g., whether connection to the underlying database really has been lost) and, if so, determines how failover can be achieved.
Switching to a potential backup service usually involves computational overhead and otherwise resource intensive tasks. It therefore would be desirable to configure a failover service with an intelligent component that can make decisions regarding whether and at what time a switch is needed, whether a simple wait and retry loop with a possible notification to the end users about the service unavailability could be an adequate solution, etc.
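A simple wait and retry loop with end-user notification, as mentioned above, might look like the following Python sketch. The parameter names, the default retry count and delay, and the notification message are assumptions made for illustration. The health check is supplied by the caller.

```python
import time
from typing import Callable


def wait_and_retry(
    check: Callable[[], bool],
    max_retries: int = 3,
    delay_seconds: float = 1.0,
    notify: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry a health check before concluding that a switch is warranted.

    Returns True if the service recovered during the loop, or False if it
    is still unavailable after the retries (a switch may then be needed).
    """
    for attempt in range(1, max_retries + 1):
        if check():
            return True
        # Tell end users about the unavailability instead of switching now.
        notify(f"service unavailable, retry {attempt} of {max_retries}")
        sleep(delay_seconds)
    # One final check after the last wait.
    return check()
```

Passing the sleep function as a parameter is a design choice of this sketch only. It lets the loop be exercised without real delays.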
One aspect of certain example embodiments relates to addressing the above-described and/or other issues. For example, one aspect of certain example embodiments relates to failover systems suitable for use in cloud computing environments, e.g., where microservice, SOA, and/or other architectures are implemented.
Certain example embodiments provide a resilience multiplexer configured to collect information and act as a communication unit between all involved applications and services, including backend services of the environment to be supervised. The resilience multiplexer in certain example embodiments directly communicates with the failover service, which is responsible for initiation of failover procedures. In certain example embodiments, a component manages a wait and retry loop in case the decision was made to not trigger a failover, and keeps track of and learns from the decisions made, e.g., for application to future events.
Certain example embodiments advantageously provide an approach that determines whether failovers really are needed as well as a generic failover approach that involves taking a deeper look into the involved applications and services. Certain example embodiments thus provide technical improvements in terms of triggering fewer failovers, triggering failovers at more appropriate times, providing failovers specifically tailored for distributed computing environments including cloud-based applications and services, and providing the ability to learn how to more gracefully handle issues that arise in the future.
In certain example embodiments, a distributed computing system (e.g., a cloud computing system) is provided. In the system, a plurality of computing nodes cooperate to execute a shared application and/or service accessible by client computing devices over a network. The system also includes a failover service. Processing resources, including at least one processor, are configured as a multiplexer for use with the distributed computing system. The multiplexer is configured to at least receive a signal indicative of a potential problem with a component of the distributed computing system and/or the network; responsive to receipt of the signal, identify, from a store including a plurality of rules, at least one rule to be executed to determine how to respond to the potential problem, the at least one rule being identified based on (a) attributes of the received signal including which component generated the signal and what information is included in and/or otherwise associated with it, and (b) other network-related data; execute the at least one identified rule to determine whether a failover is or might be needed; in response to a determination that a failover is needed, confirm that the failover is needed and, based on the confirming, selectively trigger the failover service to initiate a preprogrammed failover sequence; and in response to a determination that a failover might be needed, initiate operation of a resilience mode in which: information regarding the potential problem is communicated to one or more components of or connected to the distributed computing system, other than the component that generated the received signal, without immediately initiating a preprogrammed failover sequence; and the operation of the resilience mode is continued, without initiating a preprogrammed failover sequence, until one or more predefined conditions are met, at which point either (a) the operation of the resilience mode is cancelled and the failover service is triggered to initiate a preprogrammed failover sequence, or (b) the operation of the resilience mode is cancelled and the failover service is not triggered.
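The control flow of the multiplexer described above can be sketched in Python as follows. This is an illustrative sketch only, not the claimed implementation: the class and field names, the keying of the rule store by signal source, the three-valued polling callable, and the polling bound are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class Verdict(Enum):
    NOT_NEEDED = 0
    MIGHT_BE_NEEDED = 1
    NEEDED = 2


@dataclass(frozen=True)
class Signal:
    source: str  # component that generated the signal
    info: str    # information included in or associated with the signal


# A rule inspects the signal plus other network-related data.
Rule = Callable[[Signal, Dict[str, object]], Verdict]


@dataclass
class Multiplexer:
    rules: Dict[str, Rule]                      # rule store keyed by source (assumption)
    confirm: Callable[[Signal], bool]           # second-stage confirmation
    trigger_failover: Callable[[Signal], None]  # hands off to the failover service
    notify_others: Callable[[Signal], None]     # informs other components or clients
    # Returns None to keep waiting, True to fail over, False to cancel.
    poll_conditions: Callable[[], Optional[bool]]
    max_polls: int = 10                         # hypothetical safety bound

    def handle(self, signal: Signal, network_data: Dict[str, object]) -> str:
        verdict = self.rules[signal.source](signal, network_data)
        if verdict is Verdict.NEEDED:
            # Confirm before selectively triggering the failover service.
            if self.confirm(signal):
                self.trigger_failover(signal)
                return "failover"
            return "no_failover"
        if verdict is Verdict.MIGHT_BE_NEEDED:
            return self._resilience_mode(signal)
        return "no_failover"

    def _resilience_mode(self, signal: Signal) -> str:
        # Communicate the potential problem without failing over yet.
        self.notify_others(signal)
        for _ in range(self.max_polls):
            outcome = self.poll_conditions()
            if outcome is None:
                continue
            if outcome:
                self.trigger_failover(signal)
                return "failover"
            return "no_failover"
        # Assumption: exhausting the bound is itself treated as a
        # predefined condition that triggers the failover.
        self.trigger_failover(signal)
        return "failover"
```

A production variant would presumably wait between polls and select rules on more attributes than the signal source alone. Both are omitted here to keep the two branches, confirmed failover and resilience mode, easy to follow.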
According to certain example embodiments, one or more machine-learned rules may be considered in determining whether a failover is or might be needed. Output from the one or more considered machine-learned rules may be weighted less than the at least one identified rule in determining whether a failover is or might be needed.
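One way to weight machine-learned rule output less than the identified rules is a weighted average, sketched below in Python. The numeric scores in the range 0 to 1, the default weight of 0.5, and the two thresholds are assumptions chosen for illustration. The text above does not prescribe any particular weighting scheme.

```python
from typing import Sequence


def combine_scores(
    identified: Sequence[float],
    learned: Sequence[float],
    learned_weight: float = 0.5,
) -> float:
    """Weighted average in which learned rules count less than identified ones.

    Each score is assumed to lie in [0, 1], where 1 means "failover needed".
    Identified rules carry weight 1.0; learned rules carry learned_weight.
    """
    if not 0.0 <= learned_weight < 1.0:
        raise ValueError("learned_weight must be in [0, 1)")
    weighted = [(score, 1.0) for score in identified]
    weighted += [(score, learned_weight) for score in learned]
    total_weight = sum(weight for _, weight in weighted)
    if total_weight == 0:
        raise ValueError("at least one weighted score is required")
    return sum(score * weight for score, weight in weighted) / total_weight


def classify(score: float, needed_at: float = 0.8, might_at: float = 0.4) -> str:
    """Map a combined score to one of three outcomes (thresholds are hypothetical)."""
    if score >= needed_at:
        return "needed"
    if score >= might_at:
        return "might_be_needed"
    return "not_needed"
```

For example, an identified rule that votes for a failover (1.0) combined with a learned rule that votes against it (0.0) yields a combined score of about 0.67, which these thresholds classify as a failover that might be needed rather than one that is needed.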
According to certain example embodiments, machine-learned rules may be based on failover efficacy information, which may be provided over a feedback bus or the like by an authorized user and/or automatically and programmatically. The efficacy information may indicate whether a failover ultimately was needed in response to a given received signal, how one or more components performed during operation of the resilience mode, etc.
According to certain example embodiments, the information regarding the potential problem is communicated to the failover service and indicates that a failover might soon be needed, and/or the information regarding the potential problem is communicated to one or more client computing devices that is/are at least attempting to interact with the shared application and/or service.
According to certain example embodiments, the received signal may originate with a cloud controller provided to the distributed computing system that is configured to track distributed computing system service information, and the one or more predefined conditions may relate to whether the distributed computing system is operating in a maintenance mode and to a retry counter, as determined in connection with the cloud controller; the received signal may originate with a registry service provided to the distributed computing system that is configured to track computing node status, and the one or more predefined conditions may relate to whether a computing node has been added, stopped, or removed, as determined in connection with the registry service; the received signal may originate with an error handler of the shared application and/or service, and the one or more predefined conditions may relate to whether there is a connection issue with respect to the shared application and/or service and/or a runtime error, as determined in connection with the error handler; and/or the received signal may originate with the failover service, and the one or more predefined conditions may relate to whether there is a connection issue, node failure, and/or node stoppage.
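The pairing of signal sources with predefined conditions described above could be represented as a simple lookup table, as in the following Python sketch. The identifier strings are hypothetical labels for the sources and conditions named in the paragraph and are not taken from any actual interface.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical labels for the four signal sources and their conditions.
CONDITIONS_BY_SOURCE: Dict[str, FrozenSet[str]] = {
    "cloud_controller": frozenset({"maintenance_mode", "retry_counter"}),
    "registry_service": frozenset({"node_added", "node_stopped", "node_removed"}),
    "error_handler": frozenset({"connection_issue", "runtime_error"}),
    "failover_service": frozenset(
        {"connection_issue", "node_failure", "node_stopped"}
    ),
}


def relevant_conditions(source: str, observed: Iterable[str]) -> FrozenSet[str]:
    """Return the observed conditions that apply to the given signal source."""
    try:
        known = CONDITIONS_BY_SOURCE[source]
    except KeyError:
        raise ValueError(f"unknown signal source: {source}") from None
    return known & frozenset(observed)
```

With such a table, a signal from the registry service would be evaluated only against node-status conditions, while conditions reported for another source would be ignored for that signal.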
In addition to the features of the previous paragraphs, counterpart methods, non-transitory computer readable storage media tangibly storing instructions for performing such methods, executable computer programs, and the like, are contemplated herein, as well. Similarly, servers, client devices, and the like, usable in connection with the systems laid out in the previous paragraphs, also are contemplated herein.
These features, aspects, advantages, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.