In the current era, it has become common for information and software applications to be stored and run as services on cloud platforms. Communication services are one example in which this is becoming common; such services may include instant messaging (IM), presence, collaborative applications, voice over IP (VoIP), and other types of unified communication applications. As a result of the growing reliance on cloud computing, these services are required to maintain high availability to customers. Satisfying this requirement is a complex problem due to the distributed nature of cloud-based applications and the non-trivial inter-dependencies among the components of each service.
The cloud computing architectures that are used to provide cloud services should therefore be resilient, i.e., provide reliable failover systems to ensure that users are not affected by outages of the servers hosting their applications or information. Further, said cloud computing architectures must be able to handle failure at a number of levels. For example, if a single server hosting IM or conference services fails, the architecture should be able to provide a failover for the failed server. As another example, if an entire data center with a large number of servers hosting different services fails, the architecture should also be able to provide adequate failover for the entire data center.
Some of the existing approaches for providing resiliency to a distributed system architecture include manually creating fault models using circuit breaker and fallback techniques. However, these approaches have several drawbacks. Creating accurate fault models takes time and becomes prohibitively expensive if the functionality, architecture, and/or dependencies change often. Further, many factors affect the functioning of a complex, distributed system; therefore, manually created fault models are likely to miss many combinations of such factors. Human error and a lack of knowledge of all the dependencies of each component are likely to result in critical failures having high customer impact being omitted from the fault models.
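By way of illustration only, a circuit breaker with a fallback of the general kind referred to above may be sketched as follows. The class name, method names, threshold, and timeout values are hypothetical and chosen for this example; the sketch is not taken from any particular existing system.

```python
import time


class CircuitBreaker:
    """Hypothetical circuit breaker; names and default values are illustrative."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable clock, eases testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, primary, fallback):
        # While the circuit is open, skip the primary until the timeout has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = primary()
        except Exception:
            self.failures += 1
            # Open the circuit at the threshold, or re-open it after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A successful call closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In such a sketch, the threshold and timeout must be chosen by hand for each dependency, which is one instance of the manual modeling effort discussed above.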
In light of the above drawbacks, there is a need for a system and a method which effectively detect and handle failure of one or more services hosted by one or more servers in a distributed computer architecture. There is a need for a system and a method which use a proactive approach for monitoring any abnormality in the behavior of dependency services and automatically provide resiliency services, taking into consideration statistics of the dependency services. Further, there is a need for a system and a method which can be easily integrated with any of the existing services. Yet further, there is a need for a system and a method which can be easily deployed and maintained without incurring high expenses.