Field of Invention
Embodiments of the present invention generally relate problem detection in a distributed services-oriented system, and more specifically to techniques for identifying which of a plurality of interconnected and interrelated services is the cause of a given problem(s).
Description of Related Art
Rather than relying on a single large software application to provide every facet of a modern software solution, many software solutions today are made up of a substantial number of different services that are designed to work together to provide the functionality of the software solution. For instance, rather than writing a single standalone application that provides an online content streaming service, such a service could be provided by tens or even hundreds of smaller software services, each designed to perform a specific set of tasks, and that work together to provide the content streaming service. Doing so has several pronounced advantages. For instance, it can be easier to compartmentalize the development of the software application, as each standalone service can be assigned to a small group of programmers for implementation. This helps to alleviate complicated merge operations and troubleshooting operations during the development process, as each standalone service can be compiled and tested individually. Additionally, doing so greatly improves the modularity of the software solution, allowing individual services to be easily removed and replaced with updated services that perform the same task. As yet another advantage, such a modularized design allows the software solution to be easily distributed and redistributed over multiple different compute nodes (either physical or virtual), based on how the different services are positioned and configured.
However, there are drawbacks to such a modularized design as well. For instance, it can potentially be difficult to pinpoint the root cause of a problem in a heavily distributed software solution. For example, consider a solution made up of several hundred interconnected services. In such an environment, a problem occurring in one of the services may adversely affect the performance of several other services, which in turn may adversely affect the performance of still other services. When this occurs, the developers and engineers may have difficulty pinpointing which of the many malfunctioning services originally caused the problem. As another example, when a particular service begins consuming a large amount of system resources, it may be difficult to determine whether an update to the particular service is causing the heavy resource usage, or whether an update to another one of the services is causing the heavy resource usage.