The present invention relates generally to clustering operational parameter values, and more particularly to enabling problem resolution in a micro-service architecture used in a cloud computing system.
Today's computing landscape, whether in large enterprises or as part of a service provider's operation, comprises a plurality of large networked computing resources with a large number of physical and virtual machines and software-defined infrastructure components (e.g., software-defined storage, software-defined networking) serving as the basis for a large number of interconnected micro-services. Such landscapes are difficult to maintain, especially because many of today's micro-services must provide functionality conforming to strict service level agreements (SLAs).
Computing landscapes are very often implemented as cloud computing environments in which application functionality is decomposed into a set of collaborating micro-services, each of which may be scaled, upgraded, and managed independently by different developers. Since multiple networked micro-services work in cooperation, e.g., by calling each other's services, to generate a response to a user's request, obtaining an end-to-end view of the application execution becomes quite difficult. This is particularly true in case a problem situation, i.e., an error, a performance problem, or the like, arises. Isolating root causes of individual performance degradations or malfunctions in a production environment becomes a real challenge.
One of the key performance indicators for the operations staff of such cloud computing data centers, running a large plurality of end-user applications, is often the time to resolve an issue or incident. A sophisticated problem resolution approach requires more than near real-time monitoring of elements of the cloud computing environment, such as physical servers, virtual machines, storage systems, and networking and routing components. Typical enterprise applications are increasingly composed of hundreds of instances of heterogeneous micro-services. With developers constantly improving or adding new features to those micro-services and deploying them directly as production instances (under a DevOps approach), performance regressions are no longer a rarity. As the development of the applications and micro-services grows and diversifies over time, multiple versions of an application workflow and its respective micro-services begin to coexist. Efficiently managing application performance in such polymorphic environments is essential for maintaining the end-user experience when interacting with the applications.
End-users' expectations in terms of problem resolution times increase constantly, so that they can efficiently pursue their personal and enterprise goals. It is no longer sufficient to log performance parameters and potentially display them on a system management console. In fact, operations and problem resolution staff have the same growing expectations in terms of computerized support for resolving problem situations occurring in cloud computing environments. They expect more than anonymous lists of operational parameters: they expect direct insight and guidance on how to repair malfunctioning systems.