Computer systems and applications have become an increasingly complex web of interdependencies, as numerous systems are internetworked in support of modern commerce. A single account access request by a user may result in hundreds of calls to Application Programming Interfaces (APIs) and other services relied on by a front-end or other application. Applications may be structured to rely on multiple dependencies to provide and modify data hosted and/or controlled by other systems. These dependencies may, for example, be used to provide information in response to hundreds of thousands of requests every hour. Performance problems at these dependencies may result in the application entering an unhealthy state, for example leading to the application being unable to retrieve necessary data or timing out on user requests. In a high-traffic system, this can quickly mean that a large number of requests, transactions, and other actions are disrupted.
Service impacts due to performance problems in complex systems can be very damaging. The duration, number of impacted users, and extent of impact to service can increase the profile of a system outage or other unhealthy event. Some companies are subject to reporting regulations when service-impacting events meet certain severity criteria. Thus, it can be critical to quickly diagnose and address a root cause of a performance problem in a complex application dependency setting.
Most troubleshooting processes rely on human administrators making use of multiple systems, monitors, and alerts and then tracing break events back to a source, trying to find the first thing that broke. Correlating or delving into the dependencies that are affecting the system in question is a challenge in existing solutions. If any given set (one or many) of those dependencies fails, both that set and the system in question may suffer an error. System administrators work to differentiate the root cause, the first break, and the impact chain of cascades, and to determine how to address the break event. Users need to use multiple different systems for each purpose and then aggregate issues and reports together mentally or visually using whatever tools they have.
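By way of illustration only, the manual tracing described above can be sketched as a walk over a dependency graph. The following Python sketch is not taken from any existing solution. The function name, the service names, and the rule that a service with no recorded health status is treated as healthy are all assumptions made for the example. It reports the unhealthy services that have no unhealthy dependencies of their own, which are candidates for the first break.

```python
from typing import Dict, List


def root_cause_candidates(
    start: str,
    depends_on: Dict[str, List[str]],
    healthy: Dict[str, bool],
) -> List[str]:
    """Return unhealthy services reachable from `start` that have no
    unhealthy dependencies of their own (hypothetical helper)."""
    candidates: List[str] = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        # Assumption: a service with no recorded status counts as healthy.
        if node in seen or healthy.get(node, True):
            continue
        seen.add(node)
        broken_deps = [
            dep for dep in depends_on.get(node, [])
            if not healthy.get(dep, True)
        ]
        if broken_deps:
            # The failure may have cascaded from further down the chain.
            stack.extend(broken_deps)
        else:
            # Unhealthy with only healthy dependencies: a likely first break.
            candidates.append(node)
    return sorted(candidates)


# Hypothetical services and statuses, for illustration only.
depends_on = {
    "web": ["accounts", "search"],
    "accounts": ["ledger", "auth"],
    "ledger": ["db"],
}
healthy = {
    "web": False, "accounts": False, "ledger": False,
    "db": False, "auth": True, "search": True,
}
print(root_cause_candidates("web", depends_on, healthy))
```

In this hypothetical arrangement, the walk passes through the unhealthy front-end, accounts, and ledger services and stops at the database, the only unhealthy service whose own dependencies are all healthy. In practice, as noted above, the health data for each dependency is typically spread across separate tools and teams, which is what makes the equivalent manual process slow.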
System philosophies such as “you build it, you own it” may provide clear points of contact for troubleshooting. But owners responsible for a system that is healthy but for downstream dependencies would need to reach out to other administrators responsible for each dependency, triggering a cascade of calls each taking potentially tens of minutes. This requires slow and careful effort by the system administrators, which can mean that problems persist for extended times as root causes are determined and corrective action is taken.
Monitoring tools such as dashboards may allow system administrators to view various metrics related to system operation. Experienced system administrators may be able to leverage these dashboards to identify likely causes of system instability. Monitoring interfaces, provided by monitoring interface applications such as Splunk and Elastic (ELK), may facilitate the creation of dashboards by surfacing various operating statistics for an application and allowing inspection of application attributes. However, these monitoring interfaces still require the experience of a system administrator to select the proper attributes to monitor, and to interpret those attributes during an unhealthy state to determine a root cause. Moreover, the monitoring interfaces associated with an application may only provide information about the state of the application itself, and may not surface particular issues in application dependencies. In existing solutions, system administrators responsible for an application use monitoring and alerting tools configured to look at their own system. When an alert is detected, they investigate and take corrective steps. When a dependency impacts a system, the owner of the impacted system has to manually contact the team that owns the dependency and convince that team of the impact and of the need to take corrective steps. This approach may fall short in complex systems or due to the nature of the impact. Furthermore, the impact is not always apparent, as it can range from total system failure to no visible effect at all. This creates opportunity for human error and errors in judgment that carry forward small mistakes that, over time, can also lead to system failure.
Aspects described herein may address these and other shortcomings in existing solutions. Novel aspects discussed herein may facilitate improved monitoring of system health based on application dependencies, and may allow for reduced incident recovery time and increased system resiliency. Quickly singling out a source of failure in a complex system can avoid otherwise lengthy troubleshooting involving many other systems. This may reduce wasted effort spent by system administrators hunting for a root cause. And through use of monitoring interfaces, aspects herein may provide monitoring of dependencies without requiring additional connections and modifications to integrate monitoring services.