Data and event management networks provide an approach to remotely monitor and maintain many systems grouped under an organization. These remote systems may generate multiple events (e.g., hardware and software management events or application metrics). These events may produce indicators known as telemetry data. For example, telemetry data may include indicators that CPU utilization has dropped below a threshold, a file system is full, or a component has failed. This telemetry data may be utilized by the data and event management network to act on the remote systems to address the events. Typical responses may include shutting down a remote system, triggering log file archiving, or scheduling a card replacement request.
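The flow from events to telemetry indicators to responses can be illustrated with a minimal sketch. Everything named here is hypothetical: the `Event` fields, the threshold values, the indicator strings, and in particular the pairing of each indicator with a response are assumptions made for illustration, since the text lists indicators and responses without stating which response addresses which indicator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """One hypothetical report from a remote system."""
    system_id: str
    cpu_utilization: float      # fraction in [0, 1]
    disk_used_fraction: float   # fraction in [0, 1]
    component_ok: bool


# Assumed thresholds; real values would be site-specific.
CPU_LOW_THRESHOLD = 0.05
DISK_FULL_THRESHOLD = 0.99

# Assumed pairing of telemetry indicators with responses.
RESPONSES: Dict[str, str] = {
    "cpu_below_threshold": "shut_down_system",
    "file_system_full": "archive_log_files",
    "component_failed": "schedule_card_replacement",
}


def derive_indicators(event: Event) -> List[str]:
    """Turn a raw event into zero or more telemetry indicators."""
    indicators = []
    if event.cpu_utilization < CPU_LOW_THRESHOLD:
        indicators.append("cpu_below_threshold")
    if event.disk_used_fraction >= DISK_FULL_THRESHOLD:
        indicators.append("file_system_full")
    if not event.component_ok:
        indicators.append("component_failed")
    return indicators


def plan_responses(indicators: List[str]) -> List[str]:
    """Look up a response for each known indicator, skipping unknown ones."""
    return [RESPONSES[name] for name in indicators if name in RESPONSES]


if __name__ == "__main__":
    event = Event("sys-1", cpu_utilization=0.02,
                  disk_used_fraction=1.0, component_ok=True)
    print(plan_responses(derive_indicators(event)))
```

Keeping indicator derivation separate from response planning mirrors the distinction the text draws between telemetry data and the actions taken on it.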
Transmitting and managing system telemetry data for an organization's many customer systems is a difficult problem. Current approaches to this problem utilize a traditional centralized event management architecture. This architecture may have many global data centers that funnel their data through an event consolidation center, where the data is analyzed and managed at various management consoles.
In a system with such a centralized event management structure, all system events must be funneled through a central location. This central location acts as a bottleneck for the system events and is also a single point of failure. Furthermore, replicating this central location can be prohibitively expensive. As a result, the traditional approach cannot scale to handle millions of systems.
Furthermore, today's network environments are increasing in complexity at a geometric rate, while the traditional centralized management structure tends to grow at a linear rate. Eventually, a critical point will be reached where the complexity of the managed network environment exceeds its manageability. As a result, under a centralized approach, increased complexity leads to increased system management costs. Human intervention is also much more expensive than automated responses. Human and system management resources cannot grow geometrically and, in many cases, the human resource allocation is shrinking. This results in a manageability gap.
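The critical point described above can be made concrete with a small calculation. The sketch below assumes complexity is multiplied by a fixed ratio each period while management capacity increases by a fixed step. The function name, the units, and all numeric values are hypothetical and chosen only to show the crossover, not drawn from any measured environment.

```python
from typing import Optional


def first_gap_period(complexity0: float, growth_ratio: float,
                     capacity0: float, capacity_step: float,
                     max_periods: int = 1000) -> Optional[int]:
    """Return the first period in which complexity exceeds capacity.

    Complexity grows geometrically (multiplied by growth_ratio each
    period); capacity grows linearly (capacity_step added each period).
    Returns None if no crossover occurs within max_periods.
    """
    complexity = complexity0
    capacity = capacity0
    for period in range(max_periods + 1):
        if complexity > capacity:
            return period
        complexity *= growth_ratio
        capacity += capacity_step
    return None


if __name__ == "__main__":
    # Illustrative values only: complexity doubles each period,
    # capacity starts ten times higher and gains 10 units per period.
    print(first_gap_period(1.0, 2.0, 10.0, 10.0))
```

Even when capacity starts well ahead, any growth ratio above 1 eventually overtakes a linear increase, which is the manageability gap the paragraph describes.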
A system that achieves global system event monitoring and management in a way that scales to handle a very large number of systems in a cost-effective manner and that is highly resilient to changes and failures in the global environment would be beneficial.