The World Wide Web has expanded to make various services available to the consumer as online web application. A multi-tiered web application is comprises of several internal or external services working together to provide a business solution. These services are distributed over several machines or nodes, creating an n-tiered, clustered on-demand business application. The performance of a business application is determined by the execution time of a business transaction; a business transaction is an operation that completes a business task for end users of the application. A business transaction in an n-tiered web application may start at one service and complete in another service involving several different server machines or nodes. For Example, reserving a flight ticket involves a typical business transaction “checkout” which involves shopping-cart management, calling invoicing and billing system etc., involving several services hosted by the application on multiple server machines or nodes. It is essential to monitor and measure a business application to provide insight regarding bottlenecks in communication, communication failures and other information regarding performance of the services that provide the business application.
A business application is monitored by collecting several metrics from each server machine or node in the system. The collected metrics are aggregated by service or tier level and then again aggregated by the entire application level. The metric processing involves aggregation of hierarchical metrics by several levels for an n-tier business application. In a large business application environment hundreds and thousands of server machines or nodes create multiple services or tiers, each of these nodes generate millions of metrics per minute.
If there is a failure in the metric processing system, for example a downed aggregator, a significant of data could be lost if the repair isn't implemented quickly. Loss of data, both in actually collected data as well as dropped scheduling of tasks to process data, may significantly impact the perceived health of a system and the ability to determine how a system is performing. What is needed is an improved method detecting and responding to aggregator failures that minimizes data loss and task tracking.