In an Information Technology (IT) environment, resources can be monitored locally or remotely. Regardless of which mechanism is used, customers expect the resource to be monitored at all times. Traditionally, multiple nodes (monitoring systems) are designated to monitor the same resource, so that monitoring continues if one of those nodes fails. This approach provides fault tolerance, but it places additional load on the monitored resource. It also has the side effect of producing duplicate data, which means more data to process as the data is moved upstream. Instead of monitoring the resource redundantly, a central coordinator can be designated to decide which system actively monitors the resource. One benefit of having a central coordinator dispatch work is that load balancing is easily built into the solution. However, employing a central coordinator introduces at least two additional complexities: (1) the coordinator becomes a single point of failure (which means it must itself be made fault tolerant); and (2) when the connection between the coordinator and the active collection system drops, recovery is limited to instructing another node to take over the workload.
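The coordinator-based approach described above can be illustrated with a minimal sketch. The class name `Coordinator`, its methods, and the node and resource names are hypothetical and are not taken from any particular monitoring product. The sketch assumes that each resource has exactly one active monitoring node, that new work goes to the least-loaded node, and that the coordinator's only recovery action is to reassign the workload of a failed or unreachable node.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Hypothetical central coordinator: one active monitor per resource."""

    nodes: set[str] = field(default_factory=set)
    # Maps each monitored resource to the single node that monitors it.
    assignments: dict[str, str] = field(default_factory=dict)

    def load(self, node: str) -> int:
        """Number of resources currently assigned to the given node."""
        return sum(1 for owner in self.assignments.values() if owner == node)

    def assign(self, resource: str) -> str:
        """Assign a resource to the least-loaded node (simple load balancing)."""
        if not self.nodes:
            raise RuntimeError("no monitoring nodes available")
        # Ties are broken by node name so the result is deterministic.
        node = min(self.nodes, key=lambda n: (self.load(n), n))
        self.assignments[resource] = node
        return node

    def node_failed(self, node: str) -> dict[str, str]:
        """Drop a failed node and hand its resources to the remaining nodes."""
        self.nodes.discard(node)
        orphaned = sorted(r for r, owner in self.assignments.items() if owner == node)
        for resource in orphaned:
            del self.assignments[resource]
        return {resource: self.assign(resource) for resource in orphaned}


coordinator = Coordinator(nodes={"node-a", "node-b"})
for name in ("db-1", "db-2", "db-3"):
    coordinator.assign(name)

# Simulate losing contact with node-a: its resources move to node-b.
moved = coordinator.node_failed("node-a")
```

The sketch shows both complexities noted above: the `Coordinator` object is itself a single point of failure, and its only response to a lost node is to instruct another node to take over the workload.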
Another aspect of fault tolerance is the prevention of data loss. The traditional way of preventing data loss is to move data to a central (fault tolerant) system as soon as possible. This means that all data has to be sent to a central server, which introduces significant overhead in cases where not all of the data is required on the central server.
There is a need, therefore, for a mechanism that minimizes duplication of data collection, and still maintains load balancing and fault tolerance.