Today's enterprise networks have evolved into complex systems that consist of thousands or even tens of thousands of computing nodes, storage bricks, and multi-tier software components. As a result, how to effectively manage such large-scale infrastructure has become an important problem. Data aggregation, the process of computing aggregate information from distributed data sources, is a fundamental building block underlying many applications for managing enterprise networks. Traditionally, data aggregation has been implemented by schemes that do not scale, such as centralized servers. As enterprise networks continue to grow in size, designing scalable, distributed data aggregation mechanisms has become an urgent task.
In centralized aggregation schemes, a central aggregation point is responsible for directly contacting each node and aggregating data from the node. As a result, such a scheme has fairly high reliability: a node will be missing from the aggregation only if the aggregation point cannot communicate with it. In distributed aggregation, both the aggregation request and the partial aggregation data are relayed by intermediate nodes. As a result, the data from some nodes may be missing due to intermediate node failures, even if the missing nodes are themselves live.
Achieving high reliability in distributed aggregation, i.e., aggregating data from all nodes that are currently live, is highly desirable for many applications, such as distributed diagnosis of performance and security problems. For example, suppose a new software update has been applied to a large number of machines and has subsequently caused a subtle performance problem that occurs only infrequently on a small number of nodes. In diagnosing such a problem, it would be extremely useful to query whether a particular error message has been logged on any machine. In such cases, missing just a few nodes might mean the problem goes undetected, which could be costly.
The reliability of distributed aggregation can be affected by two kinds of failures: forward path failures, which prevent the aggregation request from reaching some nodes, and return path failures, which prevent partial aggregation results from reaching the aggregation point. Forward path failures are caused by latent node failures, i.e., nodes that have failed but whose failure has not yet been detected by their neighbors (those nodes that may potentially send an aggregation request to them). Since no failure detector can detect failures infinitely fast, latent failures are inherent in any distributed environment. To address such failures, forward path redundancy is used to propagate the aggregation request along redundant paths. Return path failures are caused by node failures that happen during the aggregation. Such failures can be addressed by techniques such as child failure bypassing.
Using redundant request propagation to combat latent failures may increase system overhead. Thus, any reliable aggregation system should attempt to minimize the message overhead while improving the aggregation reliability. The performance of redundant request propagation depends strongly on how a node manages its membership information (i.e., which other nodes are known to a given node). Several membership management schemes are therefore analyzed below in terms of how well they support forward path redundancy.
Much previous work has considered continuous aggregation, where the aggregation functions are pre-installed and the aggregate values are constantly updated. See, for example, G. Khanna et al., “Application Performance Management in Virtualized Server Environments,” NOMS 2006, Vancouver, Canada, April 2006; X. Zhang et al., “DONet: A Data-Driven Overlay Network for Peer-to-Peer Live Media Streaming,” In IEEE INFOCOM '05, Miami, Fla., 2005; and M. Jelasity et al., “Gossip-Based Aggregation in Large Dynamic Networks,” ACM Transactions on Computer Systems 21, 3, August 2005, 219-252.
Traditionally, aggregation is associated with computing aggregation functions such as COUNT, AVG and SUM. A more general view is adopted here, in which aggregation means the process of executing an aggregation request locally on each node and returning the execution results in an aggregated form. The aggregation request can be any operation that is executed locally on each node (e.g., searching through a log file and returning a line that matches a certain pattern). The aggregation is assumed to be reductive, i.e., the final aggregation result has a relatively small data size that is independent of the system size. In the case of distributed log search, it is assumed that the aggregation request can specify aggregation functions such as “first K,” “random K,” etc., similar to the well-known “top K” aggregation function for numeric values.
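As an illustration of a reductive aggregation function, the following minimal sketch shows one possible reading of “first K” for a distributed log search. The function names (first_k, local_search), the sample log contents, and the substring match are assumptions made for this example only; they are not taken from any particular implementation.

```python
from typing import Callable, Dict, Iterable, List


def first_k(k: int) -> Callable[[Iterable[List[str]]], List[str]]:
    """Build a merge function that keeps at most k items, in arrival order."""
    def merge(partials: Iterable[List[str]]) -> List[str]:
        merged: List[str] = []
        for partial in partials:
            for item in partial:
                if len(merged) == k:
                    return merged  # result size never exceeds k
                merged.append(item)
        return merged
    return merge


def local_search(node: str, log_lines: List[str], pattern: str) -> List[str]:
    """Hypothetical local request: return the log lines containing pattern."""
    return [f"{node}: {line}" for line in log_lines if pattern in line]


# Hypothetical per-node logs.
logs: Dict[str, List[str]] = {
    "node1": ["ok", "ERROR disk full"],
    "node2": ["ok"],
    "node3": ["ERROR disk full", "ERROR timeout"],
}

merge = first_k(2)
# Each node truncates its own matches, then the partial results are merged.
partials = [merge([local_search(n, lines, "ERROR")]) for n, lines in logs.items()]
result = merge(partials)
print(result)
```

Because every partial result is truncated to K entries before it is passed on, the size of the data that travels between nodes is bounded by K regardless of how many nodes take part.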
In a highly dynamic distributed environment, the reliability of distributed aggregation can be affected by two kinds of failures: forward path and return path failures.
A forward path failure means that the aggregation request is sent to a dead node. Since this node cannot continue forwarding the request, some live nodes may not receive the request and thus may be missing from the aggregation result. For example, suppose node D in FIG. 1 has failed, but the failure has not been detected by node A (i.e., it is a latent failure). When A sends an aggregation request to D, D will not forward it to other nodes. This may cause nodes C, F and G to be missing from the aggregation result, even if they are alive.
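The effect of a latent failure on a fixed tree can be sketched as follows. The topology below is only an assumed stand-in for FIG. 1: it places C, F and G under D, as in the example above, while nodes B and E and the remaining edges are hypothetical.

```python
from typing import Dict, List, Set


def reached_nodes(children: Dict[str, List[str]], root: str, dead: Set[str]) -> Set[str]:
    """Return the nodes that receive a request propagated down a fixed tree."""
    reached: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in dead:
            continue  # a dead node neither processes nor forwards the request
        reached.add(node)
        stack.extend(children.get(node, []))
    return reached


# Assumed tree: A is the aggregation point, D is an interior node.
tree = {"A": ["B", "D"], "B": ["E"], "D": ["C", "F", "G"]}
live = {"A", "B", "C", "E", "F", "G"}  # every node except D is alive

reached = reached_nodes(tree, "A", dead={"D"})
missing = live - reached
print(sorted(missing))
```

In this sketch a single undetected failure at D removes three live nodes from the result, which is the situation that forward path redundancy is meant to avoid.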
Latent failures are inherent in any distributed system, because no failure detector can detect failures infinitely fast in a distributed environment. One way to address latent failures might be to shorten the failure detection time, for example, by sending heartbeat messages at a faster rate. However, this is problematic since it not only increases the system overhead, but also risks a high false positive rate, i.e., declaring a node as failed when it is just a little slow. As a result, it is desirable to design a solution that can tolerate latent failures and still achieve high reliability, rather than one that tries to eliminate latent failures.
A return path failure refers to a node failure that happens during the aggregation. Such a failure may prevent partial aggregation data from reaching the aggregation point. For example, suppose node D in FIG. 1 fails after it has propagated the aggregation request to C, F and G, but before it sends partial aggregation data back to A. In that case, the data from C, F and G might be lost, even if these nodes are live.
As alluded to in FIG. 1, a tree structure is well-suited for distributed data aggregation. Each interior node in the tree can compute the partial aggregation result for its subtree, and return the partial result to its parent. Therefore, the most straightforward way for distributed data aggregation is to maintain a tree structure. Whenever an aggregation request is issued, it is propagated down the tree, and the results are returned along the reverse tree edges.
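Tree-based aggregation as described above can be sketched in a few lines. The tree, the per-node values, and the function name aggregate are assumptions for illustration; a real system would exchange messages between nodes rather than recurse within one process.

```python
from typing import Callable, Dict, List


def aggregate(
    children: Dict[str, List[str]],
    values: Dict[str, int],
    node: str,
    combine: Callable[[int, int], int],
) -> int:
    """Compute the partial aggregation result for the subtree rooted at node."""
    result = values[node]
    for child in children.get(node, []):
        # Each child returns the partial result of its own subtree.
        result = combine(result, aggregate(children, values, child, combine))
    return result


# Hypothetical tree and per-node error counts.
tree = {"A": ["B", "D"], "B": ["E"], "D": ["C", "F", "G"]}
errors = {"A": 0, "B": 2, "C": 1, "D": 0, "E": 0, "F": 4, "G": 1}

total = aggregate(tree, errors, "A", lambda x, y: x + y)      # SUM
worst = aggregate(tree, errors, "A", max)                     # MAX
subtree_d = aggregate(tree, errors, "D", lambda x, y: x + y)  # partial result at D
print(total, worst, subtree_d)
```

Each interior node passes a single combined value to its parent, so the amount of data on any tree edge does not grow with the size of the subtree below it.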
Using a fixed aggregation tree, however, may suffer from the latent failure problem discussed earlier. To overcome latent failures, redundant request propagation may be used. This means the aggregation request is propagated along multiple redundant paths to each node. As a result, if some of the paths are broken due to latent failures, the node can still receive the aggregation request from other paths. If a node receives redundant requests from multiple senders, it chooses only one of them as a parent (e.g., the first one from whom the request is received) and sends its partial aggregation data to the parent. For other senders, it can send a “prune” message back so that the sender will not wait for data from this node.
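The parent selection and prune behavior described above can be sketched as a simple simulation. The overlay below (tree edges plus a few redundant edges), the first-arrival ordering modeled by a FIFO queue, and the name propagate are assumptions for illustration only.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def propagate(
    neighbors: Dict[str, List[str]], root: str, dead: Set[str]
) -> Tuple[Dict[str, Optional[str]], List[Tuple[str, str]]]:
    """Simulate redundant request propagation.

    Returns the parent chosen by each reached node and the list of
    (node, sender) pairs for which a prune message is sent back.
    """
    parent: Dict[str, Optional[str]] = {root: None}
    prunes: List[Tuple[str, str]] = []
    queue = deque((root, n) for n in neighbors.get(root, []))
    while queue:
        sender, node = queue.popleft()
        if node in dead:
            continue  # latent failure: the request is silently lost
        if node in parent:
            prunes.append((node, sender))  # redundant copy: tell sender not to wait
            continue
        parent[node] = sender  # first sender becomes the parent
        queue.extend((node, n) for n in neighbors.get(node, []) if n != sender)
    return parent, prunes


# Assumed overlay: tree edges plus redundant edges B->C, E->F and C->G.
overlay = {"A": ["B", "D"], "B": ["E", "C"], "D": ["C", "F", "G"], "E": ["F"], "C": ["G"]}

parents_ok, prunes_ok = propagate(overlay, "A", dead=set())
parents_fail, prunes_fail = propagate(overlay, "A", dead={"D"})
print(parents_fail)
```

With D dead, nodes C, F and G are still reached through the redundant edges and pick B, E and C as parents, so the aggregation tree is discovered dynamically rather than fixed in advance.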
When an aggregation request is propagated redundantly and each node chooses a parent only after the request is received, the aggregation tree that is actually used is discovered dynamically. Therefore, an aggregation system can be separated into two layers: a lower layer that manages membership information about peer nodes, and an upper layer that utilizes such information for request propagation and result aggregation. Such an architecture is shown in FIG. 2. Clearly, how well the upper layer can utilize forward path redundancy depends on how the lower layer manages membership information about other nodes in the system.
A special case for the membership management is just to maintain a tree structure. Each node knows about its parent and children. Such a membership layer does not provide any redundancy. The aggregation layer can only propagate the request along the tree edges. Thus, the aggregation tree is the same as the membership tree. When there are no failures in the system, such a membership scheme can achieve perfect reliability. However, when there are latent failures, the reliability will be affected.
Gossip protocols (see, for example, A. Demers et al., “Epidemic Algorithms for Replicated Database Maintenance,” ACM PODC, 1987) have recently been used for membership management in many systems (see, for example, A. Kermarrec et al., “Probabilistic Reliable Dissemination in Large-Scale Systems,” IEEE Transactions on Parallel and Distributed Systems 14, 3, March 2003; Y. Chu et al., “Early Experience with an Internet Broadcast System Based on Overlay Multicast,” Proceedings of USENIX Annual Technical Conference, Boston, Mass., June 2004; X. Zhang et al., “DONet: A Data-Driven Overlay Network for Peer-to-Peer Live Media Streaming,” In IEEE INFOCOM '05, Miami, Fla., 2005; and L. Liang et al., “MON: On-demand Overlays for Distributed System Management,” WORLDS '05, 2005). The basic idea is that each node maintains information about a random subset of other nodes in the system (called a local view), and each node periodically exchanges such membership information with random members in order to keep the local view up-to-date.
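The local view exchange can be sketched as below. This is a deliberately simplified model, not the protocol of any of the cited systems: the merge-then-resample rule, the view size, the initial ring of views, and the name gossip_round are all assumptions for illustration.

```python
import random
from typing import Dict, Set


def gossip_round(views: Dict[int, Set[int]], view_size: int, rng: random.Random) -> None:
    """One round: every node exchanges its local view with one random member."""
    for node in list(views):
        if not views[node]:
            continue
        peer = rng.choice(sorted(views[node]))
        # Both sides learn about each other and about each other's members.
        merged = views[node] | views[peer] | {node, peer}
        for member in (node, peer):
            candidates = sorted(merged - {member})  # a node never lists itself
            keep = min(view_size, len(candidates))
            views[member] = set(rng.sample(candidates, keep))


rng = random.Random(1)
n, view_size = 16, 3
# Assumed bootstrap: each node initially knows only its successor on a ring.
views = {i: {(i + 1) % n} for i in range(n)}
for _ in range(10):
    gossip_round(views, view_size, rng)
print(views[0])
```

Each node stores only a small, bounded view, yet repeated exchanges spread knowledge of distant nodes through the system without any node holding the full membership list.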
Gossip-based membership provides a high degree of redundancy. The aggregation layer can propagate the request to any node in the local view. However, since the membership overlay formed by the local views is unstructured, high reliability can only be achieved if each node propagates the request to all nodes in its local view (i.e., flooding) or to a large number of them (see, for example, A. Kermarrec et al., “Probabilistic Reliable Dissemination in Large-Scale Systems,” IEEE Transactions on Parallel and Distributed Systems 14, 3, March 2003). This overhead is incurred even when the system has no failures, which is undesirable. Another drawback of gossip-based membership management is that each node only gossips with random members. Thus, it is difficult to tell whether some members in the local view have failed.
Since a tree-based membership lacks redundancy and a gossip-based membership lacks structure, a simple solution is to maintain both a tree and a random local view in the membership layer. When the aggregation request is received (at the aggregation layer), it is propagated along both the tree edges and some random edges. This may result in good reliability even with a small degree of redundancy. Previous work (see, for example, S. Banerjee et al., “Resilient Multicast Using Overlays,” Sigmetrics '03, June 2003) has explored a similar idea for resilient multicast.
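The combined scheme can be sketched by forwarding along the tree edges plus a small number of randomly chosen members of the local view. The tree, the randomly generated views, the redundancy parameter, and the function names are assumptions for illustration, and the aggregation point is assumed to be live.

```python
import random
from collections import deque
from typing import Dict, List, Set


def forward_targets(node, tree, views, redundancy, rng) -> List[str]:
    """Tree children plus up to `redundancy` random members of the local view."""
    targets = list(tree.get(node, []))
    extras = [m for m in sorted(views.get(node, ())) if m not in targets]
    targets += rng.sample(extras, min(redundancy, len(extras)))
    return targets


def reached_nodes(root, tree, views, redundancy, dead, rng) -> Set[str]:
    """Return the live nodes that receive the request (root assumed live)."""
    reached = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for target in forward_targets(node, tree, views, redundancy, rng):
            if target in dead or target in reached:
                continue  # lost on a dead node, or a redundant copy
            reached.add(target)
            queue.append(target)
    return reached


rng = random.Random(7)
nodes = ["A", "B", "C", "D", "E", "F", "G"]
tree = {"A": ["B", "D"], "B": ["E"], "D": ["C", "F", "G"]}
# Assumed local views: three random other nodes per node.
views: Dict[str, Set[str]] = {
    n: set(rng.sample([m for m in nodes if m != n], 3)) for n in nodes
}

tree_only = reached_nodes("A", tree, views, 0, {"D"}, rng)
hybrid = reached_nodes("A", tree, views, 2, {"D"}, rng)
print(sorted(tree_only), sorted(hybrid))
```

With redundancy set to zero the sketch degenerates to the plain tree and loses the subtree below the failed node; any positive redundancy can only add reached nodes, at the cost of extra request messages.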
Our goal is to achieve high reliability for such on-demand aggregation in a highly dynamic enterprise environment. High reliability means that the aggregate data should be computed from all live nodes currently in the system, despite node failures that may happen before or during the aggregation process.