1. Field of the Invention
The present invention generally relates to networks. More specifically, the present invention relates to systems and methods for monitoring the availability of a network and, in preferred embodiments, the monitoring of availability of one or more nodes in a multicast network environment.
2. Description of the Related Art
At a high level of abstraction, a network may be thought of as a series of communicatively coupled nodes, which exchange information with each other. Non-limiting examples of nodes include client computing devices and platforms (e.g., personal computers, cellular telephones, tablet computers, etc.), routers, switches, servers and the like. Some nodes may generate information, such as servers; others may consume this generated information, such as client computing platforms, while yet others simply relay the information from one node to another, such as routers and switches. As data is transmitted from a source node to a destination node, it may pass through a series of intermediate nodes along the way. Often there is a one-to-one correspondence between source and destination nodes, which is termed unicast routing. Alternatively, a single source node may transmit generated data to a plurality of destination nodes, which is termed multicast routing. IP multicast is an example of a multicast routing method that is commonly employed in networks.
Determining communication failures within unicast routing is relatively easy, as the source node has advance knowledge of exactly which node it wishes to communicate with, and a lack of acknowledgement from the destination node back to the source node is construed by the source node as a communications failure. In multicast routing, however, the situation may become more complex, as the source node may not have advance knowledge of all the destination nodes. Typically, multicast routing is set up in a network in such a manner that the source node needs only to transmit a single instance of the generated data to another node; the intermediate nodes in the network replicate this source data as many times as needed to finally reach the multiple destination nodes.
Using IP multicast by way of example, a multicast source node uses a group address as a destination address for the data it wishes to transmit to multiple destination nodes. The destination nodes, in turn, use this group address to inform the network that they wish to receive packets sent to that group; that is, the destination node joins the source node's group. As a result, the source node does not need, or even have, advance knowledge of the destination nodes. Rather, it is nodes that are close to the destination nodes that keep track of replicating and distributing the multicast data. Under this framework, then, the source node cannot tell if a destination node has not received the multicast data, which can lead to problems in the network.
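By way of a non-limiting illustration, the group-address mechanism described above may be sketched with standard UDP sockets as follows. The group address 239.1.1.1, the port 5007 and the function names are assumptions chosen purely for illustration and are not part of any particular embodiment. The sketch shows that the destination node joins the group on its own initiative, while the source node addresses a single datagram to the group and holds no list of receivers.

```python
import ipaddress
import socket
import struct

GROUP = "239.1.1.1"  # hypothetical group address (administratively scoped range)
PORT = 5007          # hypothetical UDP port


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the 8-byte ip_mreq value passed to IP_ADD_MEMBERSHIP."""
    if not ipaddress.IPv4Address(group).is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def open_receiver(group: str = GROUP, port: int = PORT) -> socket.socket:
    """Destination node: join the group so the network forwards its traffic here."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock


def send_once(payload: bytes, group: str = GROUP, port: int = PORT) -> None:
    """Source node: send one datagram to the group address, not to any receiver."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(payload, (group, port))
```

Note that nothing in send_once reports whether any receiver obtained the datagram, which is the visibility gap discussed above.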
All users of multicast routing for communications fan-out, particularly in distributed multi-publisher and consumer systems, have a critical need for visibility into the health of multicast flows between components of the network. When this communication breaks down, the result is often a non-trivial degradation of application performance or functionality. By way of example, the following two scenarios illustrate potential issues with traditional multicast network environments.
The first example involves a “split-brain” scenario in “Hot/Hot” application instances, as may occur, for example, in the trading of securities in the context of Smart Order Routing (“SOR”). As illustrated in FIG. 1, a network 10 implements SOR using a common distributed application architecture, in which SOR applications 22, 24 run as instances on destination nodes 20. In this grid-like architecture, redundant SOR instances 22, 24 under (i.e., within the domain of) a parent node 12 are run in a so-called “Hot/Hot” configuration. A first SOR instance 22 is active and handles the processing of trades. A second SOR instance 24 serves as a data recovery (DR) backup instance in the event of failure of the primary SOR instance 22.
Problems can arise when communication between some or all of the components becomes limited, leading to a so-called “split brain.” When a split brain condition arises, parts of the application on separate, non-communicating components may have conflicting internal views of the network and data. FIG. 1 illustrates an example in which SOR 2-DR, due to a limited ability to communicate with the network 10, as indicated by the corresponding dashed arrow, wrongly determines that SOR 2 is not functioning and so switches from backup mode to primary mode. As a result, both SOR 2 and SOR 2-DR work the same order.
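The split brain condition of FIG. 1 may be illustrated with the following minimal sketch. The manner in which a backup instance decides that its primary has failed is not specified above; the missed-heartbeat counter, the threshold of three, and all class and function names are assumptions made solely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    role: str        # "primary" or "backup"
    missed: int = 0  # consecutive heartbeats not received from the peer

    def observe_peer(self, heartbeat_received: bool, limit: int = 3) -> None:
        """A backup promotes itself after `limit` consecutive missed heartbeats."""
        if self.role != "backup":
            return
        self.missed = 0 if heartbeat_received else self.missed + 1
        if self.missed >= limit:
            self.role = "primary"


def simulate(link_ok: bool, ticks: int = 5) -> list:
    """Return the names of the instances acting as primary after `ticks` intervals."""
    sor = Instance("SOR 2", "primary")    # healthy throughout the simulation
    dr = Instance("SOR 2-DR", "backup")
    for _ in range(ticks):
        # Only the backup's view of the network is impaired; SOR 2 keeps running.
        dr.observe_peer(heartbeat_received=link_ok)
    return [i.name for i in (sor, dr) if i.role == "primary"]
```

With a healthy link, simulate(True) leaves SOR 2 as the only primary. With the backup's link impaired, simulate(False) returns both names, even though SOR 2 never failed.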
The second example involves a partial loss of market data. FIG. 2 illustrates a multicast network environment 30 that includes a basic ticker plant 32, 34, such as that offered under the tradename Wombat or alternatively a custom application, publishing to a number of consuming agents 40. In this hypothetical market data environment, the publishing responsibilities are segmented into two parts: a first source 32 for securities having symbols beginning with A-K, and a second source 34 for securities having symbols beginning with L-Z. Destination applications 40 App 3 and App 4 are not receiving source data from the first publisher 32 due to a malfunction of an intermediate node 36, indicated by the corresponding dashed arrow. Such a failure is typically very difficult to detect because market data is still flowing and only a limited letter range is adversely affected. Since the letter ranges and ‘splits’ are often rebalanced to manage market volumes, manually configured monitoring is unmanageable because of the frequency and scale of these changes.
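The difficulty of detecting the failure of FIG. 2 may be illustrated with the following sketch, in which a simple "is any data arriving" check passes even though an entire letter range is silent. The ticker symbols, labels and function names are hypothetical and serve only to illustrate the A-K / L-Z split described above.

```python
SOURCE_32 = "source 32 (A-K)"
SOURCE_34 = "source 34 (L-Z)"


def publisher_for(symbol: str) -> str:
    """Map a symbol to its publisher using the A-K / L-Z split."""
    first = symbol[0].upper()
    return SOURCE_32 if "A" <= first <= "K" else SOURCE_34


def received_by_app(published: list, reachable: set) -> list:
    """Return the symbols an application sees, given which sources can reach it."""
    return [s for s in published if publisher_for(s) in reachable]


def silent_sources(published: list, received: list) -> set:
    """Return the sources that published data of which none arrived."""
    sent = {publisher_for(s) for s in published}
    seen = {publisher_for(s) for s in received}
    return sent - seen


published = ["AAPL", "IBM", "MSFT", "XOM"]  # hypothetical symbols
app3 = received_by_app(published, reachable={SOURCE_34})  # node 36 has failed
data_is_flowing = bool(app3)  # a naive health check still passes
```

Here data_is_flowing is true because App 3 continues to receive the L-Z symbols, whereas silent_sources(published, app3) identifies the first source 32 as missing. The per-range check depends on knowing the current split, which is why frequent rebalancing makes manual configuration impractical.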
The current solutions on the market for detecting the above issues have difficulty scaling without adding replication burdens to the network and incurring considerable costs. Typically, specialty network monitoring hardware, such as numerous Corvil or NetScout probes, for example, is required in addition to added replication on the network components themselves. It is therefore desirable to have methods and related systems that can monitor availability of one or more nodes in a multicast network environment in a robust manner without unduly burdening the network or introducing excessive costs.