Distributed communications environments typically include a plurality of nodes connected via an interconnection network. In order to establish communication between the nodes, the network is explored and the various elements of the network are initialized.
In one example, a node of the network is selected to be responsible for conducting the exploration and initialization. That node, referred to herein as the explorer node, typically attempts to establish communication with the network elements by sending, for instance, initialization or request-status packets. If there is no response from an element, then the explorer must decide whether to retry or to give up and consider the element, or the path to that element, defective.
Often, the exploration process is conducted on a live system (i.e., the system being re-initialized), which tends to complicate the process by introducing other variables, such as network congestion. Thus, the cause of a delayed or missing response is generally unknown. That is, it is not known whether the packet has been lost, misrouted or delayed by other network traffic; whether there is a defect in the path between the explorer and the target element (i.e., the element to be initialized); or whether the target element is itself defective.
Previously, a retry protocol has been used, which waits a predefined amount of time and then resends the packet a set number of times (each time waiting the predefined amount of time for a response) before giving up. This technique is, however, error-prone for a number of reasons. First, if the network is simply congested, then the explorer will eventually receive several responses, which it must handle. These responses may not return until well after the explorer has moved on, at which point the explorer must be able to recognize them as duplicates from a prior exploration and discard them. Second, the fault may not lie in the target element, but may instead lie somewhere between the explorer and the target element; the retry protocol cannot distinguish between these two cases. Third, the target may have discarded the packet because it was busy, and the retry could also be discarded if it arrives at the target immediately after the original.
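By way of illustration only, the fixed-interval retry protocol described above can be sketched as a small simulation. The names used here (RetryResult, fixed_retry, delays, timeout, max_attempts) are hypothetical and do not come from any particular implementation. The model assumes that each packet is sent at the start of its waiting period and that a response, if any, arrives a given delay after its packet was sent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetryResult:
    responded: bool   # True if a response arrived before the explorer gave up
    attempts: int     # number of packets the explorer sent
    duplicates: int   # further responses the explorer must later recognize and discard


def fixed_retry(delays: List[Optional[float]], timeout: float,
                max_attempts: int) -> RetryResult:
    """Simulate the fixed-interval retry protocol (illustrative model only).

    delays[i] is the assumed response delay for the i-th packet,
    or None if no response to that packet ever arrives.
    """
    arrivals: List[float] = []  # absolute arrival times of responses in flight
    for attempt in range(max_attempts):
        sent_at = attempt * timeout
        deadline = sent_at + timeout
        delay = delays[attempt]
        if delay is not None:
            arrivals.append(sent_at + delay)
        # The explorer accepts the first response that arrives within the wait.
        if any(t <= deadline for t in arrivals):
            # Every other response to an already-sent packet is a duplicate.
            return RetryResult(True, attempt + 1, len(arrivals) - 1)
    # The explorer gives up; any response still in flight arrives afterwards.
    return RetryResult(False, max_attempts, len(arrivals))


# Congested network: each response takes 25 time units, the wait is 10.
congested = fixed_retry([25, 25, 25], timeout=10, max_attempts=3)

# Defective target and defective path: in both cases no response ever arrives.
dead_target = fixed_retry([None, None, None], timeout=10, max_attempts=3)
broken_path = fixed_retry([None, None, None], timeout=10, max_attempts=3)
```

In this sketch, the congested case succeeds only on the third attempt and leaves two duplicate responses to be discarded, while the defective-target and defective-path cases yield identical results, which reflects the inability of the retry protocol to tell the two apart.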
Thus, a need still exists for an exploration capability which overcomes the deficiencies of the previous retry protocol. In particular, a need exists for a technique that accurately identifies one or more faulty network components that resulted in one or more failures being encountered during the exploration of a network.