Contemporary high-speed routers are designed with multiple linecards that have separate control and data “planes.” A data plane includes, for example, devices on a linecard that actually move data traffic, while a control plane handles control traffic. The data plane and the control plane include physically separate devices that can, however, communicate with each other if need be.
According to conventional art, “keep-alive” packets are exchanged between routing devices (e.g., linecards mounted in routers or switches) in order to verify the integrity and availability of the network. If, for example, a routing device does not receive keep-alive packets from another routing device for a period of time (the “timeout” interval), the first routing device presumes that the other routing device is out-of-service. Consequently, the first routing device implements routing protocols that reconfigure its routing tables so that the out-of-service routing device is bypassed. During the time it takes to implement the routing protocols and populate its routing tables, the availability of the first routing device is negatively affected. Moreover, this effect is experienced by other routing devices that also need to reconfigure their routing tables. In essence, the network needs to reconverge, finding new paths that bypass the out-of-service routing device. Thus, the effect of the out-of-service routing device can propagate through the network, turning a local failure into a network-wide event.
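The keep-alive behavior described above can be illustrated with a minimal sketch. It is not drawn from any particular router implementation: the class names `KeepAliveMonitor` and `RoutingTable`, the timeout value, and the representation of routes as lists of next hops are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KeepAliveMonitor:
    """Tracks when a keep-alive packet was last received from each peer (hypothetical)."""
    timeout: float  # the "timeout" interval, in seconds (illustrative value)
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_keepalive(self, peer: str, now: float) -> None:
        # Called whenever a keep-alive packet arrives from a peer.
        self.last_seen[peer] = now

    def expired_peers(self, now: float) -> List[str]:
        # Peers that have been silent for longer than the timeout interval
        # are presumed to be out-of-service.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


@dataclass
class RoutingTable:
    """Maps each destination to next hops in preference order (assumed layout)."""
    routes: Dict[str, List[str]] = field(default_factory=dict)

    def bypass(self, dead_peer: str) -> List[str]:
        # Remove the presumed-dead peer from every route and report which
        # destinations had to be reconfigured.
        changed = []
        for dest, hops in self.routes.items():
            if dead_peer in hops:
                hops.remove(dead_peer)
                changed.append(dest)
        return changed
```

In this sketch, every destination returned by `bypass` represents a route that must be recomputed; in a real network, neighboring devices would have to perform similar work, which is how a local failure spreads into a network-wide reconvergence.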
To improve reliability, higher-end routers/switches are equipped with a redundant (or standby) linecard for each primary (or active) linecard. Should the control plane of the primary linecard fail, for example, then a switch can be made to the redundant linecard. Some routers/switches support “hot standby” operation, in which the routing tables of the redundant linecard are updated when the routing tables of the primary linecard are updated, so that the routing tables of the two linecards are identical. Without hot standby operation, time and data can be lost while the routing tables of the redundant linecard are populated.
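Hot standby operation, as described above, can be sketched as follows. The `Linecard` and `HotStandbyPair` classes and their method names are hypothetical; the sketch only assumes that every routing-table update applied to the primary linecard is applied to the redundant linecard at the same time.

```python
class Linecard:
    """A linecard reduced to a name and a routing table (illustrative only)."""

    def __init__(self, name: str):
        self.name = name
        self.routes = {}  # destination -> next hop


class HotStandbyPair:
    """Keeps the redundant linecard's routing table identical to the primary's."""

    def __init__(self, primary: Linecard, standby: Linecard):
        self.active = primary
        self.standby = standby

    def update_route(self, dest: str, next_hop: str) -> None:
        # Mirror every update so that the two tables never diverge.
        self.active.routes[dest] = next_hop
        self.standby.routes[dest] = next_hop

    def withdraw_route(self, dest: str) -> None:
        # Withdrawals are mirrored as well.
        self.active.routes.pop(dest, None)
        self.standby.routes.pop(dest, None)

    def switch_over(self) -> Linecard:
        # The standby already holds a populated table, so no time is spent
        # repopulating routes after the switch.
        self.active, self.standby = self.standby, self.active
        return self.active
```

Without the mirroring in `update_route`, the linecard returned by `switch_over` would begin with an empty table, which corresponds to the loss of time and data noted above.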
Routers/switches may also be equipped with online diagnostic capabilities, allowing them to run diagnostic tests and monitor the “health” of their linecards during operation. If, for example, a malfunction of some sort is suspected on a linecard, the diagnostics for the devices on the linecard can be reviewed. If the diagnostics indicate that a device is not functioning properly, then an attempt can be made to reset that device.
Thus, a higher-end router/switch can respond to a potential problem with a linecard in the following manner. If the router/switch is equipped with online diagnostic capabilities, then it may be possible to identify a device on the linecard that is not functioning properly and reset that device. If the router/switch is not so equipped, or if the reset attempt is not successful, then a switch to a redundant linecard can be made. Switching to another linecard is facilitated when hot standby is supported, as described above. Note that if hot standby is supported, it may be better to simply switch linecards even if the router/switch is equipped with online diagnostic capability.
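The conventional response sequence described above can be summarized in a short sketch. The function name, its parameters, and the action strings are hypothetical. The sketch also assumes that no action is taken when diagnostics are available but report no faulty device, a case the description above does not address.

```python
from typing import Callable, Dict, List, Optional

SWITCH = "switch_to_redundant_linecard"


def respond_to_linecard_problem(
    diagnostics: Optional[Dict[str, bool]],
    try_reset: Callable[[str], bool],
    prefer_switch: bool = False,
) -> List[str]:
    """Return the actions taken in response to a suspected linecard problem.

    diagnostics   -- device name -> True if healthy; None if the router/switch
                     has no online diagnostic capability
    try_reset     -- attempts to reset a device; returns True on success
    prefer_switch -- set when hot standby makes switching the better choice
    """
    # Not equipped with diagnostics, or hot standby makes switching preferable.
    if diagnostics is None or prefer_switch:
        return [SWITCH]

    actions: List[str] = []
    for device, healthy in diagnostics.items():
        if healthy:
            continue
        if try_reset(device):
            actions.append(f"reset:{device}")
        else:
            # An unsuccessful reset falls back to the redundant linecard.
            actions.append(f"reset_failed:{device}")
            actions.append(SWITCH)
            return actions
    # Assumption: if nothing is flagged as faulty, no action is taken.
    return actions
```

Note that every branch of this sketch is triggered only after a problem has already been suspected; nothing in the sequence detects a failure that is confined to the data plane.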
A shortcoming of the conventional art is that there is no mechanism available for identifying a defect or failure that is local to the data plane on a linecard. Currently, a defect or failure in the data plane remains unidentified until a downstream person or device recognizes that expected data is not being received. This approach is unsatisfactory because it fails to localize the failure; that is, the failure may have occurred in any one of the many upstream network devices. Also, by the time the problem is recognized and its cause pinpointed, too much time has passed: not only may data continue to be lost, but routing protocols may time out, as described above.
Alternatively, as mentioned above, the absence of keep-alive packets can also be used to indicate a potential router problem. While this approach may be helpful in localizing the cause of a failure, it does not adequately address the time issue. That is, by the time the absence of keep-alive packets identifies a malfunctioning linecard, data can be lost and routing protocols may time out.
Accordingly, a method and/or system that can more quickly identify a problem in the data plane of a routing device, more quickly localize the source of the problem, and attempt to resolve it soon enough to avoid network reconvergence, would be advantageous. The present invention is a novel solution that provides these advantages.