In a packet processing system comprising a cluster of identical packet processing elements with 1-to-many (1:N) redundancy, the load must be distributed across the elements by an entity separate from the cluster. Because this entity is conceptually located hierarchically “above” the cluster, it is often referred to as a load balancing tier or level. When one of the cluster elements fails, the load formerly handled by the failed element must be redirected to one or more backup elements. This process or feature is herein referred to as “failover capability”. During the interval between the failure of one cluster element and the redirection of load to one or more backup elements, referred to herein as the “failover response period,” packets formerly processed by the failed cluster element may be lost or dropped. Thus, in order to reduce or eliminate lost or dropped packets, it is desirable to reduce or eliminate the failover response period.
One approach used by conventional implementations of element clusters is to rely on an external layer 2 switch to effect failover, by directing all traffic to one or the other of two separate chassis, i.e., the primary chassis and the secondary chassis, each of which may contain multiple processing cards operating as a cluster of elements. Under normal operation, the layer 2 switch directs all traffic to the primary chassis. Upon detection of a failure of the primary chassis, the layer 2 switch will then direct all traffic to the secondary chassis.
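The all-or-nothing behavior of this conventional arrangement can be illustrated with a minimal sketch. The class and method names below (`Layer2Switch`, `forward`, `on_primary_failure_detected`) are hypothetical and chosen only for illustration; the sketch models the switch's choice of chassis, not any real switch's interface.

```python
from dataclasses import dataclass


@dataclass
class Layer2Switch:
    """Toy model (hypothetical): the switch sends all traffic to one chassis."""
    active: str = "primary"

    def forward(self, packet: str) -> tuple[str, str]:
        # Every packet goes to whichever chassis is currently active.
        return (self.active, packet)

    def on_primary_failure_detected(self) -> None:
        # Failover: from now on, direct all traffic to the secondary chassis.
        self.active = "secondary"


switch = Layer2Switch()
before = switch.forward("pkt-1")      # normal operation
switch.on_primary_failure_detected()  # failure of the primary is detected
after = switch.forward("pkt-2")       # traffic now reaches the secondary
print(before, after)
```

In this model, any packet forwarded after the primary chassis has failed but before `on_primary_failure_detected` is called is still sent to the failed chassis, which is where packets are lost.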
However, this conventional failover process is relatively time consuming. For example, the failure of the primary chassis must be detected either by the secondary chassis or by another entity, which must then notify the secondary chassis of the primary's failure. In one conventional implementation, the primary and secondary chassis are organized in a high-availability configuration in which both chassis share a common Internet protocol (IP) address, but only one of the two will respond to messages directed to that address. In this configuration, the layer 2 switch may regularly broadcast address resolution protocol (ARP) messages that include the shared IP address, which are received by both the primary and the secondary chassis. In normal operation, only the primary chassis will respond to the ARP message, identifying itself to the layer 2 switch as the entity that responds to the shared IP address. Once the secondary chassis detects or is informed that the primary has failed, the secondary chassis issues a gratuitous ARP message to the layer 2 switch to identify itself as the entity now responding to the shared IP address. In response to receiving the gratuitous ARP from the secondary chassis, the layer 2 switch will update its forwarding table. The process of receiving the gratuitous ARP from the secondary chassis and updating the forwarding table of the layer 2 switch can take multiple seconds, during which time hundreds or thousands of packets may be lost. For example, a 100 Mbit/s Ethernet connection running at 50% utilization could transport approximately 4,200 1500-byte packets per second; a 3 second failover response period could therefore cause approximately 12,600 packets to be lost or dropped.
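The arithmetic behind this example can be reproduced with a short sketch. The function names are hypothetical, and the calculation assumes that every packet is exactly 1500 bytes and ignores Ethernet framing overhead such as the preamble and inter-frame gap. Without rounding, the rate is about 4,167 packets per second and the loss over 3 seconds is about 12,500 packets; the figure of approximately 12,600 follows from multiplying the rounded rate of 4,200 by 3.

```python
def packets_per_second(link_bps: float, utilization: float, packet_bytes: int) -> float:
    """Packets carried per second at the given link utilization."""
    return link_bps * utilization / (packet_bytes * 8)


def packets_lost(rate_pps: float, failover_seconds: float) -> float:
    """Packets that arrive during the failover response period."""
    return rate_pps * failover_seconds


# 100 Mbit/s link, 50% utilization, 1500-byte packets, 3 second failover.
rate = packets_per_second(100e6, 0.50, 1500)
lost = packets_lost(rate, 3.0)
print(f"{rate:.0f} packets/s, {lost:.0f} packets lost in 3 s")
```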
The steps performed by conventional systems and described above occur during the failover process. Thus, these steps determine the failover response period. The disadvantage of the conventional systems described above is that these steps are time consuming: the longer the process takes, the longer the failover response period, and the more packets are likely to be lost or dropped.
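As a rough sketch of this relationship, the failover response period can be modeled as the sum of the durations of the individual steps, under the assumption that the steps run strictly one after another. The step durations below are hypothetical placeholders, chosen only so that they add up to the 3 second example; they are not measurements of any system.

```python
# Hypothetical per-step durations in seconds; illustrative values only.
STEP_SECONDS = {
    "detect failure of primary": 1.0,
    "notify secondary": 0.5,
    "secondary sends gratuitous ARP": 0.5,
    "switch updates its table": 1.0,
}


def failover_response_period(step_seconds: dict[str, float]) -> float:
    """Assumes the steps are sequential, so the period is their sum."""
    return sum(step_seconds.values())


def expected_loss(rate_pps: float, period_s: float) -> float:
    """Packets arriving during the period are lost or dropped."""
    return rate_pps * period_s


RATE_PPS = 4200  # approximate packet rate from the example above
period = failover_response_period(STEP_SECONDS)
print(period, expected_loss(RATE_PPS, period))

# Shortening any single step shortens the period and the loss with it.
faster = {**STEP_SECONDS, "detect failure of primary": 0.1}
faster_period = failover_response_period(faster)
print(faster_period, expected_loss(RATE_PPS, faster_period))
```

Under these assumptions the loss grows linearly with the period, so eliminating lost packets requires driving the total of all steps toward zero rather than speeding up only one of them.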
Accordingly, there exists a need for instantaneous failover of packet processing elements in a network, so that the failover response period, and with it the number of lost or dropped packets, is reduced or eliminated.