1. Field of the Invention
The present invention relates to data communications, and, more particularly, to a method and apparatus for improved reliability in a network element.
2. Description of the Related Art
As more and more information is transferred over today's networks, businesses have come to rely heavily on their network infrastructure in providing their customers with timely service and information. Failures in such network infrastructures can be costly both in terms of lost revenue and idled employees. Thus, high reliability systems are becoming increasingly attractive to users of networking equipment.
For example, a common component in such networking environments is a switch, which directs network traffic between its ports. In an Ethernet network employing an OSI protocol stack, such a network element operates on Layer 2, the Data Link Layer. This layer handles transferring data packets onto and off the physical layer, error detection and correction, and retransmission. Layer 2 is generally broken into two sub-layers: the LLC (Logical Link Control) on the upper half, which performs error checking, and the MAC (Medium Access Control) on the lower half, which handles transferring data packets onto and off the physical layer. The ability to provide improved reliability in such a network element is important because such elements are quite common in networks. Providing the desired reliability, while not an insurmountable task, is made more difficult by the need to keep the cost of such network elements as low as possible. Thus, several approaches may be attempted.
For example, consider a switch in which a set of ports are connected to a switching matrix through a network processor or forwarding engine (FE). Assume that the FE's failure rate is unacceptably high (i.e., the FE has a low Mean Time Between Failures, or MTBF) and that its recovery time is too long to meet the availability target for the switch. Several approaches can be used to address this problem.
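The relationship between MTBF, recovery time and availability can be illustrated with the standard steady-state formula, availability = MTBF / (MTBF + mean time to recover). The following is a minimal sketch only; the function name and the numeric values are hypothetical and are not taken from any particular switch.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time an element is in service.

    mtbf_hours: Mean Time Between Failures, in hours.
    mttr_hours: mean time to recover from a failure, in hours.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and recovery time non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
slow_recovery = availability(mtbf_hours=10_000.0, mttr_hours=4.0)
fast_recovery = availability(mtbf_hours=10_000.0, mttr_hours=0.01)
print(f"slow recovery: {slow_recovery:.6f}")
print(f"fast recovery: {fast_recovery:.6f}")
```

As the sketch suggests, availability can be raised either by increasing the MTBF of the FE or by shortening the recovery time, for example through fast switchover to a backup FE.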
The first such approach is to employ dual redundancy for all systems in the switch (port cards, route processor, FEs and so on). Thus, for each of the switch's ports, there is both a main FE and a backup FE, the latter being available in case the port's main FE fails. Switchover to a backup FE is accomplished quickly, as the backup FE's configuration can be made to mirror that of the main FE. However, such an arrangement doubles the number of FEs in a given switch when compared to a switch having a one-to-one ratio of FEs to ports. Thus, the high availability requirement nearly doubles the cost of the switch.
Another approach is to use a high-MTBF optical cross-connect (OXC) at the switch's input. Using this arrangement, in case of failure of an FE, the OXC redirects traffic from a port attached to the failed FE to another port (attached to an operational FE). This simplifies the internal structure and operation of the switch, offloading the transfer of data streams from failed FEs to operational FEs. Unfortunately, a high-MTBF OXC is required to provide the desired reliability, and, because such an OXC is expensive, this approach unacceptably increases the cost of the switch.
A third approach is to employ dual redundant FEs. As in most switches, FEs are coupled to a switching matrix, to which the FEs couple data streams from their respective ports. However, in an arrangement such as this, two FEs (an FE pair) are coupled between one set of physical ports (referred to as a PPset) and one matrix port. Each FE in such a configuration is capable of servicing the FE pair's respective PPset. At any time only one FE need be active. In case of the failure of one FE, the other FE is called into service to process traffic. The two FEs are coupled through a small amount of electronics to the PPset and switch matrix. Thus, as with the first solution, the configuration of alternate FEs (which, in fact, could be either FE of an FE pair) is made to mirror that of the active FE, allowing fast switchover.
However, this solution entails some limitations. First, the electronics coupling the FEs to their respective ports and the switch matrix either are not redundant, or incur the extra cost that attends such redundancy. Even if this circuitry is instead designed to have a high MTBF, increased cost is the likely result. Moreover, the additional FE for each PPset doubles the FE cost in comparison to having a single FE associated with each PPset. Thus, the costs for such a switch will also be significantly increased when compared to a non-redundant architecture.
Lastly, a switch can be configured with one or more backup FEs using a central mux/demux. This allows 1:N redundancy in the typical fashion, in which the port card connected to the failed forwarding engine is redirected directly to a backup forwarding engine. Such a central mux/demux is inserted between the switch's ports and FEs. In this scenario, the FEs are connected to M switch matrix ports and to N PPsets through an N:M mux/demux, where M is greater than N and the number of backup FEs is M−N. At any time only N FEs need to be active to service all the PPsets in the system. Thus, when all FEs are in working order, there is a total of M−N backup FEs in the system. If any FE in the system fails, the system is reconfigured so that one of the backup FEs takes over for the failed FE by having the affected port card directly coupled to the backup FE. Such an approach is very efficient because the only switchover performed is coupling the affected port card(s) to the backup FEs.
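The reconfiguration performed by such a central mux/demux can be sketched as a simple mapping from PPsets to FEs. The following is an illustrative model only, assuming that PPset i is initially coupled to FE i and that the remaining M−N FEs are idle backups; the class and method names are hypothetical and do not describe any particular implementation.

```python
class CentralMuxDemux:
    """Toy model of an N:M mux/demux: N PPsets, M FEs, M - N backup FEs."""

    def __init__(self, n_ppsets: int, m_fes: int) -> None:
        if n_ppsets < 1 or m_fes <= n_ppsets:
            raise ValueError("need N >= 1 and M > N so that backups exist")
        self.m_fes = m_fes
        # Assumption: PPset i starts on FE i; FEs N..M-1 start as idle backups.
        self.mapping = {pp: pp for pp in range(n_ppsets)}
        self.backups = list(range(n_ppsets, m_fes))
        self.failed = set()

    def fail_fe(self, fe: int) -> None:
        """Record an FE failure; re-couple the affected PPset to a backup FE."""
        if not 0 <= fe < self.m_fes:
            raise ValueError("unknown FE")
        if fe in self.failed:
            return  # failure already handled
        self.failed.add(fe)
        if fe in self.backups:
            self.backups.remove(fe)  # an idle backup failed; no traffic moves
            return
        for pp, active_fe in self.mapping.items():
            if active_fe == fe:
                if not self.backups:
                    raise RuntimeError("no backup FE available")
                # The only switchover: couple the affected PPset to a backup.
                self.mapping[pp] = self.backups.pop(0)
                return


mux = CentralMuxDemux(n_ppsets=4, m_fes=5)
mux.fail_fe(2)
print(mux.mapping)  # PPset 2 is now served by backup FE 4
```

Note that only the entry for the affected PPset changes; the PPsets served by operational FEs are left undisturbed, which is the source of the efficiency noted above.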
Unfortunately, such an alternative also encounters some obstacles. First, the central mux/demux once again adds significantly to the cost of a switch designed in this manner. Moreover, the central mux/demux introduces circuitry into the switch that may itself fail, thereby degrading reliability. If the central mux/demux is implemented using high-MTBF circuitry (whether by redundancy or other method) to address this concern, the cost of such a switch will once again be increased.
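The FE counts implied by the foregoing approaches can be compared with a short calculation. This sketch counts FEs only, under the assumption of one active FE per port or PPset; it deliberately ignores the cost of the coupling electronics, the OXC and the central mux/demux discussed above. The function and approach names are hypothetical labels chosen for illustration.

```python
def fe_count(approach: str, n_units: int, backups: int = 1) -> int:
    """Number of FEs needed to serve n_units ports (or PPsets).

    approach: "non_redundant", "dual_redundant", "fe_pair" or "one_to_n".
    backups:  number of backup FEs (M - N); used only for "one_to_n".
    """
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    if approach == "non_redundant":
        return n_units                 # one FE per port or PPset
    if approach in ("dual_redundant", "fe_pair"):
        return 2 * n_units             # a main FE and a backup FE per unit
    if approach == "one_to_n":
        if backups < 1:
            raise ValueError("at least one backup FE is required")
        return n_units + backups       # M = N + (M - N)
    raise ValueError(f"unknown approach: {approach}")


for name in ("non_redundant", "dual_redundant", "fe_pair", "one_to_n"):
    print(name, fe_count(name, n_units=16, backups=2))
```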
As is apparent from the preceding discussion, while providing high availability is certainly possible, providing such reliability in a cost-effective manner is challenging. As with most engineering problems, a solution that is not commercially reasonable offers no real benefit to users (and so none to manufacturers). What is therefore needed is a way to provide for the reliable conveyance of data streams in an economically reasonable fashion.