Fault detection deals with mechanisms that can detect both hard failures, such as link and node failures, and soft failures, such as software failures, memory corruption, and misconfiguration. Typically, a lightweight protocol is desirable to detect the fault and to verify it along the data path before taking steps to isolate the fault to a given node or link (i.e., to diagnose the fault). Therefore, a fault isolation mechanism is also needed for fault management.
The problem of detecting hardware and software failures in a multipoint communications network, or in a distributed computing system, is very difficult to solve. By way of background, failure mechanisms for various network topologies and a proposed solution for a communications network are described in U.S. Pat. No. 6,732,189 entitled “Method and Apparatus for Fault Tolerant Tunneling of Multicast Datagrams”. U.S. Pat. No. 6,668,282 entitled “System and Method to Monitor and Determine if an Active IPSec Tunnel Has Become Disabled” teaches a technique for determining when communications through an Internet Protocol Security (IPSec) tunnel have failed, and steps for isolating the problem so that it can be resolved.
Fault detection schemes for traditional wide area networks (WANs) such as Frame Relay (FR) and asynchronous transfer mode (ATM) networks are known in the prior art. For example, ATM networks commonly utilize a standard continuity check mechanism to detect hardware failures in the communications network with point-to-point connectivity. More difficult is the problem of resolving hardware and software failures in a multipoint communication network that allows each customer edge (CE) device or node to communicate directly and independently with all other CE devices in the same service instance via a single Attachment Circuit (AC) to the network. In a multipoint network, there are many paths that packet data units (PDUs) can travel.
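By way of illustration only, the general idea behind a continuity check on a point-to-point connection can be sketched as follows: each endpoint expects a periodic check message from its peer and declares a loss of continuity when none has arrived within a timeout. The class name `ContinuityMonitor`, the one-second interval, and the 3.5x loss multiplier are assumptions chosen for this sketch and are not taken from any particular standard or from the references cited above.

```python
from typing import Dict, List


class ContinuityMonitor:
    """Toy continuity-check receiver (hypothetical, for illustration only)."""

    def __init__(self, interval: float, loss_multiplier: float = 3.5) -> None:
        # A peer is considered lost if silent for longer than this many seconds.
        # Both the interval and the multiplier are assumed values.
        self.timeout = interval * loss_multiplier
        # Maps peer identifier -> time the last check message was received.
        self.last_heard: Dict[str, float] = {}

    def on_check_received(self, peer: str, now: float) -> None:
        """Record the arrival of a continuity-check message from a peer."""
        self.last_heard[peer] = now

    def failed_peers(self, now: float) -> List[str]:
        """Return the peers whose check messages have stopped arriving."""
        return sorted(
            peer
            for peer, heard in self.last_heard.items()
            if now - heard > self.timeout
        )


monitor = ContinuityMonitor(interval=1.0)
monitor.on_check_received("peer-A", now=0.0)
monitor.on_check_received("peer-B", now=0.0)
monitor.on_check_received("peer-A", now=3.0)  # peer-B has gone silent
lost = monitor.failed_peers(now=4.0)
```

In a point-to-point arrangement one such timer per connection suffices. In a multipoint service, every endpoint would have to track every other endpoint in the same service instance, which hints at why the multipoint case is harder.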
Ethernet is a Media Access Control (MAC) layer network communications protocol specified by the Institute of Electrical and Electronics Engineers (IEEE) in IEEE specification 802.3 (the “802.3 specification”). Ethernet switched campus networks are an example of a multipoint service architecture. In the past, Ethernet has been widely deployed in Local Area Networks (LANs). Today, Ethernet is migrating from LANs to metropolitan-area networks (MANs) and is becoming increasingly attractive to metro service providers (MSPs) because of its simplicity, flexibility, low cost, and quick time to service. From the standpoint of fault management, however, an Ethernet network poses an especially difficult problem because the MAC addresses that indicate the path that data packets travel get “aged out” after a predetermined time interval (e.g., five minutes). In other words, the very information that is most useful for isolating faults in a multipoint network is transient by nature of the Ethernet protocol. Further complicating the problem is the fact that Ethernet services can be offered over a variety of transport mechanisms such as Ethernet PHY (802.3), SONET, ATM, FR, and multi-protocol label switching (MPLS)/Internet Protocol (IP). For example, an end-to-end Ethernet service for a customer can be offered over an Ethernet access network (an 802.1ad provider bridge network) on one side and an MPLS/IP access network on the other side.
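The transient nature of learned MAC addresses can be illustrated with a minimal sketch of a learning table with aging. The class `MacTable`, its method names, and the example MAC address are hypothetical and are introduced here only for illustration; the 300-second default mirrors the five-minute example interval mentioned above and is not a statement about any particular switch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

AGING_SECONDS = 300.0  # assumed value: the five-minute example interval


@dataclass
class MacTable:
    """Toy MAC learning table with aging (hypothetical, for illustration only)."""

    aging_seconds: float = AGING_SECONDS
    # Maps MAC address -> (port the address was learned on, time last seen).
    entries: Dict[str, Tuple[int, float]] = field(default_factory=dict)

    def learn(self, mac: str, port: int, now: float) -> None:
        """Record (or refresh) the port on which a source MAC was seen."""
        self.entries[mac] = (port, now)

    def lookup(self, mac: str, now: float) -> Optional[int]:
        """Return the learned port, or None if unknown or aged out."""
        entry = self.entries.get(mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen >= self.aging_seconds:
            # The entry has aged out: the path information is gone.
            del self.entries[mac]
            return None
        return port


table = MacTable()
table.learn("00:11:22:33:44:55", port=3, now=0.0)
port_while_fresh = table.lookup("00:11:22:33:44:55", now=100.0)
port_after_aging = table.lookup("00:11:22:33:44:55", now=301.0)
```

Once the entry ages out, the table no longer records which port led toward the address, so a fault occurring after a period of silence cannot be traced from this state alone.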
Despite the problems inherent in providing a fault management mechanism (including fault detection) in carrier-class Ethernet services, MSPs still demand that Ethernet Virtual Connections (EVCs), whether point-to-point or multipoint, be protected by the same degree of fault management as existing ATM or FR virtual connections. Therefore, it is important to be able to detect and accurately isolate faults for any given Ethernet VC (or Service Instance) over any given transport type. Unfortunately, there are no existing solutions to the problem of fault management for Metro Ethernet (multipoint) services.