Telecommunication service providers typically measure equipment High Availability (HA) as the percentage of time per year that equipment provides full service. When calculating system downtime, service providers include hardware outages, software upgrades, software failures, etc. Typical availability requirements placed on equipment vendors are 99.999% (“5”-nines availability), which translates into about 0.001% system downtime per year (~5.25 min per year), and 99.9999% (“6”-nines availability), which translates into about 0.0001% system downtime per year (~31 sec per year). For highly sensitive applications, 1+1 redundancy (one redundant (standby) device for each active device) is typically implemented in an attempt to protect the service provider from both hardware and software failures. To allow for cost savings, N+1 redundancy schemes (one redundant (standby) device for every N active devices) are often also used. The standby equipment replicates the corresponding active equipment.
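The downtime figures quoted for each availability level follow from simple arithmetic. The following is a minimal sketch of that conversion, assuming a 365-day year; the function and variable names are illustrative only and are not taken from any product.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s, assuming a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for a given availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


five_nines_min = downtime_seconds_per_year(99.999) / 60   # about 5.26 minutes
six_nines_sec = downtime_seconds_per_year(99.9999)        # about 31.5 seconds
```

Under this assumption, “5”-nines allows roughly five and a quarter minutes of downtime per year and “6”-nines allows roughly half a minute.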
Real-time embedded system software is organized as multiple Cooperating Application Processes (CAPs), each handling one of a number of functional components, such as: 1) Networking protocols, including, e.g., mobile IP (MIP), Layer 2 bridging (spanning tree protocol (STP), generic attribute registration protocol (GARP), GARP virtual LAN (VLAN) registration protocol (GVRP)), routing/multi-protocol label switching (MPLS), call processing, and mobility management, etc.; 2) Hardware forwarding plane management (e.g., interfaces, link state, switch fabric, flow setup, etc.); and 3) Operations, administration, and maintenance (OA&M), e.g., configuration and fault/error management, etc. To provide end-to-end services, a network provider has to configure multiple network nodes. Each of these nodes is an embedded system and has embedded application software implemented as CAPs.
FIG. 1A illustrates a portion of a known 1+1 redundancy network in which data is routed through various nodes A, B, C, and D, where each node includes various combinations of different CAPs. As shown, B provides 1+1 redundancy for A and D provides 1+1 redundancy for C. At any given time, either A or B is active, but not both. At any given time either C or D is active, but not both.
FIG. 1B illustrates a portion of a known N+1 redundancy network in which data is routed through various nodes A, B, C, and D, where each node includes various combinations of different CAPs. As shown, D provides N+1 redundancy for A, B and C. If A, B or C goes down, traffic will go through D.
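The failover behavior of the 1+1 and N+1 arrangements can be sketched as a single shared-standby group, with 1+1 as the special case N = 1. The class below is a hypothetical illustration, not a description of any particular product. In particular, the handling of a second failure while the standby is already in use is an assumption, since the arrangement described above only covers one failure at a time.

```python
class RedundancyGroup:
    """N active nodes protected by one shared standby (N = 1 gives 1+1)."""

    def __init__(self, active_nodes, standby):
        self.standby = standby
        # Map each protected node to the node currently carrying its traffic.
        self.carrier = {node: node for node in active_nodes}
        self.standby_in_use = False

    def fail(self, node):
        """Move a failed node's traffic to the standby, if it is still free."""
        if self.carrier[node] != node:
            return self.carrier[node]   # traffic was already moved or dropped
        if self.standby_in_use:
            self.carrier[node] = None   # assumption: a second failure is unprotected
        else:
            self.carrier[node] = self.standby
            self.standby_in_use = True
        return self.carrier[node]


# N+1 as in FIG. 1B: D protects A, B and C.
group = RedundancyGroup(["A", "B", "C"], standby="D")
first = group.fail("B")    # traffic for B now goes through D
second = group.fail("A")   # standby already taken, so A has no carrier
```

The sketch makes the cost trade-off visible: N+1 needs only one spare device, but it protects against only one failure in the group at a time.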
Dynamic object state information (e.g. calls, flows, interfaces, VLANs, routes, tunnels, mobility bindings, etc.), which is maintained by a software application, is distributed across multiple CAPs and across control and data planes. Each CAP manages and owns a subset of state information pertaining to the software application. The logistics of functional separation is typically dictated by product and software specific considerations. Data synchronization across CAPs is achieved via product-specific forms of Inter-Process Communication (IPC).
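The ownership model described here can be sketched as follows. This is a minimal illustration under stated assumptions: the `Cap` class and its method names are hypothetical, an in-process `queue.Queue` stands in for the product-specific IPC mechanism, and the idea that peer CAPs keep a mirrored copy of state they do not own is an assumption made for the example.

```python
import queue


class Cap:
    """A cooperating application process that owns one subset of object state."""

    def __init__(self, name, owned_kinds):
        self.name = name
        self.owned_kinds = set(owned_kinds)  # e.g. {"route"} or {"interface"}
        self.state = {}                      # (kind, key) -> value, owned or mirrored
        self.inbox = queue.Queue()           # stand-in for a product-specific IPC channel
        self.peers = []

    def update(self, kind, key, value):
        """Change state this CAP owns and notify peers over the IPC stand-in."""
        if kind not in self.owned_kinds:
            raise PermissionError(f"{self.name} does not own {kind!r} state")
        self.state[(kind, key)] = value
        for peer in self.peers:
            peer.inbox.put((kind, key, value))

    def drain_inbox(self):
        """Apply all pending updates received from the owning CAPs."""
        while not self.inbox.empty():
            kind, key, value = self.inbox.get()
            self.state[(kind, key)] = value


routing = Cap("routing", {"route"})
forwarding = Cap("forwarding", {"interface"})
routing.peers.append(forwarding)

routing.update("route", "10.0.0.0/8", "if1")  # owner changes its own state
forwarding.drain_inbox()                      # peer picks up the change via IPC
```

Only the owning CAP may modify a given kind of state; other CAPs learn of changes through messages, which is why the full picture of any one object is spread across several processes.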
Software support is critical for achieving HA in embedded systems. Hardware redundancy without software support may lead to an equipment “Cold Start” on failure, during which services may be interrupted and all the service-related dynamic persistent state data (e.g., related to active calls, routes, registrations, etc.) may be lost. The time to restore service may include a system reboot with saved configuration, re-establishment of neighbor relationships with network peers, re-establishment of active services, etc. Depending upon the amount of configuration needed, completely restoring services after a “Cold Start” often takes many minutes. Various system availability models demonstrate that a system can never achieve more than 4-nines HA (99.99% availability) when using a “Cold Start”.
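To see why a multi-minute “Cold Start” is hard to reconcile with high availability targets, the yearly downtime budget can be divided by the restoration time. The sketch below assumes a 10-minute restoration purely for illustration; that figure, the 365-day year, and the function name are assumptions and do not come from any specific availability model.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assuming a 365-day year


def outages_within_budget(availability_percent, outage_seconds):
    """How many outages of a given length fit in one year's downtime budget."""
    budget = (100.0 - availability_percent) / 100.0 * SECONDS_PER_YEAR
    return budget / outage_seconds


COLD_START_SECONDS = 10 * 60  # assumed 10-minute restoration, for illustration only

cold_at_five_nines = outages_within_budget(99.999, COLD_START_SECONDS)  # about 0.53
cold_at_four_nines = outages_within_budget(99.99, COLD_START_SECONDS)   # about 5.3
```

Under these assumptions, a single cold start already exceeds the entire yearly budget for “5”-nines, while a 4-nines budget leaves room for only a handful of such events.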
Software requirements for “6”-nines HA generally include sub-50 msec system downtime on CAP restart, software application warm start, and controlled equipment failover from Active to Standby nodes, and no more than 3-5 sec system downtime on software upgrades and uncontrolled equipment failover. The sub-50 msec requirements are often achieved via separation of the control and data planes. For example, the data plane would continue to forward traffic to support active services while the control plane would restart and synchronize the various applications.
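The separation of control and data planes can be sketched as two objects with independent state. This is a hypothetical illustration: the class names, the flow identifiers, and the choice to resynchronize the control plane from the data plane's table are all assumptions, and real systems may recover control state from other sources.

```python
class DataPlane:
    """Forwards traffic from its own flow table, independent of the control plane."""

    def __init__(self):
        self.flow_table = {}

    def install(self, flow, out_port):
        self.flow_table[flow] = out_port

    def forward(self, flow):
        return self.flow_table.get(flow)  # None means no entry for this flow


class ControlPlane:
    """Holds application state and programs the data plane."""

    def __init__(self, data_plane):
        self.data_plane = data_plane
        self.flows = {}

    def add_flow(self, flow, out_port):
        self.flows[flow] = out_port
        self.data_plane.install(flow, out_port)

    def restart(self):
        """Simulate a control plane restart: in-memory state is lost."""
        self.flows = {}

    def resync(self):
        """Assumed recovery step: rebuild state from entries the data plane kept."""
        self.flows = dict(self.data_plane.flow_table)


dp = DataPlane()
cp = ControlPlane(dp)
cp.add_flow("call-1", 3)

cp.restart()                              # control plane goes down
port_during_restart = dp.forward("call-1")  # data plane still forwards on port 3
cp.resync()                               # control plane catches up afterwards
```

Because the data plane never consults the control plane on the forwarding path, active services see no interruption while the control plane restarts.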