The Third Generation Partnership Project (3GPP) is currently creating updated specifications for core network nodes, such as a Mobility Management Entity (MME), a Packet Data Network (PDN) Gateway (PGW), and a Serving Gateway (SGW). These nodes will eventually replace the existing Gateway General Packet Radio Service (GPRS) Support Nodes (GGSNs)/Serving GPRS Support Nodes (SGSNs) used in earlier implementations of the 3GPP network for the Internet Protocol (IP) connections with mobile User Equipments (UEs).
When a UE in Long Term Evolution (LTE) powers up, an always-on IP connectivity session is established with the core network. In System Architecture Evolution (SAE), this IP connectivity is currently known as a Packet Data Network (PDN) connection, an Evolved Packet System (EPS) session, or an Evolved Packet Core (EPC) session.
Although networks are progressing towards “always on” IP connectivity with UEs, not every UE in current networks has always on IP connectivity. In contrast, in LTE networks, all UEs will have always on IP connectivity. Additionally, Circuit Switched (CS) services will not be provided on LTE accesses, except by simulation over IP. Thus, IP service is more critical and of higher volume than in current implementations. When LTE is fully deployed, PDN connections in the network will increase dramatically, with each PDN connection being critical for providing an LTE mobile service.
Core network nodes are expected to have capacities on the order of a hundred thousand to a few million PDN connections. There are several implications to these vast numbers of PDN connections. First, when a single node fails or the IP network to/from that node fails, there are potentially hundreds of thousands of PDN connections impacted. Second, due to roaming, a PGW may have different PDN connections that are involved with dozens of other peer nodes in other operator networks at any time. In a large operator's network, any core network node may have PDN connections going through hundreds of other peer nodes, even in the same network. These types of issues exist in existing GGSN/SGSN networks, although on a smaller scale and where IP services are of much less criticality. For example, when an SGSN fails in an operator's network, the failure can impact the operation of GGSN nodes and other nodes of all types for a substantial period of time.
To counteract this weakness, operators typically have IP network path redundancy to improve IP network resilience. In addition, almost all vendors try to reduce the frequency of complete node failure by providing internal hardware redundancy within a node for the components most likely to fail (e.g., power supplies, external links, etc.). However, some vendors' component redundancy might only be for availability (i.e., accepting new PDN connections) and not retainability (i.e., keeping existing PDN connections). Even in fully redundant hardware networks with retainability, the redundancy does not protect against double hardware faults or against some types of software or operator configuration faults (which are duplicated). When such a fault occurs, the affected PDN connections must be torn down on the peer nodes. If these teardowns are signaled individually, this represents a significant signaling rate required just to indicate the connections are to be torn down. Additionally, a reliable peer node is impacted by unreliable peers.
These issues exist to some degree in existing SGSN/GGSN networks. International Publication Number WO 2005/079100 discusses issues related to SGSN/GGSN resources that apply to PGW/SGW/MME resources, which must now cover a wider range of possible resources. However, there are also some other new issues that have surfaced. FIGS. 1A and 1B are signaling diagrams illustrating a power up (initial attach) procedure in LTE/SAE in an existing system. FIGS. 1A and 1B describe this attachment process as detailed in 3GPP TS 23.401. A UE 10 needs to register with the network to receive services that require registration. This registration is described as Network Attachment. The always-on IP connectivity for UE/users of the Evolved Packet System (EPS) is enabled by establishing a default EPS bearer during Network Attachment. The Policy and Charging Control (PCC) rules applied to the default EPS bearer may be predefined in a Packet Data Network Gateway (PDN GW) and activated in the attachment by the PDN GW 12 itself. The Attach procedure may trigger one or multiple Dedicated Bearer Establishment procedures to establish dedicated EPS bearer(s) for that UE. During the attach procedure, the UE may request an IP address allocation. During the Initial Attach procedure, the Mobile Equipment (ME) Identity is obtained from the UE. A Mobility Management Entity (MME) operator may check the ME Identity with an Equipment Identity Register (EIR) 14. At least in roaming situations, the MME should pass the ME Identity to a Home Subscriber Server (HSS), and, if the PDN GW is outside of the Visited Public Land Mobile Network (VPLMN), should pass the ME Identity to the PDN GW. FIGS. 1A and 1B are signaling diagrams illustrating the message signaling between the UE 10, an eNode B 16, a new MME 18, an old MME/Serving GPRS Support Node (SGSN) 20, the EIR 14, a serving Gateway (GW) 22, the PDN GW 12, a Policy and Charging Rules Function (PCRF) 24, and a Home Subscriber Server (HSS) 26.
Relevant to the present invention, the new MME 18 sends a Create Session Request message 50 to the serving GW 22. The serving GW 22 then sends the create default request message 52 to the PDN GW 12. The PCRF 24 then provides a PCRF interaction 54 to the PDN GW. The PDN GW sends a Create Session Response 56 to the serving GW. The serving GW then sends a Create Session Response 58 to the new MME 18.
FIG. 2 is a signaling diagram illustrating an attach procedure for LTE with all GPRS Tunneling Protocol (GTP). This diagram is simplified to core nodes creating the PDN connection. FIG. 2 assumes that GTPv2 (GPRS Tunneling Protocol version 2) is being used between the three depicted nodes of an MME-1 80, an SGW-1 82, and a PGW-1 84. The MME-1 sends a Create Session Request 90 to the SGW-1 82. The SGW-1 then sends a Create Session Request 92 to the PGW-1. The PGW-1 84, in turn, sends a Create Session Response 94 to the SGW-1. The SGW-1 sends a Create Session Response 96 to the MME-1.
Proxy Mobile IP (PMIP) may also be used between the PGW and the SGW. FIG. 3 is a signaling diagram illustrating an attach procedure for LTE with PMIP. The diagram is simplified to illustrate the core nodes creating the PDN connection. As depicted in FIG. 3, the nodes are an MME 100, an SGW 102, and a PGW 104. The MME sends a Create Session Request 110 to the SGW 102. The SGW then sends a Proxy Binding Update 112 to the PGW 104. The PGW responds by sending a Proxy Binding Accept 114 to the SGW. The SGW, in turn, sends a Create Session Response 116 to the MME.
There are several important features related to FIGS. 2 and 3. Each stable PDN connection exists only on one PGW, one MME and one SGW at a time (during a handover there can be more than one MME or one SGW for a short period of time). The PDN connection on the PGW cannot be moved without tearing the existing PDN connection down. The MME has stored critical information not available elsewhere, such as the UE's current tracking area list. The MME is the only node that actually initiates a mobility procedure (i.e., moving a PDN connection to another SGW or MME).
The above features imply that if an MME fails, the PDN connections on the other two peer nodes must be released since there is no chance for recovery. PGW failure produces the same type of situation. Thus, the PDN connections that need to be released in the other nodes must be signaled to the peer nodes or tied implicitly or explicitly to some other identifier.
It is well known in the industry that to deal with entire node failures, it is sufficient to have two basic functions. The first function is that the local node stores the remote peer node identifications (IDs) involved when it creates its corresponding local internal resources, and stores the peer node IDs with the internal resources or equivalent information, such as a pointer to an interface/socket, etc. The second function is an echo/heartbeat message with a restart counter between at least the node's nearest neighbors. Absence of a heartbeat message for a period of time indicates that a neighbor node is assumed down or that a communication fault to the neighbor node has occurred. Receiving a heartbeat with a restart counter that is higher than a previous restart counter also indicates that the peer node restarted, even though no missing heartbeat was detected. When a nearest neighbor is detected as down/restarted, this information might be relayed down the line. Simultaneously, all internal resources associated with the neighbor are torn down.
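The two functions described above can be sketched as follows. This is a minimal illustration only, not taken from any 3GPP specification: the class and method names (PeerState, PeerMonitor, bind, on_heartbeat, check_timeouts) are hypothetical, time is passed in explicitly, and restart counter wraparound is ignored for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class PeerState:
    """What a node remembers about one nearest neighbor (hypothetical layout)."""
    restart_count: int
    last_heard: float
    resources: set = field(default_factory=set)  # local resource IDs tied to this peer


class PeerMonitor:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.peers = {}  # peer_id -> PeerState

    def bind(self, peer_id, restart_count, resource_id, now):
        """First function: store the peer ID with the local resource being created."""
        state = self.peers.setdefault(peer_id, PeerState(restart_count, now))
        state.resources.add(resource_id)

    def on_heartbeat(self, peer_id, restart_count, now):
        """Second function: return the resources to tear down if the peer restarted."""
        state = self.peers.setdefault(peer_id, PeerState(restart_count, now))
        state.last_heard = now
        if restart_count > state.restart_count:  # assumption: no counter wraparound
            state.restart_count = restart_count
            return self._release(state)
        return set()

    def check_timeouts(self, now):
        """Return {peer_id: resources} for peers silent for longer than the timeout."""
        down = {}
        for peer_id, state in self.peers.items():
            if now - state.last_heard > self.timeout_s and state.resources:
                down[peer_id] = self._release(state)
        return down

    @staticmethod
    def _release(state):
        # Hand back everything tied to the peer and forget it locally.
        released, state.resources = state.resources, set()
        return released
```

In this sketch a restart is detected from the counter alone, so a peer that restarts between two heartbeats is still noticed; a real node would also relay the event to its other neighbors.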
It should be understood that at least one bit of the restart counter of each node needs to be explicitly stored with the PDN connection data. This is needed to allow for a peer restart and for new PDN connections to be set up while old PDN connections are being released simultaneously. Otherwise, all PDN connection releases must be performed before any new traffic is set up, which is usually undesirable since it extends the outage duration but is otherwise a valid implementation. The restart counter also allows for the possibility that communication is lost for a period of time between nearest or non-nearest neighbors. For example, if an SGW fails, it is not known if a PGW restarts during the SGW failure. This is not always an unusual situation since a PGW and SGW function might be collocated.
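The reason for storing the restart counter with each PDN connection can be illustrated with a short sketch. The names used here (ConnectionTable, peer_restarted, release_some) are hypothetical, and the whole counter is stored rather than a single bit purely to keep the example simple.

```python
class ConnectionTable:
    """PDN connection records tagged with the peer restart count seen at setup."""

    def __init__(self):
        self.current = {}  # peer_id -> latest known restart count
        self.records = {}  # conn_id -> (peer_id, restart count at setup)

    def peer_restarted(self, peer_id, new_count):
        # Older records for this peer become stale implicitly; nothing is scanned here.
        self.current[peer_id] = new_count

    def create(self, conn_id, peer_id):
        # New connections are accepted immediately, tagged with the current count.
        self.records[conn_id] = (peer_id, self.current[peer_id])

    def is_stale(self, conn_id):
        peer_id, count_at_setup = self.records[conn_id]
        return count_at_setup != self.current[peer_id]

    def release_some(self, limit):
        """Background cleanup: release at most `limit` stale records per call."""
        stale = [c for c in self.records if self.is_stale(c)][:limit]
        for conn_id in stale:
            del self.records[conn_id]
        return stale
```

Because each record carries the count it was created under, old and new connections to the same peer can coexist while the old ones are released gradually, which is the behavior the paragraph above describes.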
Systems may reduce the payloads needed to transfer the information by including the information with existing messaging. FIG. 4 is a signaling diagram illustrating the basic information necessary to allow PDN connection when a node fails. The data indicated in each message is stored either explicitly or implicitly with the node's PDN connection data record. In FIG. 4, three nodes are illustrated, an MME 120, an SGW 122 and a PGW 124. A Create Session Request 130 containing MMEid and MME_restart_count information is sent to the SGW. The SGW relays this information with SGWid and SGW_restart_count information in the Create Session Request 132 to the PGW 124. The PGW responds by sending a Create Session Response 134 having PGWid and PGW_restart_count information to the SGW 122. The SGW then sends a Create Session Response 136 having the PGWid and PGW_restart_count information with SGWid and SGW_restart_count information to the MME. The MME should be aware of which PGW node and SGW node are being used since the MME selects the PGW and SGW in the attach procedure. Thus, external signaling is not required. The MME should be aware of the SGW restart count from the heartbeats. The SGW should know the MME's identity and restart count based on the GTPv2 interface. The PGW is likely to be unaware of the MME's identity since it is communicated indirectly (especially in the PMIP case). Thus, the MME identity must be communicated explicitly. Explicit communication of each node's ID is generally preferred since some nodes are multi-homed.
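The information flow of FIG. 4 can be traced with a small sketch. The message contents follow the description above; the NodeRef type, the attach function, and the dictionary layout are assumptions made for illustration and do not represent actual GTPv2 or PMIP message encodings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRef:
    """A node identity paired with its restart count (hypothetical type)."""
    node_id: str
    restart_count: int


def attach(mme, sgw, pgw):
    """Trace the four FIG. 4 messages and return what each node ends up storing."""
    create_req_130 = {"mme": mme}                     # MME -> SGW
    create_req_132 = {**create_req_130, "sgw": sgw}   # SGW -> PGW, relays MME info
    create_rsp_134 = {"pgw": pgw}                     # PGW -> SGW
    create_rsp_136 = {**create_rsp_134, "sgw": sgw}   # SGW -> MME, relays PGW info

    # Each node stores its own identity plus whatever it received.
    return {
        "MME": {"mme": mme, **create_rsp_136},
        "SGW": {"sgw": sgw, **create_req_130, **create_rsp_134},
        "PGW": {"pgw": pgw, **create_req_132},
    }
```

After the exchange, every node's PDN connection record holds the ID and restart count of all three nodes, which is what allows any one of them to identify the connections affected when another node fails or restarts.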
FIG. 5 is a signaling diagram illustrating the minimal data needed in the messages. The actual data that needs to be logically stored and correlated internally in each node is the same as explained in FIG. 4. However, the message sizes are reduced. The basic heartbeat with node identity and restart count in the bearer setup is sufficient to deal with an entire node becoming unavailable.
A system and method are needed to allow a clean recovery for handling of faults corresponding to an entire node failure. If a major component of a node fails and most connections are still valid, it is not desirable to shut the full node down. This would result in tens or hundreds of thousands of PDN connections that would need to be cleared. This places high loads on the nodes that have not failed simply to indicate that a component failed.
There have been several proposals in 3GPP systems to deal with this issue. In some cases, an additional identifier is added. This identifier varies significantly depending on the particular model of internal resources in each node. To fully understand this problem, hypothetical but realistic vendors' implementations are discussed below.
In a first vendor design, a PGW node is designed with many boards. Each board has one unique IP address. All boards serve any Access Point Name (APN). Load distribution for new connections depends on having a record for each board. This design is such that when a PDN connection is created, both the control plane and user plane function are on the same board. The board does not replicate PDN connection data between boards. Thus, if the board fails, the PDN connections on the board fail. Such a vendor might advocate that a system be developed based primarily on heartbeats. Specifically, if the heartbeat fails to a PGW control plane address, then the SGW should clean up all PDN connections for that PGW control plane IP address. The SGW then sends a “PGW partial failure message” to the MME, which cleans up PDN connections for that PGW control plane IP address. This gives a complete solution for cleanup of a PGW board failure for the first vendor design.
In a second vendor design, a PGW node is designed with a pair of control plane boards in an active warm standby mode. The control board serves any APN. PDN connection state data between the two control plane boards is not replicated. When a control plane board fails, all PDN connections are lost. There are also many user plane boards. Each user plane board has one unique IP address. The control plane board picks one user plane board to use for the PDN connection. The user plane boards also do not replicate. If that user plane board fails, the PDN connection has to be torn down and rebuilt. In this design, it would be advantageous to utilize heartbeats for both the user plane and the control plane. This would place a higher workload on the SGW and the MME, since the resources in those nodes would have to track both addresses.
In a third vendor design, a PGW node is designed with several control plane boards, each with a different IP address. The design uses device processor boards and user plane boards. The user plane boards only control specific device processor boards. If a user plane board fails, the device processor board must also be shut down. When a control plane board fails, it brings down the PDN connections. In this design, as in the second vendor design, it would be advantageous if both user plane IP addresses and control plane IP addresses were tracked.
In a fourth vendor design, a PGW node is designed with a pair of control plane boards, each with a different IP address. Each control plane board is used in a fifty/fifty load sharing mode. The two boards serve any APN. PDN connection state data between the two control plane boards is constantly replicated. When a control plane board fails, both IP addresses are serviced by one board and no stable PDN connection data is lost. There are also many user plane boards. Each user plane board has one unique IP address. The control plane board picks one user plane board to use for the PDN connection. The user plane boards do not replicate. If that user plane board fails, the PDN connection has to be torn down and rebuilt. In this design, only a heartbeat based on the PGW user plane IP address is required.
In a fifth vendor design, full duplication of both control plane board and user plane data is implemented whereby a redistribution scheme is used to allow multiple failures if spread out in time. Only one external IP address to the core network side with multiple load shared interfaces is utilized. Thus, hardware faults in the node itself are not considered a problem. However, this vendor design concentrates on PGW for corporate APN and focuses on IP routing and outside support nodes (e.g., failure of a corporate server for a corporate APN). It is desirable in this design that an indication be provided for an APN that is down (e.g., the APN and the PGW IP address together as a key in a “partial failure message” to the SGW). In this design, the SGW clears all resources associated with the PGW IP address and APN combination. The SGW also indicates the same type of fault to the MME which clears the same data.
In a sixth vendor design, a similar type of high reliability as the fourth vendor design is used. But this vendor design focuses on internal software faults. This vendor design is not related to hardware failure.
Clearly, no single type of identifier can meet the needs of all of the above vendor designs. Furthermore, even for one vendor design type, there may be design modifications over time. This would be an internal issue, except that the peer nodes are forced to implement a search for the particular identifier and are expected to store this information and be able to search for it. Supporting all the various vendor designs is not a reasonable approach, and it does not even address all types of internal resources/components.
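The burden this places on a peer node can be seen in a short sketch of what an SGW would need in order to serve several of the vendor designs described above at once. The key names (pgw_cp_ip, pgw_up_ip, pgw_ip_apn) and the SgwIndex class are hypothetical and chosen only to mirror the identifiers discussed in those designs.

```python
from collections import defaultdict

# Hypothetical key types, one per style of identifier used in the designs above.
KEY_TYPES = ("pgw_cp_ip", "pgw_up_ip", "pgw_ip_apn")


class SgwIndex:
    """One lookup index per externally visible identifier a PGW vendor might use."""

    def __init__(self):
        self.by_key = {k: defaultdict(set) for k in KEY_TYPES}

    def add(self, conn_id, cp_ip, up_ip, apn):
        # Every PDN connection must be entered in every index, whichever one
        # the serving PGW will eventually use to report a partial failure.
        self.by_key["pgw_cp_ip"][cp_ip].add(conn_id)
        self.by_key["pgw_up_ip"][up_ip].add(conn_id)
        self.by_key["pgw_ip_apn"][(cp_ip, apn)].add(conn_id)

    def affected(self, key_type, value):
        """Return the connections to clear for a failure reported under key_type."""
        return set(self.by_key[key_type].get(value, set()))
```

Each additional vendor design adds another index that every peer has to populate and search, and a design change inside one PGW forces a matching change in every peer node.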
Currently, there is an identifier designated as a Forlopp identifier (ID) in a Switching System for Mobile and Fixed networks/Application Execution Environment (AXE) design. The AXE creates a separate ID called a “Forlopp ID” for tracking a related set of resources in a telecommunications call. This is also a trace ID. When a call is initiated, a new Forlopp ID is requested from the AXE Control System (APZ). The APZ loads this ID in a hardware register that is maintained across signals between blocks. All blocks that were visited by the Forlopp ID are also stored by a Forlopp manager in the APZ. If a fault occurs during call processing, a hardware register has the Forlopp ID and an interrupt is generated to trigger the Forlopp manager to start the error handling process. The previously stored information is used to generate a Forlopp error message to each block with the Forlopp ID. The block receives the error including the Forlopp ID and uses this in a small routine to clear any resources associated with the call with that Forlopp ID. This prevents memory leaks and hung resources while not disturbing other calls on the AXE. During execution, a process may be stopped and restarted as part of waiting for external signals.
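The Forlopp mechanism described above can be modeled loosely in software. This sketch is only an analogy: it uses ordinary Python objects in place of the hardware register and interrupt, and the class and method names (Block, ForloppManager, visit, fault) are hypothetical rather than actual AXE interfaces.

```python
import itertools


class Block:
    """A software block that holds resources on behalf of individual calls."""

    def __init__(self, name):
        self.name = name
        self.resources = {}  # forlopp_id -> list of resources held for that call

    def seize(self, forlopp_id, resource):
        self.resources.setdefault(forlopp_id, []).append(resource)

    def on_forlopp_error(self, forlopp_id):
        # Small cleanup routine: drop everything held for the faulty call only.
        return self.resources.pop(forlopp_id, [])


class ForloppManager:
    """Central function that issues IDs and records which blocks each call visited."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.visited = {}  # forlopp_id -> blocks touched by that call

    def new_call(self):
        forlopp_id = next(self._ids)
        self.visited[forlopp_id] = []
        return forlopp_id

    def visit(self, forlopp_id, block, resource):
        if block not in self.visited[forlopp_id]:
            self.visited[forlopp_id].append(block)
        block.seize(forlopp_id, resource)

    def fault(self, forlopp_id):
        """Send a Forlopp error to each visited block and collect what was freed."""
        freed = []
        for block in self.visited.pop(forlopp_id, []):
            freed.extend(block.on_forlopp_error(forlopp_id))
        return freed
```

A fault on one call releases only the resources recorded under that call's ID, leaving other calls that pass through the same blocks undisturbed.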
To support this function, a Forlopp ID adapted software module stores the Forlopp ID with the call related data. When the call is restarted, the block restores the Forlopp ID before continuing execution. It would be advantageous to utilize this existing concept to solve the aforementioned problems. However, the Forlopp ID in the AXE is on a call ID (or command ID) basis identifying a single call. This existing system utilizes a straightforward one to one mapping. Although this could be implemented with PDN connections, it would not assist in reducing signaling at faults, which is one of the key differences between the AXE Forlopp and the present invention.
Signaling reduction may be achieved with a trivial extension of AXE if only a single node were involved. The MME is the first step in the chain. The MME could generate a Forlopp ID as an integer or other ID. That Forlopp ID would be included in the Create Session Request and the receiving nodes could store that ID. The MME vendor would look at its hardware/software model and pick the Forlopp IDs so that the MME's components that are likely to fail are in one to one correspondence with the Forlopp IDs generated. Here the Forlopp ID is chosen by the MME. However, the existing proposals and mechanisms use IDs that are chosen to correspond to externally seen identifiers (e.g., IP addresses of an interface or APN).
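The single-node extension described above might look like the following sketch. It assumes a hypothetical MME whose failure-prone components are boards, a hypothetical forlopp_id field carried in the Create Session Request, and a hypothetical failure indication message; none of these names come from an existing specification.

```python
from collections import defaultdict


class Mme:
    """MME that assigns one Forlopp-style ID per failure-prone internal component."""

    def __init__(self, components):
        # One-to-one mapping chosen by the MME vendor (hypothetical numbering).
        self.id_of = {c: i for i, c in enumerate(components, start=1)}

    def create_session_request(self, conn_id, component):
        # The ID of the component handling this connection rides in the request.
        return {"conn_id": conn_id, "forlopp_id": self.id_of[component]}


class Receiver:
    """A receiving node (e.g., an SGW) that stores the ID without interpreting it."""

    def __init__(self):
        self.by_id = defaultdict(set)  # forlopp_id -> connection IDs

    def on_create_session_request(self, msg):
        self.by_id[msg["forlopp_id"]].add(msg["conn_id"])

    def on_failure_indication(self, forlopp_id):
        # One message releases every connection tied to the failed component.
        return self.by_id.pop(forlopp_id, set())
```

The receiving node never needs to know what the ID means inside the MME; it only stores the value and releases everything filed under it when told to, so a component failure costs one message rather than one message per PDN connection.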
The Forlopp ID is managed by a single centralized function in the AXE. This cannot be applied to solve the aforementioned problems due to scaling and latency issues. Additionally, there are still the problems associated with a multi-vendor environment. In addition, the AXE typically behaves, in most respects, as a single CPU processor and monitors itself. The AXE does not deal with lost signals and peers going down (only processes).