In cloud computing environments, applications are often configured to run on a cluster of virtual machines (“VMs”) that may run on one or more physical computers or nodes, such that each member of the cluster processes a part of the input to the cluster. This allows the applications to withstand greater loads that, without the cluster, may overwhelm the applications. Fault tolerance is an important aspect of a scalable application design in a cluster of VMs. Failure of one application instance or its physical host disrupts the network traffic flowing through it. This disruption may manifest itself as a connection loss between a client (for example, a browser application) and a server application (for example, a middlebox application) running on the cluster. Fault tolerance designs aim to allow applications to recover from failure without impacting the connectivity between the server and the client.
However, application-level fault tolerance solutions increase design complexity, are specific to a particular application (and therefore not readily usable with other applications), and cannot completely mask failures (such as loss of client connectivity). While certain classes of large-scale applications have built-in support for fault tolerance, commodity applications often resort to system-level solutions to preserve application state upon failure. These system-level solutions, however, are often heavyweight and require a great amount of resources to back up a cluster of virtual machines. In many existing solutions, these shortcomings can lead to load imbalance when one or more VMs fail under heavy load.
Referring now to FIGS. 1-2A, a method 100 (shown in FIG. 1) according to the prior art may be configured for execution by a processor on a computer system to perform load balancing of stateful applications running on a cluster of virtual machines (VMs) using a split/merge paradigm. The VM cluster may physically reside on one or more interconnected computer systems, which may be nodes in a cloud computing environment. FIG. 2A depicts one such cluster 200 having one VM 212 (designated as VM1) hosting a set of client sessions 230 {A, B, C, D} via a network controller 204 and an orchestrator 208. Each client session 230 has a corresponding client state 224 in one or more VMs 212. The client state 224 for a given client session 230 does not contain the corresponding application state, operating system state, or other states that are not unique to that client session 230. Rather, the client state 224 contains only the subset of data that the corresponding client session 230 requires to run one or more stateful applications in the application layer 216 of the VM 212 (the client state may include, for example, time/session state for a client session, NAT configurations for a particular flow, etc.).
Referring now to FIGS. 1-2A, the client sessions 230 connect to the cluster 200 in step 104 of the method 100 by communicating with the network controller 204. The network controller 204 is responsible, in part, for directing network traffic flow (including, for example, by inspecting packet headers) of the client sessions 230 from their respective clients to VM1 (as well as to and between other VMs 212 that may be in the cluster 200). The network controller 204 communicates with the orchestrator 208 to determine which VM 212 holds or should service the client session 230. The orchestrator 208 tracks the load on each VM 212 in the cluster 200, the location of each client state 224, as well as all other necessary network information (such as operating system, application information, etc.). In the depicted example, the network controller 204 determines that each of the {A, B, C, D} client sessions 230 should have its corresponding client state 224 present and processed on VM1 (in addition to other information associated with the client session 230 which may be necessary for servicing the corresponding client's use of the applications on the VM 212). The network controller 204 communicates this choice to VM1. The network controller 204 directs the network traffic flow for the client sessions {A, B, C, D} to VM1, after consulting with the orchestrator 208. VM1, and the VMs 212 on the network generally, each have a system library 220 that provides the API necessary to generate a client state 224 for each client session 230 that they service. The API may be provided at the hypervisor level, accessible to applications on a given VM 212, and allows the applications to create, store, and retrieve per-client states (for example, client session states) and global states in the applications. The API may include, for example, the following:
    ID = create_state(size)
    state_object = get_state(ID)
    put_state(ID, state_object)
    gID = create_global(size)
    global_state_obj = get_global(gID)
    put_global(ID, global_state_obj)

It will be apparent to one of ordinary skill in the art, based on the above table, how to implement an API to perform the recited functions of creating, storing, and retrieving per-client states. Applications running on the VMs 212 may use the above API to: get a client request; get the session ID based on the request; generate a state object by getting the relevant state; process the client request (including updating the state object and global states); store the updated state; and reply to the client.
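By way of illustration only, the request-handling sequence recited above may be sketched in Python as follows. The StateStore class is a hypothetical in-memory stand-in for the system library 220, and the request format, the counter encoding, and the reply string are assumptions made for this sketch; none of them is taken from the prior art method 100 itself.

```python
import itertools


class StateStore:
    """Hypothetical in-memory stand-in for the system library 220."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._client = {}   # ID  -> per-client state object
        self._global = {}   # gID -> global state object

    def create_state(self, size):
        state_id = next(self._ids)
        self._client[state_id] = bytearray(size)
        return state_id

    def get_state(self, state_id):
        return self._client[state_id]

    def put_state(self, state_id, state_object):
        self._client[state_id] = state_object

    def create_global(self, size):
        global_id = next(self._ids)
        self._global[global_id] = bytearray(size)
        return global_id

    def get_global(self, global_id):
        return self._global[global_id]

    def put_global(self, global_id, global_state_obj):
        self._global[global_id] = global_state_obj


def _increment(buf):
    """Treat a fixed-size buffer as a big-endian counter and add one."""
    value = int.from_bytes(buf, "big") + 1
    buf[:] = value.to_bytes(len(buf), "big")
    return value


def handle_request(store, sessions, global_id, request):
    """One pass: get request, find session, load, update, store, reply."""
    client = request["client"]              # assumed request format
    if client not in sessions:              # first contact: create state
        sessions[client] = store.create_state(4)
    state_id = sessions[client]

    state = store.get_state(state_id)       # per-client state
    shared = store.get_global(global_id)    # global state

    count = _increment(state)               # process: update both states
    total = _increment(shared)

    store.put_state(state_id, state)        # store the updated states
    store.put_global(global_id, shared)
    return f"{client}: request {count}, {total} total"
```

In this sketch the per-client state is the only data keyed to a session, which mirrors the observation that the client state 224 excludes application and operating system state.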
With continued reference to FIGS. 1-2A, FIG. 2A shows the status of the cluster 200 having one active VM 212 (designated as VM1) after four client sessions communicate with the cluster 200 through step 104 of the method 100. The active client sessions 230 are designated as {A, B, C, D}, and each client session 230 has a corresponding client state 224 on VM1.
Referring now to FIGS. 1 and 2B, two additional client sessions 230 designated as {E} and {F} are initiated in step 104 of the method 100. The client sessions {E, F} communicate with the network controller 204, which in turn communicates with the orchestrator 208, to select an available VM 212, i.e. VM1 for client sessions {E, F}. The method 100 may, through the orchestrator and the network controller, direct the network traffic flow of the client sessions {E, F} to VM1. The cluster 200 depicted in FIG. 2B services the newly initiated client sessions {E, F} in addition to client sessions {A, B, C, D} depicted in FIG. 2A, above.
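The interaction between the network controller 204 and the orchestrator 208 in step 104 may be sketched, under stated assumptions, as follows. The class and method names are hypothetical, and the least-loaded placement rule is an assumption of this sketch; the description above states only that the orchestrator tracks load and client state location and that an available VM is selected.

```python
class Orchestrator:
    """Hypothetical model: tracks per-VM load and client state location."""

    def __init__(self, vms):
        self.load = {vm: 0 for vm in vms}   # VM name -> session count
        self.location = {}                  # session -> VM name

    def place(self, session):
        # Return the VM already holding this session's client state,
        # or pick the least-loaded VM (an assumed policy) for a new one.
        if session not in self.location:
            vm = min(self.load, key=self.load.get)
            self.location[session] = vm
            self.load[vm] += 1
        return self.location[session]


class NetworkController:
    """Hypothetical model: directs each session's traffic to one VM."""

    def __init__(self, orchestrator):
        self.orchestrator = orchestrator
        self.flow_table = {}                # session -> VM name

    def route(self, session):
        # Consult the orchestrator once, then reuse the cached decision.
        if session not in self.flow_table:
            self.flow_table[session] = self.orchestrator.place(session)
        return self.flow_table[session]
```

With a single VM in the cluster, as in FIG. 2B, every session in this sketch is necessarily routed to VM1.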
Referring now to FIGS. 1 and 2C, before, during, or after performing step 104, the method 100 may evaluate the status of the cluster 200 in step 108 to determine whether the cluster 200 is load balanced. Having too many client sessions 230 serviced through too few VMs 212 is generally undesirable and may lead to a significant performance loss. Therefore, the method 100 may split the load of one or more VMs 212 in the cluster 200, and transfer some client states 224 to less burdened VMs 212 in step 112. By way of example, the method 100 may perform step 108 after the client sessions {A, B, C, D, E, F} are serviced via VM1. By analyzing the load of VM1 in step 108, the method 100 may determine that VM1 is overloaded and requires rebalancing.
With continued reference to FIGS. 1 and 2C, the method 100 may make additional VMs 212 available on the cluster 200, i.e. VM2 and VM3. The method 100 may, through the network controller 204 and the orchestrator 208, select an appropriate VM 212 for each client session 230 to be moved. Since the load that each client session 230 places on a particular VM 212 is unique only at the granularity of its client state 224, all that the method 100 needs to move to a new VM 212 is that client state 224. Other information and states, such as operating system states and other application states, already exist on other VMs 212 in the cluster 200 and need not be copied. Therefore, the method 100 may move the client states 224 for the selected client sessions 230 to the newly selected VM 212. In the example depicted in FIG. 2C, the orchestrator 208 designates VM2 as a suitable VM 212 to service client sessions {C, D}, and VM3 as suitable for client sessions {E, F}. The network controller 204 moves the client state 224 associated with each of these client sessions 230 to the appropriate VM 212 in step 112, and directs the network traffic flow for each of the moved client sessions 230 to the appropriate VM 212 in step 116. During the time that the network controller 204 is moving a particular client state 224 to a different VM 212 in step 112, the network traffic flow of the corresponding client session 230 may be buffered and subsequently redirected to the new VM 212 in step 116.
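The move-and-redirect behavior of steps 112 and 116, including buffering of traffic while a client state is in transit, may be sketched as follows. This is a simplified single-process model under stated assumptions: the Cluster class, its method names, and the representation of packets as plain values are all hypothetical.

```python
from collections import defaultdict, deque


class Cluster:
    """Hypothetical model of client state moves with traffic buffering."""

    def __init__(self):
        self.states = defaultdict(dict)     # VM -> {session: client state}
        self.routes = {}                    # session -> VM
        self.buffers = {}                   # session -> packets held mid-move
        self.delivered = defaultdict(list)  # VM -> [(session, packet)]

    def connect(self, session, vm, state):
        self.states[vm][session] = state
        self.routes[session] = vm

    def send(self, session, packet):
        # Hold traffic for a session whose client state is being moved.
        if session in self.buffers:
            self.buffers[session].append(packet)
        else:
            self.delivered[self.routes[session]].append((session, packet))

    def begin_move(self, session):
        # Step 112 starts: traffic for this session is buffered from now on.
        self.buffers[session] = deque()

    def finish_move(self, session, dst):
        # Step 112 completes: only the client state is transferred.
        src = self.routes[session]
        self.states[dst][session] = self.states[src].pop(session)
        # Step 116: redirect the flow, then release the buffered packets.
        self.routes[session] = dst
        for packet in self.buffers.pop(session):
            self.delivered[dst].append((session, packet))


cluster = Cluster()
for name in "ABCDEF":
    cluster.connect(name, "VM1", {"session": name})

# Split as in FIG. 2C: {C, D} to VM2 and {E, F} to VM3.
for name, target in [("C", "VM2"), ("D", "VM2"), ("E", "VM3"), ("F", "VM3")]:
    cluster.begin_move(name)
    cluster.finish_move(name, target)
```

Packets sent between begin_move and finish_move are delivered only to the destination VM, so the source VM never processes traffic for a client state it no longer holds.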
Referring now to FIGS. 1 and 2C-D, one or more of the client sessions 230 depicted in FIG. 2C may terminate. For example, as depicted in FIG. 2D, client sessions {A, B} are no longer active. In step 108, the method 100 performs a load balancing check and may determine, based on a determination by the orchestrator 208, that the load of the cluster 200 is spread too thinly. For example, the cost of operating an additional VM 212 may outweigh the efficiencies from having the four remaining client sessions 230 serviced by two different VMs 212. The method 100 may determine, then, that one or more of the client states 224 on one or more VMs 212 should be merged into a smaller number of VMs 212. In the example depicted in FIG. 2D, the method 100 merges the client states 224 of the active client sessions {C, D, E, F} into VM2.
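A merge of this kind may be sketched as follows. The function name and the dictionary representation of the cluster are assumptions of this sketch, and the choice of target VM is left to the caller because the description above does not state how the target is selected.

```python
def merge_cluster(placement, target):
    """Move every client state onto `target`.

    `placement` maps a VM name to a {session: client state} dictionary.
    Returns the names of the VMs that hold no client state afterwards.
    """
    emptied = []
    for vm in list(placement):
        if vm == target:
            continue
        # Only client states are transferred; nothing else is copied.
        placement[target].update(placement[vm])
        placement[vm].clear()
        emptied.append(vm)
    return emptied


# Cluster after sessions {A, B} terminate, before the merge.
placement = {
    "VM1": {},
    "VM2": {"C": {"session": "C"}, "D": {"session": "D"}},
    "VM3": {"E": {"session": "E"}, "F": {"session": "F"}},
}
emptied = merge_cluster(placement, "VM2")
```

The returned list identifies VMs that no longer service any client session and could therefore be released, which is where the cost saving of the merge would arise.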
The network controller 204 and the orchestrator 208 may each be implemented as a program, hardware component, or a combination thereof. Each of them may, without limitation, be integrated into a single computer program running on one or more of the systems or nodes in the cluster 200. The orchestrator 208 may split or merge the contents of the VMs 212 on the cluster 200 at particular thresholds. These thresholds may be made configurable by a user, such as a network administrator, or may be configured to change according to predefined conditions.
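A threshold-based split/merge decision of the kind described may be sketched as follows. The threshold names and numeric values are hypothetical and chosen only so that the sketch reproduces the outcomes of FIGS. 2B-2D; the description above does not specify what quantity the thresholds measure.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical, user-configurable values (sessions counted per VM).
    split_above: int = 5   # split when any single VM exceeds this load
    merge_total: int = 4   # merge when the whole cluster is at or below this


def decide(loads, thresholds):
    """Return "split", "merge", or "none" for a {VM: session count} map."""
    if any(n > thresholds.split_above for n in loads.values()):
        return "split"
    if len(loads) > 1 and sum(loads.values()) <= thresholds.merge_total:
        return "merge"
    return "none"
```

Keeping merge_total below split_above gives the decision some hysteresis, so that a cluster that has just been merged onto one VM is not immediately split again.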
Referring now to FIG. 1, steps of the method 100 may be performed in any order, in sequence, or simultaneously. They may further be performed periodically. Additionally, steps of the method 100 may be configured to trigger the performance of its other steps. For example, while the method 100 may periodically perform load balancing checks in step 108, it may additionally perform this step immediately upon receiving a new client connection and before directing its associated network traffic flow to a particular VM.
Referring now generally to FIGS. 1-2D, the method 100 as described above provides a split/merge mechanism for load balancing a cluster of VMs running applications that service client sessions 230. However, the method 100 does not provide fault tolerance. Failure of one or more VMs 212 in the absence of the disclosed invention's fault tolerance functionality may result in a loss of the client states 224 held on the failed VM 212.
It is therefore desirable to provide an elastic and lightweight fault tolerance solution for stateful applications operating in a cluster, having a transparent and load balanced recovery mechanism.