A network control system uses a cluster of network controllers to implement logical networks on a physical network. One of the challenges of large networks (including datacenters and enterprise networks) is maintaining and recomputing a consistent network state in the face of various failures in the network. In some network control systems, changes are sent between the different network controllers in the system. As changes are made, a network controller may receive conflicting inputs from multiple controllers. For example, when slices (e.g., logical or physical network entities) are moved from one controller to another, there may be a period of time during which two controllers send changes for the same slice. When one controller lags behind another, both controllers may believe that they are responsible for the same slice. This may result in inconsistent state being applied at different controllers in the system.
In some network control systems, the network controller cluster computes all pending state changes in an arbitrary order, so related changes may not be processed together and the state may be inconsistent for a period of time. For example, if the cluster is in the middle of computing a large amount of work (e.g., during slice rebalancing), and a logical network configuration change arrives that requires the replacement of a single flow in the dataplane, the cluster might delete that flow right away and create the replacement flow much later, after the rebalancing work completes. The dataplane connectivity for that one flow would be down for the entire time that the flow is missing from the dataplane (possibly tens of minutes).
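The effect of processing order on connectivity can be illustrated with a minimal sketch. It is not taken from any particular controller implementation: the `Change` record, the `downtime` helper, the flow names, and the per-change processing costs are all hypothetical, and the model simply assumes that each change takes effect once its computation finishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    op: str    # "add" or "delete"
    flow: str  # hypothetical flow identifier
    cost: int  # hypothetical time to compute the change, in seconds


def downtime(changes, old_flow, new_flow):
    """Seconds during which neither old_flow nor new_flow is installed.

    Assumes changes are processed one at a time in list order and that each
    change is applied to the dataplane when its computation completes.
    """
    installed = {old_flow}
    down = 0
    for change in changes:
        # While this change is being computed, the dataplane still holds
        # the state left behind by the previous changes.
        if old_flow not in installed and new_flow not in installed:
            down += change.cost
        if change.op == "add":
            installed.add(change.flow)
        else:
            installed.discard(change.flow)
    return down


# Hypothetical rebalancing work: three unrelated changes of 400 seconds each.
rebalance = [Change("add", f"rebalance-{i}", 400) for i in range(3)]

# Arbitrary order: the delete is applied first, the replacement arrives last.
arbitrary = [Change("delete", "old", 1)] + rebalance + [Change("add", "new", 1)]

# Related changes kept together: the replacement is installed before the delete.
grouped = [Change("add", "new", 1), Change("delete", "old", 1)] + rebalance

print(downtime(arbitrary, "old", "new"))  # 1201
print(downtime(grouped, "old", "new"))    # 0
```

With these assumed costs, the arbitrary ordering leaves the flow missing for about twenty minutes, while keeping the two related changes together avoids the gap.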
As another example, when a dataplane is already wired and working correctly, and the cluster restores from a snapshot, the cluster computes all the network state in an arbitrary order. If the cluster output tables were allowed to apply those changes to the dataplane as they are computed, the dataplane would suffer downtime during the entire computation time because the state is incomplete until the computation finishes. This does not happen in practice because the external output tables treat a snapshot restore as a special case and do not send changes to the dataplane while the cluster is working. However, it is undesirable to handle special cases like this.
In addition, in some network control systems, state is not deleted in a consistent manner, which can leave incorrect state in place. For example, when a controller sees state that the controller has not computed the need for, the controller will treat that state as garbage and delete that data lazily (and only when the cluster is idle). Treating network state as garbage and deleting it lazily can prolong dataplane incorrectness. For example, if the physical forwarding elements (PFEs) have a flow that is directing packets incorrectly and the controller did not compute the need for that flow (e.g., because the flow was manually added or was never deleted), then the controller will treat that flow as garbage and not delete it for a certain period of time (e.g., at least 60 seconds). The garbage collection lag can be even longer while the cluster performs state computations. The network controllers delay garbage collection while processing the network state because the output is likely to be inconsistent until the processing is completed. The network controllers can be working for long periods of time before reaching a consistent state, prolonging the garbage collection time lag.
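The timing of lazy garbage collection described above can be sketched as follows. This is an illustrative model only: the function name `garbage_removal_time` and its parameters are hypothetical, and the 60-second value is the example delay mentioned above rather than a fixed property of any system.

```python
GC_DELAY_SECONDS = 60  # example minimum delay from the text; assumed constant here


def garbage_removal_time(seen_at, cluster_idle_at):
    """Earliest time (in seconds) at which a garbage flow would be removed.

    Assumes the flow is removed only after the minimum delay has elapsed
    AND the cluster has finished its pending state computations.
    """
    return max(seen_at + GC_DELAY_SECONDS, cluster_idle_at)


# An incorrect flow is noticed at t=0 while the cluster is idle from t=10.
print(garbage_removal_time(0, 10))    # 60

# The same flow is noticed while the cluster stays busy until t=1800.
print(garbage_removal_time(0, 1800))  # 1800
```

Under these assumptions, the incorrect flow keeps misdirecting packets for as long as the cluster remains busy, which is the prolonged lag the paragraph describes.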
In contrast, if the controller has computed the need for state in the past, but now decides to explicitly delete that state, the controller will delete that state from the forwarding elements immediately. In some network control systems, this distinction between garbage collection and explicit deletion is not applied consistently in the runtime, which leads to complexity and undesirable behavior. For example, when a publisher disconnects from a subscriber, the subscriber cleans up the subscription data received from the publisher after a brief time delay. The controller treats the cleaned-up subscription data as explicit deletions and immediately deletes the state from the input tables, even though the removal of the subscription data was not the result of a configuration change to explicitly delete the subscription data. Such deletions cause dataplane downtime whenever a subscriber loses a publisher for longer than the preset time delay. For example, when a backup (standby) controller is promoted to master before it finishes computing the standby network state, the controllers that receive state from it may delete the state received from the previous master controller before the promoted controller can resume publishing new state.
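The subscriber behavior described above can be modeled with a short sketch. The `Subscriber` class, its method names, the publisher labels, and all timings are hypothetical and chosen only to illustrate how a cleanup timeout that is treated as explicit deletion can open a gap in the installed state.

```python
class Subscriber:
    """Toy model of a subscriber that cleans up data from lost publishers."""

    def __init__(self, cleanup_delay):
        self.cleanup_delay = cleanup_delay  # seconds to wait after a disconnect
        self.state = {}                     # publisher -> set of entries
        self.disconnected_at = {}           # publisher -> disconnect time

    def receive(self, publisher, entries):
        self.state[publisher] = set(entries)
        self.disconnected_at.pop(publisher, None)

    def disconnect(self, publisher, now):
        self.disconnected_at[publisher] = now

    def tick(self, now):
        """Delete data from publishers lost for at least cleanup_delay.

        Returns the entries removed at this tick. In the behavior described
        above, these removals are treated as explicit deletions.
        """
        deleted = set()
        for publisher, lost_at in list(self.disconnected_at.items()):
            if now - lost_at >= self.cleanup_delay:
                deleted |= self.state.pop(publisher, set())
                del self.disconnected_at[publisher]
        return deleted

    def installed(self):
        """Union of entries currently held from all publishers."""
        return set().union(*self.state.values())


sub = Subscriber(cleanup_delay=30)
sub.receive("master", {"flow-a", "flow-b"})
sub.disconnect("master", now=10)          # master fails at t=10

print(sorted(sub.tick(now=20)))           # [] (still within the delay)
print(sorted(sub.tick(now=40)))           # ['flow-a', 'flow-b']
print(sub.installed())                    # set()

# The promoted controller finishes computing and publishes only at t=100.
sub.receive("promoted", {"flow-a", "flow-b"})
print(sorted(sub.installed()))            # ['flow-a', 'flow-b']
```

In this assumed timeline, the state is absent from t=40 until the promoted controller publishes at t=100, even though no configuration change ever requested its deletion.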