A distributed platform provides access to content and services from different points-of-presence (PoPs) dispersed across the Internet. The same set of content and services can typically be accessed from any of the distributed platform PoPs. The replicated access optimizes delivery by increasing total capacity for the content and services while also localizing delivery of the content and services to requestors that are geographically proximate to a respective PoP.
Access to the PoPs is controlled by the traffic management functions of the distributed platform. The traffic management functions are typically implemented in part with Domain Name System (DNS) servers. One or more DNS servers resolve client queries directed to content and services served from the distributed platform with addressing directed to different PoPs that are selected according to a configured traffic management scheme. The distributed platform can be a content delivery network (CDN), a cloud service provider, or any operator of a distributed set of servers, wherein the distributed set of servers collectively functions to serve content or services over a digital network in response to different client requests for that content or those services.
Maintaining a steady state across the distributed platform PoPs is a significant challenge. Expected and unexpected demand spikes regularly occur across the different PoPs and disrupt the steady state. The demand spikes can create fluctuations in the load, cache footprint, latency, and other performance or resource related metrics that each PoP experiences at any given time. Expected demand spikes can be due to the time-of-day at which requestors access content from the different PoPs as well as new content becoming accessible. Unexpected demand spikes can be due to unforeseen viral content, unforeseen events, attacks, network failures, or server failures in the PoPs.
Static traffic management policies of the distributed platform in combination with these demand spikes cause the steady state deviations across the distributed platform PoPs. For instance, Anycast or some form of requestor Internet Protocol (IP) mapping at the DNS level determines how client issued requests are distributed across the PoPs. The primary objective of these static traffic management policies is to optimize content delivery performance by routing requestors to the PoP that is geographically proximate to them. This objective, however, does not account for demand spikes that occur in the different regions served by the different PoPs, whereby the demand spikes can unbalance the steady state in two or more of the distributed platform PoPs.
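As a minimal illustrative sketch of such a static policy, the following Python maps each requestor region to its geographically nearest PoP and reports the resulting utilization. The PoP names, coordinates, capacities, and demand figures are hypothetical values chosen for illustration and are not taken from any particular distributed platform.

```python
import math

# Hypothetical PoPs: (latitude, longitude) and capacity in requests per second.
POPS = {
    "lax": {"loc": (34.05, -118.24), "capacity": 100},
    "nyc": {"loc": (40.71, -74.01), "capacity": 100},
}

def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_pop(client_loc):
    """Static policy: always choose the geographically closest PoP."""
    return min(POPS, key=lambda name: haversine_km(client_loc, POPS[name]["loc"]))

def utilization(demand):
    """demand is a list of (client_location, request_rate) pairs."""
    load = {name: 0 for name in POPS}
    for loc, rate in demand:
        load[nearest_pop(loc)] += rate
    return {name: load[name] / POPS[name]["capacity"] for name in POPS}

# A regional spike near San Francisco, with ordinary demand near Boston.
demand = [((37.77, -122.42), 150), ((42.36, -71.06), 40)]
result = utilization(demand)
```

Under these assumed figures, the static policy sends the entire West Coast spike to the "lax" PoP, which exceeds its capacity, while the "nyc" PoP retains spare capacity that the policy never uses.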
Accordingly, there is a need for the distributed platform to implement dynamic traffic management policies. Dynamic traffic management would allow the distributed platform to respond to expected and unexpected demand spikes and restore the PoPs to a steady state by shifting traffic between the PoPs as dictated by current traffic patterns. The dynamic traffic shifting can be used to rebalance loads between different PoPs, rebalance the cache footprint or cache utilization across the PoPs, improve latency, and rebalance other measurable metrics relating to the distributed platform's ability to serve content and services.
Failure to do so could result in degraded performance at one or more PoPs. Degraded PoP performance can manifest as slower response times (e.g., greater latency), dropped requests, and even resource failures. In the worst case, static traffic management policies combined with excess demand could render a PoP unresponsive or inaccessible.
Nevertheless, dynamic traffic management policies can produce unexpected ramifications across the distributed platform. In some cases, a DNS reconfiguration to restore the steady state at one PoP can create a chain reaction in which the steady state at another PoP becomes unbalanced as a result of the DNS reconfiguration, thereby requiring additional reconfigurations or traffic shifts. More generally, it is difficult to gauge the actual load, cache, latency, or other performance or resource impact any single traffic management change has locally on a PoP and globally across all distributed platform PoPs.
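The chain reaction can be illustrated with a small Python sketch. It assumes a deliberately naive rule in which the excess load of an overloaded PoP is moved wholesale to a single neighboring PoP. The PoP names, load figures, and steady state limits are hypothetical, and the rule is only an illustration, not a traffic management scheme described here.

```python
# Hypothetical steady state limits and current loads, in requests per second.
LIMIT = {"a": 80, "b": 80, "c": 80}
load = {"a": 120, "b": 70, "c": 30}

def overloaded(load):
    """Return the PoPs whose load exceeds their steady state limit."""
    return sorted(p for p in load if load[p] > LIMIT[p])

def shift_excess(load, src, dst):
    """Naive reconfiguration: move all load above src's limit onto dst."""
    new_load = dict(load)
    excess = max(0, load[src] - LIMIT[src])
    new_load[src] -= excess
    new_load[dst] += excess
    return new_load

before = overloaded(load)                 # PoP "a" is over its limit
after_first = shift_excess(load, "a", "b")
knock_on = overloaded(after_first)        # the shift has now overloaded "b"
after_second = shift_excess(after_first, "b", "c")
settled = overloaded(after_second)        # a second shift is needed to settle
```

With these assumed numbers, relieving PoP "a" pushes PoP "b" past its limit, so a second shift to PoP "c" is required before every PoP is back within its limit.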
Humans cannot account for all the variables affected by a traffic management change. Humans are also slow to react to a sudden spike and may act hastily, creating the undesired chain reaction across the distributed platform. Accordingly, there is a further need to automate any such traffic management function and remove the human element.