Content delivery networks (CDNs) and other distributed platforms are tasked with delivering the content of various content providers in an accelerated manner. To do so, these platforms deploy points-of-presence (PoPs) at various geographic locations. Each PoP includes one or more servers that cache and distribute the content provider content in an optimized manner. The optimized distribution stems in large part from the geographic proximity of the PoP to content consumers. Requests from content consumers are routed to the PoP that is closest to them so that the content consumers receive the requested content over the fewest hops. The optimized distribution provided by the distributed platforms can also involve modifying the content itself by compressing, reducing resolution, changing bit rates, etc. so as to reduce the size of the content that needs to be delivered. In some cases, the optimized distribution stems from using higher performing resources of the distributed platform, rather than resources of the content provider, to deliver the content.
Distributed platforms continually update their resources and configurations to improve upon their content delivery performance. The resources include the deployed hardware such as the caching servers, routers, load balancers, etc. as well as any of the processors, memory, storage, and networking components of the hardware. Resources can be updated by swapping out lower performing hardware with higher performing hardware or adding to existing hardware. Resources can also be updated by scaling the deployed infrastructure. This involves increasing the content delivery footprint of the distributed platform by adding new hardware to supplement the hardware already deployed to the PoPs and by bringing new PoPs online to take over the content delivery duties from more remote or more overloaded PoPs. The configurations control the operation of the resources. Some such controls involve specifying how requests are processed and how requests are routed. The configurations also control how the caching server cache is managed by controlling what, how, and when content is cached and evicted from the cache.
The problem is how to test a resource or configuration update prior to introducing the update into the distributed platform production environment, wherein the production environment includes the deployed resources and configurations that are actively responding to content requests and being used to distribute content provider content to different users. Testing any configuration or resource update in the production environment is problematic because any error or incompatibility introduced by the update may not only degrade performance of the machine running the updated resource or configuration, but may also have an undesired trickle-down effect that degrades performance of other machines and other resources or configurations that interoperate with the updated configuration or resource. This performance degradation can affect customers that depend on the distributed platform for the distribution of their content and users that obtain the content from the distributed platform.
The alternative is to test the updated configuration or resource without real-time production traffic. However, production traffic covers most scenarios that the distributed platform is likely to encounter, and not using production traffic in testing can lead to inaccurate and incomplete tests, whereby bugs remain hidden until the update is deployed to the production environment and run against the production traffic. One solution is to simply store a copy of the production traffic and replay it later in a test environment. This, however, introduces its own set of issues because the replayed traffic is no longer real-time and suffers from decay. Accordingly, time-to-live parameters, internal and external network conditions, authentication tokens, etc. may change between the time the production traffic originated and the time it is replayed. These changes can again produce inaccurate or invalid test results or, at the very least, performance results that cannot be compared with performance of the production environment.
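The decay of replayed traffic can be illustrated with a minimal sketch. The record layout, the field names, and the function names below are hypothetical and are not taken from any particular platform; the sketch only assumes that each captured request carries an authentication token with an absolute expiry time.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CapturedRequest:
    """Hypothetical record of one production request saved for later replay."""
    url: str
    captured_at: float       # capture time, in seconds
    token_expires_at: float  # absolute expiry time of the request's auth token


def is_stale_on_replay(request: CapturedRequest, replay_time: float) -> bool:
    """A replayed request is stale once its token has expired at replay time."""
    return replay_time >= request.token_expires_at


def stale_fraction(requests: List[CapturedRequest], replay_time: float) -> float:
    """Fraction of captured requests that would no longer be valid when replayed."""
    if not requests:
        return 0.0
    stale = sum(1 for r in requests if is_stale_on_replay(r, replay_time))
    return stale / len(requests)


# Illustrative data: three requests captured 100 seconds apart, each carrying
# a token assumed to be valid for 300 seconds from the time of capture.
captured = [
    CapturedRequest("/video/1", captured_at=0.0, token_expires_at=300.0),
    CapturedRequest("/video/2", captured_at=100.0, token_expires_at=400.0),
    CapturedRequest("/video/3", captured_at=200.0, token_expires_at=500.0),
]
```

Under these assumptions, replaying the capture shortly after it was taken leaves every request valid, while replaying it an hour later leaves none valid, so the test environment would exercise rejection paths that the production environment never saw for the same traffic.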
Moreover, in some cases, the distributed platform cannot cold start a new configuration or resource in the production environment even when that resource is fully tested. In the context of a CDN, a new PoP that receives a cold start is likely to have performance issues. Specifically, the new PoP will not have any content cached when it begins operation. During operation, every request that initially comes in will likely result in a cache miss. The cache miss is an expensive operation that requires the new PoP resources to retrieve the requested content from an origin source, possibly cache the retrieved content locally, and then distribute the content to the requesting content consumer. Several such cache misses can overwhelm the new PoP resources and cause the new PoP to fail.
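The cost of a cold start can be sketched with a small simulation. The cache class, its least-recently-used eviction policy, and the request list below are illustrative assumptions and do not describe how any particular PoP manages its cache; the sketch only counts how many requests must be fetched from the origin when the same traffic reaches an empty cache versus a previously populated one.

```python
from collections import OrderedDict
from typing import Iterable, Optional


class LruCache:
    """Hypothetical fixed-capacity cache with least-recently-used eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def serve(cache: LruCache, urls: Iterable[str]) -> int:
    """Serve each request and return the number of origin fetches (cache misses)."""
    origin_fetches = 0
    for url in urls:
        if cache.get(url) is None:
            origin_fetches += 1                # cache miss: retrieve from the origin
            cache.put(url, f"body-of-{url}")   # then cache the retrieved content
    return origin_fetches


traffic = ["/a", "/b", "/c", "/a", "/b", "/c"]

cold_pop = LruCache(capacity=3)
cold_misses = serve(cold_pop, traffic)   # every first request for an object misses

warm_pop = LruCache(capacity=3)
serve(warm_pop, ["/a", "/b", "/c"])      # populate the cache before taking traffic
warm_misses = serve(warm_pop, traffic)   # the same traffic now hits the cache
```

In this sketch the cold cache incurs one origin fetch per distinct object, whereas the populated cache incurs none for the same traffic, which is the difference in load that can overwhelm a newly started PoP.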
Accordingly, there is a need to test configurations and resources of a distributed platform with real-time production traffic in a manner that does not impact the performance of the distributed platform. There is also a need to avoid cold starting new configurations or resources in the distributed platform production environment.