1. Field of the Invention
Embodiments of the present invention generally relate to distributed computing systems and methods, and more specifically to evaluating the resiliency of a distributed computing service by inducing latency.
2. Description of Related Art
A broad variety of computing applications have been made available to users over computer networks. Frequently, a networked application may be provided using multiple interacting computing nodes within a distributed computer system. The networked application exists as one or more networked application components executing on one or more computing nodes. For example, a web site may be provided using a web server (running on one node within the distributed computing system) configured to receive requests from users for web pages. The requests can be passed to an application server (running on another node within the distributed computing system), which in turn processes the requests and generates responses that are passed back to the web server, and ultimately to the users.
Another example of a networked application includes a content distribution system used to provide access to media titles over a network. Typically, a content distribution system may include various servers such as access servers and content servers. Clients may connect to the servers using a content player, such as a gaming console, computing system, computing tablet, mobile telephone, or network-aware DVD player. The content server stores files (or “streams”) available for download from the content server to the content player. Each stream may provide a digital version of various forms of video or other content, such as a movie, a television program, a sporting event, user generated content, or a staged or live event captured by recorded video. Users access the service by connecting to a web server, where a list of content is available. Once a request for a particular title is received, the title may be streamed to the client system over a connection to an available content server.
In systems such as these, latency and errors may occur in various communication paths between a networked application component running on one server and a dependent networked application component running on another server. These latency or error conditions may result from a server or network device that is overburdened or has experienced a software or hardware failure. In some cases, the dependent networked application component may not be resilient to such latency or errors in the communication paths with the networked application component on which it depends. As a result, the dependent networked application component may in turn introduce latency or errors in communication paths to other networked application components, potentially cascading latency, error conditions, or other problems to one or more application components throughout the distributed computer system.
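By way of illustration only, the cascading effect described above may be sketched with a simple model in which each dependent component adds its own processing time to the time it spends waiting on the component it depends on. The function names, the millisecond figures, and the optional timeout with a fallback response are hypothetical assumptions introduced for this sketch; they are not drawn from any particular system.

```python
from typing import Optional, Sequence


def response_time(own_work_ms: float,
                  upstream_latency_ms: float,
                  timeout_ms: Optional[float] = None) -> float:
    """Time one dependent component takes to answer a single request.

    Hypothetical model: without a timeout the component waits for the
    full upstream latency; with a timeout it stops waiting once the
    timeout elapses and answers with a fallback response.
    """
    if timeout_ms is not None and upstream_latency_ms > timeout_ms:
        waited = timeout_ms
    else:
        waited = upstream_latency_ms
    return own_work_ms + waited


def chain_latency(own_work_ms: Sequence[float],
                  origin_latency_ms: float,
                  timeout_ms: Optional[float] = None) -> float:
    """Latency observed at the end of a chain of dependent components.

    own_work_ms lists the processing time of each component, ordered
    from the one closest to the slow origin outward.
    """
    latency = origin_latency_ms
    for work in own_work_ms:
        latency = response_time(work, latency, timeout_ms)
    return latency


# Three components, each doing 5 ms of work, behind an origin that has
# slowed to 2000 ms. Without timeouts the delay reaches every component.
unprotected = chain_latency([5, 5, 5], 2000)

# With an assumed 100 ms timeout, each component bounds its own delay.
protected = chain_latency([5, 5, 5], 2000, timeout_ms=100)
```

In this model the unprotected chain passes the full 2000 ms delay to every downstream component, whereas the chain with timeouts limits each component's response time to its own work plus the timeout. Real components may also fail outright or exhaust resources while waiting, which this sketch does not attempt to capture.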
Such cross-latencies and errors across multiple networked application components are difficult to test, because latencies and errors within a complex distributed computer system are difficult to model with sufficient accuracy. Networked application components that appear to be sufficiently resilient on a test system may nevertheless fail when deployed on the distributed computer system. As the foregoing illustrates, what is needed is a better way to test the resiliency of an application running on a distributed computer system.