Computer systems continue to grow more powerful and intelligent. As Internet connectivity has increased, software vendors have begun aggregating computing resources to provide extremely powerful software services over the Internet, a model known as “cloud computing.” The underlying software systems that power these Internet services are distributed: they run on a large number of networked computer servers that communicate and coordinate with one another. For example, it is reported that Google uses hundreds of thousands of networked machines to provide its Internet services, including search, Gmail, and Google Docs, and that Facebook uses a similar number of machines to power its online social networking site.
These distributed software systems are extremely complex. For example, when a user accesses an Internet service, a web server first receives the request and may forward it to an application server, which provides the actual service. The application server may in turn communicate with multiple storage servers on which the user's data is located. Such a setting is common among cloud vendors including Google and Facebook, although in practice there are many more types and far larger numbers of servers (e.g., database servers, memory caches, etc.).
Because of this complexity, it is also extremely challenging to understand and analyze the behavior and performance of such systems. For example, if a user experiences slow response times, finding the culprit among hundreds of thousands of servers is like finding a needle in a haystack.
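As a minimal illustration of the request path described above, the following Python sketch models a web server that forwards a request to an application server, which in turn contacts several storage servers. The server names, the latency figures, and the `handle` helper are all hypothetical and chosen only for illustration; they do not describe any vendor's actual system. The sketch records the time spent at each hop so that the slowest server on the path can be identified.

```python
def handle(server, trace):
    """Serve a request at `server`, calling its downstream servers one by one.

    Returns the total time in milliseconds and appends one
    (server name, own latency) entry to `trace` for every server visited.
    """
    total = server["latency_ms"]
    for child in server.get("downstream", []):
        total += handle(child, trace)
    trace.append((server["name"], server["latency_ms"]))
    return total


# Hypothetical topology: one web server, one application server,
# three storage servers. All latencies are made-up values.
storage = [{"name": f"storage-{i}", "latency_ms": 5} for i in range(3)]
storage[1]["latency_ms"] = 400  # assume one storage server is slow

app = {"name": "app", "latency_ms": 10, "downstream": storage}
web = {"name": "web", "latency_ms": 2, "downstream": [app]}

trace = []
total_ms = handle(web, trace)

# With a per-hop record, the culprit is simply the largest entry.
slowest = max(trace, key=lambda hop: hop[1])
print(f"total: {total_ms} ms, slowest hop: {slowest}")
```

With only five servers the slow hop is easy to spot. Without a per-hop record, however, the user observes only the total response time, and with hundreds of thousands of servers the same search becomes far harder.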
Problems in known systems include performance monitoring and troubleshooting, failure recovery, and optimization.
Regarding performance monitoring and troubleshooting, the performance of software services, e.g., user response time, has a significant business impact. For example, Amazon.com has found that every 100 ms of latency cost it 1% in sales, and Google has found that an extra 0.5 seconds in search page generation time reduced traffic by 20%. It is therefore important for software vendors to have tools to monitor performance and to analyze the root cause when performance is poor.
Regarding failure recovery, production software systems experience failures. For example, Google's Gmail experienced a 2-day outage in 2011, affecting hundreds of thousands of users, and Amazon's EC2 service had an outage for over 4 days in 2011. Once a failure occurs, it is important for a vendor to understand system behavior and to infer the root cause in order to recover from the failure.
Regarding optimization, software companies today spend billions of dollars on infrastructure. For example, Google spent 2.35 billion dollars on infrastructure in the first quarter of 2014 alone. Understanding the behavior of these systems can reveal opportunities to optimize their resource usage, which can have a significant financial impact.