This invention relates to large-scale computing systems. Latency is one of the more important problems to address in large-scale, distributed computing systems. In general, latency can be defined as the period of time that a component in a computing system is waiting for another component. Latency can therefore be considered “wasted time,” and is consequently undesirable. On an individual computer, latency is typically low and of little concern to the computer user, but on a large-scale distributed computing system with millions of users, it is desirable to minimize latency so that each user experiences the system as being responsive to their requests. In order to minimize latency, it is important to identify when latency degradations occur and to find potential causes for the latency degradations so that relevant alerts can be sent to administrators and other parties that are interested in the health of the computing system, so that they can take the appropriate actions to address the latency degradations.
One example of a large-scale distributed computing system is a system that is used to host an electronic mail (email) system. For example, the hosted email application “Gmail” is available from Google Inc. of Mountain View, Calif., and is run on a large distributed system. The application has millions of users and serves a very large number of requests every day on many computers. Each of these requests has some associated variables, for example, the browser used by the user, the location of the user, the location of the Gmail cluster serving the user, the action the user is taking (for example, sending mail, reading mail, and so on), just to mention a few variables. Each of these variables has a set of possible values, which can affect the latency of the requests, either by itself or in interaction with other variables. As can be realized, there can be several million of possible variable value combinations that may explain any latency degradations in the computing system. Given this, it is prohibitively computationally expensive to enumerate all the possibilities.
Thus it would be desirable to have improved techniques for navigating a large search space to find potential latency degradations and the variables that are responsible for causing the latency degradations. At the same time, it is also desirable to use computing resources efficiently, and to generate a relatively small number of meaningful alerts that can be appropriately addressed.