1. Field of the Invention
The invention pertains generally to distributed computing systems. More specifically, the invention relates to monitoring client device request time and server servicing time in order to detect performance problems of a distributed computing system and automatically issue alerts.
2. Description of the Related Art
A typical distributed computing system includes client devices and servers coupled via a computer network. Clients make requests via the network, and servers process the requests and return results to the clients via the network.
One benefit of distributing a computing system using servers and clients is the convenience of being able to access data on one or more central servers from a client device physically located anywhere there is an available network connection. This convenience may further result in significant financial gains for a company taking advantage of a distributed computing system.
Distributed call centers are a good example of how distributed systems may be used to reduce a company's costs. By locating call centers in various locations throughout the world, a service company is able to take advantage of benefits such as the language abilities of local speakers, lower wages, and local time zones. Each call center may be responsible for handling support for a particular set of the incoming calls. For example, the various call centers could divide incoming calls according to geographic area, time periods of the day, or language requirements of the caller.
In another example, an airline company may reduce its office space overhead and employee turnover by having reservation agents operate out of their own homes. The personal residence of each reservation agent may be equipped with a client device such as a thin client computer terminal, an Internet connection, and a dedicated telephone line. The airline's reservation call system can then direct incoming passenger calls to reservation agents at their homes.
In an example unrelated to call centers, a franchisor may have many retail outlets spread over a large geographical area. To make sure sales are accurately reported for profit sharing purposes, each franchisee may be equipped with one or more point-of-sale (POS) terminals that automatically confirm purchases and track inventory in real time with a central server. When supplies at a particular location begin to run low, a delivery truck can be automatically dispatched.
Distributed systems are so common that many people don't even realize they are using them. For example, each time a person withdraws money from an automated teller machine (ATM) or uses a credit card, that person is actually interacting with a distributed system client device, i.e., the ATM or the credit card swipe device. Inside the bank, tellers operate computer terminals that may simply be additional client devices in the same distributed system. Another example is the Internet-based World Wide Web (WWW), where a user's web browser running on a laptop is a client device and the web site is a server.
Client devices may also at times operate as servers and vice versa, such as in peer-to-peer distributed systems where there is no “central” server. Instead, each client may also be a server to other clients.
Performance of a distributed system is affected by a number of factors. Server overloading can cause anything from minor delays seen at a client device while the server processes requests from other clients, to prolonged “freezes” where a client device may appear completely unresponsive to a user while it waits for a server response. Software and hardware problems at the server can have similar effects. Examples of software problems include configuration errors such as incorrectly assigned network addresses or security certificates, database problems such as missing or faulty indexes, and poor programming in general such as non-optimal algorithm design. Hardware problems can be due to failing disk drives and memory, overheating, and electrostatic and radio-frequency (RF) interference, to name a few. Each of these issues may cause a server to suffer poor performance. Similar problems may also affect client devices, and the interconnecting network between a client and server may also contribute, sometimes severely, to performance problems. Computer networks typically involve interconnection between several intermediate control devices, for example, routers, gateways, and switches. These control devices can themselves become overloaded or suffer from hardware and software problems. Additionally, the various wired and wireless communication links of a network may be of different bandwidth capacities, the slowest of which will generally limit the maximum throughput between a client and server, while the delay of each link adds to the overall latency.
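By way of illustration only, the effect of link capacities on a client-server path can be sketched as follows. This is a minimal sketch under simplifying assumptions (a single fixed path, no congestion or queuing); the names Link, path_throughput_mbps, and path_delay_ms and all numeric values are hypothetical and do not appear in the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One hop between client and server (hypothetical model)."""
    bandwidth_mbps: float  # capacity of the link
    delay_ms: float        # one-way delay contributed by the link


def path_throughput_mbps(links):
    # The slowest link bounds the throughput of the whole path.
    return min(link.bandwidth_mbps for link in links)


def path_delay_ms(links):
    # Each link adds its own delay to the end-to-end latency.
    return sum(link.delay_ms for link in links)


# Illustrative path: fast LAN, slow wireless hop, mid-speed backbone.
path = [Link(1000.0, 0.5), Link(10.0, 20.0), Link(100.0, 5.0)]

if __name__ == "__main__":
    print(path_throughput_mbps(path), path_delay_ms(path))
```

In this sketch, upgrading the fast LAN or the backbone link would leave the path throughput unchanged, because the slow wireless hop remains the limiting link.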
Performance problems can wreak havoc on a distributed system, especially one that is related to customer service and operates in real time. “Time is money” is an often-used adage that is very applicable to performance problems in distributed systems. Taking a distributed call center system as an example, when a telephone agent spends a few minutes of each call in silence, or explaining to the caller that the agent's computer is “acting up” while desperately trying to get the computer to hurry up and provide required information, this is a financial burden on the company. Customer satisfaction will be lowered and the company's reputation may suffer as a result. Sales could also be lost due to busy signals or long hold times for other callers trying to get connected with an agent. The company may only become aware of the problem when either customers or telephone agents begin complaining, by which time the company has certainly already been negatively affected. Furthermore, it may be very difficult to determine why the system is running so slowly, and hiring extra telephone agents in an attempt to reduce the backlog of callers on hold may not help the situation because the extra usage of the distributed system by an increased number of agents may make it even slower.
To prevent performance problems, careful planning is needed to calculate the technical requirements a distributed system must meet in order to handle the actual load. As it is nearly impossible in most practical cases to perfectly anticipate load, the typical solution is to simply over-provision the whole system as much as possible and hope things don't get too slow during peak usage. However, over-provisioning, especially in parts of the system where it is not needed, is expensive and wasteful.