Internet-based services have been growing in coverage and scale. Various kinds of services are provided in various application fields, attracting an increasing number of users. Since the Internet can be accessed from anywhere at any time, the systems operating such services tend to experience significant variations in demand. Preparing server resources for peak demand could therefore lead to overcapacity in ordinary times, leaving most of the resources idle. Conversely, optimizing the amount of server resources for ordinary usage could result in a shortage of processing capacity at the time of peak demand and a consequent loss of business opportunity.
Utility computing (UC) techniques have been developed as a solution to the above problem. UC systems optimize the allocation of server resources dynamically in accordance with time-varying demand. “Capacity on Demand” (CoD) is another name for this type of technique.
FIG. 22 is a simplified block diagram depicting an Internet data center (IDC) as an example of a UC system. The illustrated IDC 900 is formed from systems 910 and 920 serving UC users, a server pool 930, and an IDC manager 940.
One system 910 includes a service providing section 911 to provide services, and a system manager 912 to manage the system 910. Another system 920 includes a working system 921a and a backup system 921b constituting a service providing section 921 with a dual-redundant structure. Also included in the system 920 is a system manager 922.
The server pool 930 is a collection of spare servers 931, 932, . . . for shared use by a plurality of IDC users. Each system 910 and 920 is formed from as many servers as necessary to handle regular demand from clients 961 and 962 on the Internet 950. Those systems may, however, encounter a surge of demand exceeding their processing capacity. In such a case, the IDC manager 940 supplies the requesting system with as many spare servers 931, 932, . . . as necessary, out of the shared server pool 930.
The above mechanism permits the systems 910 and 920 to receive a sufficient amount of server resources when they are necessary. The service fees paid to the provider of IDC 900 are charged on a usage basis. The users of such IDC services enjoy several advantages, namely, more choices for their IT investment, reduction of operational costs, and expedited solutions for their business.
Conventional UC systems determine which servers to allocate and how many servers to allocate, based on performance requirements (e.g., processing capacity of servers). When the demand increases, the UC system allocates servers to offer a minimum required performance. The UC system also uses demand estimation techniques to estimate variations of demand, thereby identifying a period when the expected demand may overwhelm the currently allocated servers. This period is referred to herein as an estimated capacity shortage period. By dynamically allocating servers based on such forecasts, the UC system can be prepared for an actual surge of demand.
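The notion of an estimated capacity shortage period can be illustrated with a minimal Python sketch. It assumes that demand forecasts are available as a list of per-interval values and that every spare server contributes the same fixed capacity. The function names and the numbers in the usage note are hypothetical and are not taken from any particular UC system.

```python
import math


def shortage_periods(forecast, capacity):
    """Return (start, end) index pairs, end exclusive, for every run of
    intervals in which the forecast demand exceeds the current capacity."""
    periods = []
    start = None
    for t, demand in enumerate(forecast):
        if demand > capacity and start is None:
            start = t                      # a shortage run begins here
        elif demand <= capacity and start is not None:
            periods.append((start, t))     # the run ended at the previous interval
            start = None
    if start is not None:                  # the forecast ends during a shortage
        periods.append((start, len(forecast)))
    return periods


def extra_servers_needed(forecast, period, capacity, per_server_capacity):
    """Smallest number of identical spare servers that covers the peak
    forecast demand within the given shortage period (hypothetical rule)."""
    start, end = period
    peak = max(forecast[start:end])
    return math.ceil((peak - capacity) / per_server_capacity)
```

For example, with a forecast of [80, 90, 120, 150, 110, 70] requests per interval and a current capacity of 100, the sketch reports one shortage period covering intervals 2 through 4. With spare servers that each handle 25 requests per interval, two additional servers cover the peak.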
The allocated servers work in a target system to satisfy its performance requirement. Such a target system is also designed to provide enhanced reliability, so that it does not stop providing services because of a server failure. The process of dynamic allocation should thus select servers taking into consideration their differences in reliability. For example, Japanese Laid-open Patent Publication No. 07-271699 proposes a technique for allocating peripheral devices based on the failure rates of those individual peripheral devices.
Conventional UC systems, however, select servers for dynamic allocation depending solely on performance requirements, without regard to the reliability of servers. Performance indices used for this purpose depend on the model of servers and the clock frequency of their central processing units (CPU). Servers of the same model with the same CPU clock frequency are regarded as having the same performance and are thus treated equally in the dynamic allocation. Actually, however, those servers may have different levels of reliability even if they have comparable performance indices. More specifically, they have different failure rates, availabilities, mean time to failure (MTTF) values, mean time to repair (MTTR) values, or other reliability parameters, depending on how long they have been operating. (While MTBF, or mean time between failures, may be used as a synonym of MTTF, the rest of this description will use MTTF.) For this reason, server allocation based on performance requirements alone could select a server that performs satisfactorily at the moment but has a potential problem in its reliability. The lack of sufficient reliability could lead to degradation of services because of failure of allocated servers.
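The relationship between these reliability parameters can be sketched as follows. The sketch uses the standard steady-state formula, availability = MTTF / (MTTF + MTTR), and takes the failure rate as the reciprocal of MTTF, which holds under the assumption of a constant failure rate. The Server class and its field names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical record of one server's performance and reliability data."""
    name: str
    cpu_clock_ghz: float   # performance index used by conventional allocation
    mttf_hours: float      # mean time to failure
    mttr_hours: float      # mean time to repair

    @property
    def availability(self):
        # Steady-state availability: fraction of time the server is operational.
        return self.mttf_hours / (self.mttf_hours + self.mttr_hours)

    @property
    def failure_rate_per_hour(self):
        # Assumes a constant failure rate, so the rate is 1 / MTTF.
        return 1.0 / self.mttf_hours


# Two servers with identical performance indices but different reliability.
newer = Server("server_a", 3.0, mttf_hours=9990.0, mttr_hours=10.0)
older = Server("server_b", 3.0, mttf_hours=990.0, mttr_hours=10.0)
```

Here the two hypothetical servers are indistinguishable by CPU clock frequency, yet the availability of the first is about 0.999 while that of the second is about 0.990. A selection based on the performance index alone would treat them as equal.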
If such an unreliable server fails, the UC system will replace that server with a newly allocated server to make up for the resulting shortage of processing capacity. This allocation of an alternative server is performed upon detection of a capacity shortage due to server failure. The UC system therefore has to operate for a while with insufficient server resources until the allocation is made, which could lead to a loss of business opportunity. Furthermore, the resulting downtime could spoil the system's reputation for reliability.
As can be seen from the above discussion, conventional UC systems have a drawback in their dynamic server allocation functions. Their lack of reliability makes it difficult to implement such functions in the applications that require reliability in the first place (e.g., financial systems and air-traffic control systems).
Those reliability-oriented systems often adopt a redundant structure, which replaces a working system with a backup system in case of failure. The design and setup of such a redundant system require work by human engineers. The system in operation may require additional servers to deal with an increased service demand. Each time the need for a modification to the system arises, the system engineers have to reconfigure the system manually, taking its performance and reliability into consideration. It is therefore difficult to apply such a system to Internet-based services, the demand for which tends to vary significantly.
Some systems may refer to failure rates in determining which device to allocate, as mentioned earlier. Those systems, however, do not consider the reliability requirement that the allocated devices are supposed to operate properly throughout the estimated capacity shortage period.
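To make this requirement concrete, the following Python sketch estimates the probability that a server keeps operating throughout a shortage period and filters spare servers by that probability. It assumes exponentially distributed times to failure, so that the probability of surviving a period of length T is exp(-T / MTTF). This model, the function names, and the threshold are assumptions chosen for illustration; they are not a method described in this document or in the cited publication.

```python
import math


def survival_probability(mttf_hours, period_hours):
    """Probability of operating without failure for the whole period,
    assuming an exponential lifetime model (an illustrative assumption)."""
    return math.exp(-period_hours / mttf_hours)


def pick_spares(spares, period_hours, min_probability, count):
    """Pick up to `count` spare servers whose survival probability over the
    shortage period meets the threshold, most reliable first.

    `spares` maps a hypothetical server name to its MTTF in hours.
    """
    ranked = sorted(
        spares.items(),
        key=lambda item: survival_probability(item[1], period_hours),
        reverse=True,
    )
    eligible = [
        name
        for name, mttf in ranked
        if survival_probability(mttf, period_hours) >= min_probability
    ]
    return eligible[:count]
```

For a 24-hour shortage period and a threshold of 0.95, hypothetical spares with MTTF values of 1000, 100, and 5000 hours have survival probabilities of roughly 0.976, 0.787, and 0.995, so only the first and third qualify. A selection based on failure rate alone, without reference to the length of the period, would not reveal whether a given spare meets such a threshold.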