Operators of data centers seek ways to reduce the operating expenses (opex) of their data centers and to improve performance. A data center typically includes many servers (among other computing systems/applications) designed to execute different applications. In most cases, servers are designed or tuned to perform well across many different types of applications. However, currently available solutions cannot maintain high server performance for every possible application. For example, current state-of-the-art methods for server tuning cannot dynamically adapt to two different applications executed by a server and achieve the highest performance for each of them.
Another challenge in managing computing systems in data centers is balancing the tradeoff between performance and energy consumption per unit of work. That is, tuning a server to achieve higher performance typically increases the energy consumed per unit of work.
One of the major data center operating expenses is energy, consumed largely by servers and coolers. An important data center energy efficiency metric is the power usage effectiveness (PUE) rating. The PUE rating is the ratio of the total data center power consumption to the power consumed by the IT equipment. The ideal PUE is 1.0. When computing the PUE, every device that consumes power in the data center is considered, e.g., lighting, cooling, and so on. A PUE rating of 2.0 means that, for each watt consumed by the servers, another watt is consumed by the data center infrastructure. Thus, it is desirable to reduce the power consumption of the servers, as such a reduction would also reduce the power consumption of the infrastructure of the data center.
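The PUE arithmetic described above can be illustrated with a minimal sketch. The function names and the wattage figures are hypothetical, chosen only to reproduce the PUE of 2.0 mentioned in the text. The last function additionally assumes that the PUE stays constant when the server load drops, which is a simplification and may not hold in a real facility.

```python
def pue(total_facility_watts: float, it_equipment_watts: float) -> float:
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_equipment_watts <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_watts / it_equipment_watts


def overhead_per_it_watt(pue_value: float) -> float:
    """Watts consumed by infrastructure (cooling, lighting, ...) per IT watt."""
    return pue_value - 1.0


def facility_savings(it_watts_saved: float, pue_value: float) -> float:
    """Total facility watts saved for a given IT saving.

    Assumption (not from the source): the PUE is unchanged by the saving.
    """
    return it_watts_saved * pue_value


# Hypothetical figures: 1000 W of IT load in a facility drawing 2000 W in total.
rating = pue(2000.0, 1000.0)
print(rating)                           # PUE of the example facility
print(overhead_per_it_watt(rating))     # infrastructure watts per server watt
print(facility_savings(100.0, rating))  # facility-wide saving for 100 W saved at the servers
```

Under the constant-PUE assumption, every watt saved at the servers saves a proportionally larger amount at the facility level, which is why server power consumption is the natural target for optimization.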
Several solutions have been proposed to reduce the power consumption of data centers. Some solutions relate to infrastructure of the data centers, while others deal with the hardware resources of servers or other devices in the data centers.
For example, low-power processors may be a simple solution to reduce power consumption. However, such processors impose performance limitations, and thus may not be a desirable solution. Memory controllers, adapters, disk drives, and other hardware peripheral devices account for a large fraction of the power consumption of a computer server, and cannot be neglected. CPUs and these peripheral devices employ power management features that help reduce power consumption. However, each peripheral device is power-managed independently and is not optimized with respect to the executed application and/or the operation of other peripheral devices.
A computing server typically includes various hardware, firmware, and software components. Some proposed solutions discussed in the related art include manually tuning certain parameters of a server's components to a set of benchmarks for energy, performance, or power capping. However, such a solution tends to suffer from high labor costs and suboptimal results.
Further, manual tuning of servers is a complex process for several reasons: the optimal parameter settings may differ from one application to another; the optimal parameter settings for a given application may change from one hardware configuration to another; and so on. In addition, the complexity of the tuning process results from the large number of tunable parameters (typically hundreds in today's systems) that depend on each other.
Due to the complexity and the time required, manual tuning is performed by experts, if at all, only on a subset of applications, parameters, or hardware configurations, thereby achieving suboptimal performance. Further, tuning needs to be performed on an ongoing basis, thereby incurring additional labor costs.
Manual tuning can achieve only suboptimal results, because such tuning is not responsive to the current workload of the server. That is, the current workload may differ from the benchmarks, and the workload itself may exhibit different phases of execution, each requiring a different set of parameter values.
In sum, the existing solutions for the above-mentioned problems cannot adapt dynamically to changes in the application(s), process(es), and task(s) that are being performed.
It would therefore be advantageous to provide a solution that would overcome the deficiencies noted above.