As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
One or more cooling fans are typically employed within the chassis of information handling system platforms, such as servers, to cool components operating within the information handling system chassis. Such cooling fans may be uncontrolled, i.e., running at full power whenever the information handling system is in a powered-on state. However, cooling fans consume power, create noise, and create airflow, each of which becomes of greater concern in a data center where a plurality of information handling system platforms may be operating, e.g., as servers. Cooling fans may also be controlled based on ambient temperature within an information handling system chassis.
Thermal tables have been provided in system memory that specify fan speed RPM values for each respective cooling fan of an information handling system platform at a given temperature (or alternatively at a given range of sensed temperature) within the chassis enclosure of an information handling system platform. The specified fan speed (e.g., RPM) values and baseline temperature response of such a thermal table are pre-defined based on thermal engineering and default thermal loadings for different system components, and are selected to help ensure sufficient cooling of the components of a given default system configuration that includes a specific default number and type/s of system components. As the sensed operating temperature within the system chassis increases or decreases, the fan speed of each of the given system cooling fans is automatically increased or decreased according to a pre-defined linear (X-Y) relationship of the thermal table that specifies increasing fan speed with increasing temperature.
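The thermal table lookup described above can be sketched as follows. This is a minimal illustration only: the table entries, the name THERMAL_TABLE, and the function fan_speed_rpm are hypothetical and do not come from any particular platform. The sketch assumes a single fan and interpolates linearly between table entries, clamping outside the table range.

```python
from bisect import bisect_right

# Hypothetical thermal table for one fan: (temperature in deg C, fan speed in RPM).
# The values are illustrative only and are not taken from any real platform.
THERMAL_TABLE = [(25.0, 3000), (35.0, 6000), (45.0, 9000), (55.0, 12000)]


def fan_speed_rpm(temp_c, table=THERMAL_TABLE):
    """Return a fan speed for a sensed temperature by linear interpolation.

    Temperatures below the first entry or above the last entry are clamped
    to the minimum or maximum fan speed in the table.
    """
    temps = [t for t, _ in table]
    if temp_c <= temps[0]:
        return table[0][1]
    if temp_c >= temps[-1]:
        return table[-1][1]
    i = bisect_right(temps, temp_c)  # index of the first entry above temp_c
    (t0, r0), (t1, r1) = table[i - 1], table[i]
    return r0 + (r1 - r0) * (temp_c - t0) / (t1 - t0)
```

In this sketch a sensed temperature midway between two table entries yields a fan speed midway between the two corresponding RPM values, so fan speed rises monotonically with temperature as the table specifies.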
Thermal control techniques have been developed for information handling system platforms in an attempt to reduce power consumption, airflow and acoustic noise generated by cooling fans. Such techniques include proportional-integral-derivative (PID) control loop feedback. Closed loop thermal control techniques have utilized temperature information provided from individual components of an information handling system platform, such as central processing unit (CPU), hard disk drive (HDD) and redundant array of independent disks (RAID) hardware (RAID card). Conventional CPU thermal control for information handling system platforms such as servers is typically implemented based on a combination of register values that are read directly from the CPU and system manufacturer-defined table values that are based on system characterization and maintained in system memory. Although many server components have a fixed temperature requirement that is stable throughout development of an information handling system architecture, CPU thermal requirements are often changed by the CPU manufacturer for a given type of CPU during the course of the information handling system platform architecture development. To further complicate matters, CPU thermal requirements may also vary as a function of CPU stock keeping unit (SKU), and loading and performance settings, e.g., enhanced halt state (C1E) disabled. CPU manufacturers reserve the right to change CPU thermal requirements and register values until late in system manufacturer development phases. Occasionally, the requirements are incorrectly documented. Other times, qualification sample parts are incorrectly programmed with faulty values, resulting in bad thermal settings in system manufacturer thermal control tables.
FIGS. 1-4 illustrate CPU die operating temperature as a function of time where conventional thermal control setpoint methodology is used to control information handling system cooling fan speed in real time based on sensed changes in CPU die operating temperature of a server for four respective different CPU devices of the same type of CPU device. In each case, the real time server CPU temperature is sensed by integrated CPU digital thermal sensing circuitry and reported (as an offset value relative to the CPU thermal throttling temperature threshold described below) to a server baseboard management controller (BMC) that controls cooling fan speed based on the reported CPU temperature using fixed controller gains. In FIGS. 1-4, the CPU thermal throttling temperature threshold is a register value maintained by the given CPU device that represents a CPU temperature threshold set by the CPU manufacturer and above which the CPU provides a critical temperature warning and a CPU thermal throttling control circuit of the CPU is activated by a thermal monitor of the CPU to reduce the CPU die temperature using clock modulation and/or by throttling down the CPU clock speed and operating input voltage until the sensed CPU temperature again drops below the CPU thermal throttling temperature threshold. A separate CPU fan control target temperature setpoint value is a static CPU register value that is set by the CPU manufacturer below the maximum CPU temperature (or CPU thermal throttling temperature threshold). The CPU fan control target temperature setpoint value is a target or desired CPU die operating temperature that is set by the manufacturer for the individual CPU device.
In FIGS. 1-4, the CPU fan control target temperature setpoint value is intended to be read from the CPU register and used by the BMC as a setpoint for increased fan speed control for cooling the CPU in an attempt to maintain the real time CPU die operating temperature at the retrieved CPU fan control target temperature setpoint value and to prevent the real time CPU die operating temperature from reaching the CPU thermal throttling temperature threshold. However, actual CPU cooling characteristics may vary due to the particular cooling characteristics of a given server system platform configuration, including the particular combination and geometry of system heat-generating components, cooling fans and chassis enclosure. Thus, CPU thermal throttling temperature threshold may be exceeded in some cases due to system cooling capability limits even when the server BMC uses the retrieved CPU fan control target temperature setpoint value to trigger the system cooling fans. In an attempt to prevent this from occurring, a manufacturer of an information handling system platform may decide to set its own thermal control setpoints during system development that are more stringent than the maximum CPU temperature/CPU thermal throttling temperature threshold and CPU fan control target temperature setpoint values. These system manufacturer thermal control setpoints may be used to control cooling fan speed based on actual information handling system component configuration, chassis configuration, and system cooling fan characteristics/capacity in an attempt to ensure that the CPU temperature never reaches CPU thermal throttling temperature threshold and CPU thermal throttling activation even when the CPU fan control target temperature setpoint value is set too high for the actual server system cooling capability.
Specifically, in FIGS. 1-4 a system fan control target setpoint temperature value may be set by a manufacturer of an information handling system platform at a CPU temperature below the CPU fan control target temperature setpoint value and stored in BMC non-volatile memory as a fixed fan control target (FTC) offset value below the fixed CPU fan control target temperature setpoint value read from the CPU register of a given CPU of a particular information handling system platform instantiation as shown by the downward arrow in FIGS. 1-4. Thus, the system manufacturer system fan control target setpoint temperature value across different information handling system platform instantiations is allowed to automatically move upward or downward as CPU manufacturer changes to the CPU fan control target temperature setpoint value occur between different CPU devices, e.g., changes due to different CPU stock keeping unit (SKU), different loading and performance settings, etc. In this regard, the CPU fan control target temperature setpoint value is a fixed value for each individual CPU device part, but may vary between different individual CPU device parts of the same type, such that a first individual CPU device installed in a first information handling system platform instantiation is provided by the CPU manufacturer with a different CPU fan control target temperature setpoint value than a second individual CPU device of the same type of CPU device that is installed in a separate second information handling system platform instantiation.
Thus, the resulting system manufacturer system fan control target setpoint temperature is automatically set by the BMC of the second information handling system platform to a different temperature value than the system manufacturer system fan control target setpoint temperature set by the BMC of the first information handling system platform, even though the second CPU device installed in the second information handling system platform is the same type of CPU device as the first CPU device installed in the first information handling system platform.
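The offset arithmetic described above can be illustrated with a short sketch. The function name system_fan_setpoint, the 5 degree offset, and the two register values are hypothetical and chosen only to show how the same fixed offset produces different system setpoints for two CPU devices of the same type.

```python
def system_fan_setpoint(cpu_fan_target_c, ftc_offset_c):
    """System fan control target = CPU-register fan control target minus a fixed offset."""
    if ftc_offset_c < 0:
        raise ValueError("the offset is expressed as a non-negative value below the CPU target")
    return cpu_fan_target_c - ftc_offset_c


# Hypothetical fixed offset stored in BMC non-volatile memory (deg C).
FTC_OFFSET_C = 5.0

# Hypothetical register values read from two CPU devices of the same type (deg C).
first_platform = system_fan_setpoint(80.0, FTC_OFFSET_C)
second_platform = system_fan_setpoint(86.0, FTC_OFFSET_C)

# The system setpoint tracks the CPU register value, so it differs between platforms.
print(first_platform, second_platform)
```

Because the offset is fixed while the register value is not, any change the CPU manufacturer makes to the register value shifts the system setpoint by the same amount.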
Still referring to FIGS. 1-4, the information handling system manufacturer may also set a specified system power capping threshold temperature value in BMC non-volatile memory that is below the CPU thermal throttling temperature threshold and above the CPU fan control target temperature setpoint value specified and set by the CPU manufacturer. This system manufacturer power capping threshold temperature value is supposed to represent a CPU temperature value below the CPU thermal throttling temperature threshold where the system BMC initiates CPU throttling. However, as further described below, CPU thermal throttling temperature threshold may be changed upwards or downwards between different CPU devices of the same type without notice by the CPU manufacturer, sometimes in a manner that places CPU thermal throttling temperature threshold below the system manufacturer power capping threshold temperature value.
FIG. 1 illustrates an example where the conventional thermal control setpoint methodology described above is working correctly for a first given CPU device. As shown in FIG. 1, CPU die operating temperature increases with time and triggers cooling fan speed increase by the system BMC when the CPU temperature reaches the system fan control target setpoint temperature value set by the information handling system manufacturer. With the increased cooling fan speed, the CPU temperature is controlled to the specified system manufacturer system fan control target setpoint temperature such that the CPU temperature does not exceed either the system manufacturer power capping threshold temperature or the CPU thermal throttling temperature threshold, and thus CPU thermal throttling control is not activated and no CPU throttling or temperature warnings are required.
In the conventional example of FIG. 2, the CPU manufacturer has relaxed (or raised) the CPU fan control target temperature setpoint value stored in the register of a second and different given CPU device that is the same type of CPU device as the first CPU device of FIG. 1. This may be done by a CPU manufacturer, for example, in an attempt to increase performance of a given type of CPU and/or to decrease cooling fan use. As shown in FIG. 2, the CPU fan control target temperature setpoint value has been raised in this case above the power capping threshold temperature value set by the information handling system manufacturer, and this in turn raises the system manufacturer system fan control target setpoint value used by the system BMC. Thus, in FIG. 2 cooling fan speed increase is not triggered until a higher CPU temperature than was the case in the example of FIG. 1. This causes the CPU die operating temperature to overshoot or exceed both the system manufacturer power capping threshold temperature and the CPU thermal throttling temperature threshold, in turn causing CPU throttling and CPU thermal throttling activation before the CPU temperature is eventually controlled to the system manufacturer system fan control target setpoint temperature.
In the example of FIG. 3, the CPU manufacturer has lowered the CPU fan control target temperature setpoint value stored in the register of a third and different given CPU device that is the same type of CPU device as the first and second CPU devices of respective FIGS. 1 and 2. This lowering of CPU fan control target temperature between different CPU devices is represented by the downward pointing cross-hatched arrow in FIG. 3. Thus, in FIG. 3, cooling fan speed increase is triggered earlier at a lower CPU temperature than was the case in the example of FIG. 1. In this case, the temperature control gain value/s used by the BMC to regulate cooling fan speed are tuned by the information handling system platform manufacturer for the original higher CPU fan control target temperature setpoint value of FIG. 1. Consequently, CPU temperature and fan speed response are not stable, but rather oscillate as shown above and below the system manufacturer system fan control target setpoint temperature and may exceed the system manufacturer power capping threshold as shown, resulting in undesired CPU throttling initiated by the BMC.
In the example of FIG. 4, the CPU manufacturer has lowered the CPU thermal throttling temperature threshold value stored in the register of a fourth and different given CPU device that is the same type of CPU device as the first, second and third CPU devices of respective FIGS. 1, 2 and 3. In this case the CPU thermal throttling temperature threshold has been lowered below the fixed or static power capping threshold temperature value set in the BMC by the information handling system platform manufacturer. Consequently, CPU temperature is allowed to exceed the new CPU thermal throttling temperature threshold before reaching the system manufacturer power capping threshold temperature, which results in the undesirable consequences of a critical temperature warning and CPU thermal throttling activation described above.
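The setpoint conflicts of FIGS. 2 and 4 amount to violations of the intended ordering in which the CPU fan control target lies below the system manufacturer power capping threshold, which in turn lies below the CPU thermal throttling threshold. The sketch below checks that ordering. It is an illustration only: the class Setpoints, the function ordering_problems, and all temperature values are hypothetical. It also does not capture the FIG. 3 case, where the ordering is intact and the instability comes from controller gains tuned for a different setpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setpoints:
    cpu_fan_target_c: float  # read from the CPU register, may change between CPU devices
    cpu_throttle_c: float    # read from the CPU register, may change between CPU devices
    ftc_offset_c: float      # fixed value in BMC non-volatile memory
    power_cap_c: float       # fixed value in BMC non-volatile memory

    @property
    def system_fan_target_c(self):
        """System fan control target derived from the CPU register value."""
        return self.cpu_fan_target_c - self.ftc_offset_c


def ordering_problems(sp):
    """Return descriptions of any violations of the intended threshold ordering."""
    problems = []
    if sp.cpu_fan_target_c >= sp.power_cap_c:
        problems.append("CPU fan control target is at or above the power capping threshold")
    if sp.cpu_throttle_c <= sp.power_cap_c:
        problems.append("CPU thermal throttling threshold is at or below the power capping threshold")
    return problems


# Hypothetical values (deg C) loosely mirroring the situations of FIGS. 1, 2 and 4.
fig1_like = Setpoints(cpu_fan_target_c=80.0, cpu_throttle_c=95.0, ftc_offset_c=5.0, power_cap_c=90.0)
fig2_like = Setpoints(cpu_fan_target_c=92.0, cpu_throttle_c=95.0, ftc_offset_c=5.0, power_cap_c=90.0)
fig4_like = Setpoints(cpu_fan_target_c=80.0, cpu_throttle_c=88.0, ftc_offset_c=5.0, power_cap_c=90.0)

for name, sp in (("FIG. 1", fig1_like), ("FIG. 2", fig2_like), ("FIG. 4", fig4_like)):
    print(name, ordering_problems(sp) or "ordering intact")
```

The check makes explicit why fixed BMC values are fragile: two of the four quantities are read from the CPU and can move, while the other two stay where the system manufacturer put them.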
As illustrated in FIGS. 1-4, there is a significant risk of having information handling system manufacturer-specified thermal control parameters/setpoints that are not optimized for the CPU thermal control parameters/setpoints of the CPU devices that are actually installed and/or shipped with a given information handling system platform instantiation. This mismatch can result in factory or field failures. Moreover, changes in CPU manufacturer thermal requirements and/or thermal parameter register values create system manufacturer workload churn for thermal control tuning and validation, and drive additional builds and code validation. Additionally, taking advantage of temperature relief provided by the CPU thermal profile results in risk of CPU thermal throttling activation.
A CPU thermal profile has also been specified by the CPU manufacturer and stored in the CPU register to define a relationship between a CPU fan control target setpoint temperature and CPU operating power. In such a case, the BMC may read the particular value of CPU fan control target setpoint temperature from the CPU thermal profile at the current CPU operating power, and use this read CPU thermal profile value as the fan control target setpoint temperature value for fan speed control at the current CPU operating power. The fan control target setpoint temperature from the CPU thermal profile increases with increasing CPU operating power to allow the CPU fan control target setpoint temperature value to eventually equal the specified CPU thermal throttling temperature threshold value for the CPU, which causes risk of CPU thermal throttling activation in the case of slight CPU temperature overshoot above the CPU thermal throttling temperature threshold value.
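The shrinking margin described above can be shown numerically. The sketch below assumes a linear profile capped at the throttling threshold. The linear shape, the function names, and every numeric value are assumptions made for illustration, since the only stated property of the profile is that the setpoint rises with CPU operating power until it equals the throttling threshold.

```python
def profile_fan_target_c(power_w, base_c=70.0, slope_c_per_w=0.2, throttle_c=95.0):
    """Hypothetical thermal profile: setpoint rises linearly with power, capped at the throttle threshold."""
    return min(base_c + slope_c_per_w * power_w, throttle_c)


def overshoot_margin_c(power_w, throttle_c=95.0, **profile_kwargs):
    """Temperature headroom between the profile setpoint and the throttling threshold."""
    return throttle_c - profile_fan_target_c(power_w, throttle_c=throttle_c, **profile_kwargs)


# Headroom shrinks as CPU operating power increases (all values hypothetical).
for power_w in (50.0, 100.0, 125.0, 150.0):
    print(power_w, profile_fan_target_c(power_w), overshoot_margin_c(power_w))
```

Once the headroom reaches zero, any overshoot of the setpoint, however slight, crosses the throttling threshold, which is the risk the paragraph above identifies.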