In semiconductor architectures, failure mechanisms such as electromigration and stress migration in interconnects, time-dependent dielectric breakdown, and thermal cycling accelerate as temperature increases. In particular, stress migration and time-dependent dielectric breakdown have an exponential temperature dependence.
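One common way to express an exponential temperature dependence is an Arrhenius-type acceleration factor. The text above does not name a specific model, so the relation itself, the function name, and the 0.7 eV activation energy used below are illustrative assumptions rather than values taken from this document.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_use_c, t_stress_c, activation_energy_ev):
    """Ratio of failure rates at t_stress_c versus t_use_c (Arrhenius model).

    Assumption: the failure mechanism follows rate ~ exp(-Ea / (k * T)).
    Temperatures are given in degrees Celsius and converted to kelvin.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_use_k - 1.0 / t_stress_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    # Hypothetical activation energy of 0.7 eV, chosen only for illustration.
    factor = arrhenius_acceleration(60.0, 85.0, 0.7)
    print(f"Acceleration from 60 C to 85 C: {factor:.2f}x")
```

Under this assumed model, a modest rise in operating temperature multiplies the failure rate, which is why lifetime is so sensitive to hotspots.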
Unmanaged temperatures in semiconductor architectures can create a positive feedback loop between temperature and leakage power, yielding thermal runaway. High temperatures can also cause timing errors and clock skew, and they affect carrier mobility and threshold voltage in MOSFETs. Accordingly, it is important to monitor and manage on-chip temperatures in order to maximize device lifetimes and ensure computational correctness.
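The temperature/leakage feedback loop described above can be illustrated with a lumped thermal model. This is a minimal sketch under stated assumptions: a single thermal resistance, leakage power that grows exponentially with temperature, and hypothetical parameter values. None of these names or numbers come from this document.

```python
import math


def steady_state_temperature(p_dynamic_w, r_thermal_c_per_w, t_ambient_c=45.0,
                             p_leak_ref_w=5.0, t_ref_c=45.0,
                             leak_coeff_per_c=0.02, t_limit_c=150.0,
                             max_iters=1000, tol=1e-6):
    """Iterate T = T_amb + R_th * (P_dyn + P_leak(T)) until it settles.

    Returns the settled temperature in degrees Celsius, or None if the
    iteration exceeds t_limit_c or fails to converge (treated as runaway).
    All default parameter values are hypothetical.
    """
    t = t_ambient_c
    for _ in range(max_iters):
        # Assumed leakage model: exponential growth with temperature.
        p_leak = p_leak_ref_w * math.exp(leak_coeff_per_c * (t - t_ref_c))
        t_next = t_ambient_c + r_thermal_c_per_w * (p_dynamic_w + p_leak)
        if t_next > t_limit_c:
            return None  # temperature keeps climbing: thermal runaway
        if abs(t_next - t) < tol:
            return t_next
        t = t_next
    return None


if __name__ == "__main__":
    print(steady_state_temperature(20.0, 0.5))  # well-cooled: settles
    print(steady_state_temperature(40.0, 2.0))  # poorly cooled: None
```

With good heat removal the loop gain stays below one and the temperature settles. With a high thermal resistance, each temperature increase adds more leakage than the package can dissipate, and the model never reaches a stable point.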
Temperature can also be used as an observable test output for identifying defective integrated circuit components. In a many-core platform, chip hotspots are workload-dependent. In order to maximize performance and reliability in these devices, tasks should be scheduled in a thermally aware manner so that hotspot temperatures do not exceed a set threshold value and thermal gradients are minimized.
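As a rough illustration of thermally aware scheduling, the sketch below greedily places each task on the currently coolest core and defers any task that would push that core past the threshold. The heat model (each task adds a fixed temperature rise to its core) and all names are simplifying assumptions, not a method described in this document.

```python
def schedule_tasks(core_temps_c, task_heat_c, threshold_c):
    """Greedy thermally aware placement.

    core_temps_c: current temperature of each core (degrees Celsius).
    task_heat_c:  assumed temperature rise each task causes on its core.
    Returns (assignment, temps), where assignment maps task index to a
    core index, or to None if the task is deferred.
    """
    temps = list(core_temps_c)
    assignment = {}
    # Place the hottest tasks first so they get the coolest cores.
    order = sorted(range(len(task_heat_c)), key=lambda i: -task_heat_c[i])
    for task_id in order:
        coolest = min(range(len(temps)), key=lambda c: temps[c])
        if temps[coolest] + task_heat_c[task_id] > threshold_c:
            assignment[task_id] = None  # no core can take it safely
            continue
        temps[coolest] += task_heat_c[task_id]
        assignment[task_id] = coolest
    return assignment, temps


if __name__ == "__main__":
    placement, final_temps = schedule_tasks([50.0, 60.0, 55.0],
                                            [10.0, 5.0, 20.0], 80.0)
    print(placement, final_temps)
```

Always choosing the coolest core both keeps the peak temperature under the threshold and tends to even out core temperatures, which reduces the thermal gradient across the chip.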
The goal of thermal management is to monitor and manage temperatures to maximize device performance while minimizing temperature gradients. Reducing temperatures and thermal gradients can be achieved through thermal-aware design or dynamic thermal management (DTM). In thermal-aware design, materials, physical structures, and floor plans are chosen so that thermal gradients are minimized. For example, a grid structure has been proposed to evenly distribute heat across an integrated circuit via lateral diffusion. Another example of thermal-aware design is the placement of L2 cache between cores in a multi-core system to thermally insulate them from each other.
Dynamic thermal management in integrated circuits can be roughly split into two domains: triggering mechanisms and response mechanisms. The goal of a triggering mechanism is to measure or estimate on-chip temperatures and trigger a hardware- or software-level response that is a function of those temperatures. Temperature measurements are obtained with analog or digital on-chip temperature sensors. On-chip temperatures can also be estimated indirectly through static compile-time code profiling or high-level dynamic performance analysis. Purely indirect estimates lack any real temperature feedback and can yield only relative temperature information, which is not sufficient for applications where absolute temperature measurements are required.
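A triggering mechanism of the kind described above can be sketched as a threshold comparator with hysteresis over a set of sensor readings. The class name, the trip and release temperatures, and the use of the hottest sensor as the deciding value are all assumptions made for illustration.

```python
class ThermalTrigger:
    """Signals when a thermal response should be active.

    Uses two thresholds (hysteresis) so the trigger does not chatter when
    the temperature hovers near a single trip point. The default values
    are hypothetical.
    """

    def __init__(self, trip_c=85.0, release_c=80.0):
        if release_c >= trip_c:
            raise ValueError("release_c must be below trip_c")
        self.trip_c = trip_c
        self.release_c = release_c
        self.active = False

    def update(self, sensor_readings_c):
        """Feed one set of sensor readings; return True while triggered."""
        hottest = max(sensor_readings_c)
        if not self.active and hottest >= self.trip_c:
            self.active = True
        elif self.active and hottest <= self.release_c:
            self.active = False
        return self.active


if __name__ == "__main__":
    trigger = ThermalTrigger()
    for readings in ([70.0, 72.0], [86.0, 74.0], [82.0, 75.0], [79.0, 73.0]):
        print(readings, trigger.update(readings))
```

Because the comparison uses absolute sensor readings, this style of trigger depends on real temperature feedback. An indirect estimator that provides only relative information could not drive it directly.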
The goal of a response mechanism is to maximize device reliability while minimizing performance degradation. In this domain, certain actions may need to take place in order to reduce hotspot temperatures or minimize thermal gradients. The main tradeoff in this domain is system reliability versus performance.
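One possible response action is frequency throttling. This document does not name a specific response, so the choice is an assumption. The sketch below picks the highest operating frequency whose predicted temperature stays within a limit, which makes the reliability versus performance tradeoff explicit. The function name and the linear temperature predictor are hypothetical.

```python
def pick_frequency(freq_levels_mhz, predicted_temp_c, limit_c):
    """Return the fastest frequency predicted to stay within limit_c.

    predicted_temp_c is a caller-supplied function mapping a frequency in
    MHz to a predicted temperature in degrees Celsius. If no level is
    predicted to be safe, fall back to the slowest level.
    """
    safe = [f for f in freq_levels_mhz if predicted_temp_c(f) <= limit_c]
    return max(safe) if safe else min(freq_levels_mhz)


if __name__ == "__main__":
    # Hypothetical predictor: temperature rises linearly with frequency.
    def predictor(freq_mhz):
        return 40.0 + 0.02 * freq_mhz

    print(pick_frequency([1000, 1500, 2000], predictor, limit_c=75.0))
```

Lowering the limit favors reliability at the cost of throughput, while raising it does the opposite.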