1. Field of the Invention
The present invention relates to a system, method and apparatus for reliability monitoring and, more particularly, to a finite state machine that is part of a microprocessor chip and is used to control and enhance the reliability and/or the performance of the microprocessor system. The present invention further relates to capturing the manner in which mean time to failure (MTTF) varies as a function of the input workload executing on the microprocessor or microprocessor-based system, and to using this information either to enhance reliability or to boost microprocessor performance.
2. Description of the Related Art
Advances in semiconductor (specifically, complementary metal oxide semiconductor (CMOS)) technology have been improving microprocessor performance steadily over the past few decades. However, such advances accelerate the onset of reliability problems. Specifically, one of the consequences of progressive scaling of device and interconnect geometries is the increase in average and peak power densities (and hence temperatures) across the chip.
The inherent increase in static (leakage) power with scaling into the deep sub-micron region adds to these issues. In addition, the major components of leakage power increase with temperature, making the problem even harder to control. Despite advances in packaging and cooling technologies, it is an established concern that the average and peak operating temperatures within key units inside a microprocessor chip will rise with the progressive scaling of technology.
Already, to protect against thermal runaways, microprocessors (e.g., INTEL® Pentium 4™ and IBM® POWER5™) have introduced on-chip temperature monitoring devices, with mechanisms to throttle the processor execution speeds, as needed. The objective is to reduce on-chip power when maximum allowable temperatures are approached or exceeded.
Failure rates of individual components making up an integrated circuit (or a larger system) are fundamentally related to operating temperatures, i.e., these rates increase with temperature. As such, chips or systems designed to operate within a given average temperature range are expected to fail sooner than specified if that range is routinely exceeded during normal operating conditions.
Conversely, consider a case where a chip or system is designed to meet a certain mean time to failure (MTTF) at an assumed maximum operating temperature. In this case, the designed chip or system will be expected to have a longer lifetime if the actual operating temperatures happen to be lower. Thus, it may be possible to “overclock” (or speed up) the processor during phases of the workload when the operating power and temperature values are well below the maximum values assumed during the projection of expected MTTF.
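By way of illustration only, the relationship between operating temperature and expected lifetime described above may be sketched with an Arrhenius-type model, in which MTTF is taken to be proportional to exp(Ea/kT). The Arrhenius form, the function name mttf_scale, and the default activation energy of 0.7 eV are assumptions introduced for this example; they are not taken from the present description.

```python
import math

# Boltzmann constant in electron-volts per kelvin.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def mttf_scale(t_actual_c, t_design_c, activation_energy_ev=0.7):
    """Return MTTF(actual) / MTTF(design) under an assumed Arrhenius model.

    MTTF is modeled as proportional to exp(Ea / (k * T)), with T in kelvin.
    The activation energy default is an illustrative assumption only.
    """
    t_actual_k = t_actual_c + 273.15
    t_design_k = t_design_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_actual_k - 1.0 / t_design_k
    )
    return math.exp(exponent)


# Running cooler than the design point yields a ratio above 1 (longer
# expected lifetime); running hotter yields a ratio below 1.
cooler = mttf_scale(t_actual_c=65.0, t_design_c=85.0)
hotter = mttf_scale(t_actual_c=105.0, t_design_c=85.0)
```

Under these assumptions, a ratio greater than 1 indicates lifetime headroom relative to the design projection, which is the condition under which a speed-up of the processor may be considered.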
Electromigration and stress migration effects in the chip interconnects are major sources of failures in a chip, and both have a direct dependence on operating temperature. However, reliability degradation with CMOS scaling is not solely due to the power and temperature implications. For example, time-dependent dielectric breakdown (TDDB) is an extremely important failure mechanism in semiconductor devices. With time, the gate dielectric wears down and fails when a conductive path forms in the dielectric.
With CMOS scaling, the dielectric thickness is decreasing to the point where it is only tens of angstroms. Because the scaling down of the supply voltage has, at the same time, generally slowed, the intrinsic failure rate due to dielectric breakdown is expected to increase.
Furthermore, TDDB failure rates have a very strong temperature dependence. Thermal cycling effects, caused by periodic changes in the chip temperature, are another factor that degrades reliability. Again, this factor is not directly related to the average operating temperature; rather, it is a function of the number of thermal cycles that the chip can go through before failure.
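By way of illustration only, the dependence of thermal cycling failures on the number of cycles, rather than on average temperature, may be sketched with a Coffin-Manson-type relation, in which the number of cycles to failure is taken to be proportional to the temperature swing raised to the power -q. The Coffin-Manson form, the function name cycles_to_failure_ratio, and the default exponent are assumptions introduced for this example; they are not taken from the present description.

```python
def cycles_to_failure_ratio(delta_t, delta_t_ref, exponent_q=2.35):
    """Return N_f(delta_t) / N_f(delta_t_ref) under an assumed
    Coffin-Manson model, where N_f is proportional to delta_t ** (-q).

    delta_t and delta_t_ref are peak-to-peak temperature swings per cycle
    (same units, both positive). The exponent default is illustrative only.
    """
    if delta_t <= 0 or delta_t_ref <= 0:
        raise ValueError("temperature swings must be positive")
    return (delta_t_ref / delta_t) ** exponent_q


# A smaller swing per cycle tolerates more cycles before failure than the
# reference swing; a larger swing tolerates fewer.
gentler = cycles_to_failure_ratio(delta_t=10.0, delta_t_ref=20.0)
harsher = cycles_to_failure_ratio(delta_t=40.0, delta_t_ref=20.0)
```

Note that the average operating temperature does not appear in this sketch; only the magnitude of each temperature swing enters the relation.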
Since the power consumed by the chip (or system) varies with the executing workload, it is clear that the actual operating temperature and failure rate of a component (and hence of the system) depend on the workload.