1. Field of the Invention
The present invention relates to techniques for performing reliability tests on components in computer systems. More specifically, the present invention relates to a method and an apparatus that dynamically control a temperature profile within a disk drive to facilitate a temperature-dependent reliability study on the disk drive.
2. Related Art
Computer system manufacturers routinely evaluate the reliability of individual computer system components to ensure that the computer systems manufactured from the components meet or exceed the reliability requirements of their customers. Typically, component reliabilities are determined through “reliability-evaluation studies.” These reliability-evaluation studies can include “accelerated-life studies,” which accelerate the failure mechanisms of a component, or “repair-center reliability evaluations,” in which a vendor tests components returned from the field. These types of tests typically involve using environmental stress-test chambers to hold and/or cycle one or more stress variables (e.g., temperature, humidity, or radiation) at levels that are believed to accelerate subtle failure mechanisms within a component. The components under test are placed inside the stress-test chamber and subjected to these stress conditions.
One of the key components in a computer system is a disk drive, which stores large amounts of data on a non-volatile recording medium. However, disk drives are complex electromechanical devices which are subject to failures caused by triggering events related to a variety of environmental parameters. These environmental parameters can include temperature, shock/vibration, humidity, and cooling air flow rate. A disk drive failure can cause silent data corruption, permanent data loss, and possibly an unrecoverable computer system crash. Consequently, reliability studies are frequently performed on disk drives to understand their failure mechanisms and characteristics.
In particular, a significant number of disk drive failures are temperature related. For example, the reliability of both disk drive electronics (which are subject to mechanisms such as electromigration in a flash memory chip) and disk drive mechanics (such as the spindle motor and actuator bearings) degrades as temperature increases. In addition, high temperature environments in disk drives can cause thermal instability in the data stored in the recording medium, which over long periods of time can lead to permanent data erasure. Moreover, another serious failure mode, lubricant dry-out on the disk drive surface, is exacerbated by high temperature.
To conduct accelerated-life studies on disk drive thermal reliability, the disk drives are commonly loaded into thermal chambers where temperature is cycled in an effort to accelerate mechanisms that can lead to drive failure. This type of study requires the disk drives to be shipped to a facility housing such a programmable thermal chamber. At the facility, a population of drives is placed in the thermal chambers and the temperature is cycled for fixed time intervals (e.g., 100 hours or 500 hours). The drives are then removed from the test chambers and installed into a storage array where their functionality is tested.
Unfortunately, the thermal chamber study has several drawbacks. First, it requires the drives to be uninstalled and removed from the computer systems and shipped to the test facility, which involves additional shipping time and expense. Second, during the reliability evaluation, it is usually not possible to apply pass/fail tests to the disk drives while they are inside the thermal chambers. Consequently, at the predetermined time intervals, the disk drives are removed from the thermal chambers and are evaluated “ex-situ.” Note that it is difficult to cycle temperatures for the drives while collecting real-time I/O performance information in an ex-situ reliability test. Furthermore, the thermal chamber study can only yield counts of failed drives at each inspection, without identifying the exact times of the onset of degradation in the drives. Note that it is desirable to obtain the exact times of drive failures to facilitate accurate long-term reliability projections.
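The loss of timing information described above can be illustrated with a minimal sketch. The failure times, inspection times, and the function name `interval_failure_counts` below are hypothetical values and names chosen only for illustration; they are not taken from any actual study. The sketch shows how inspecting drives only at fixed intervals reduces exact failure times to per-inspection counts.

```python
from bisect import bisect_left

def interval_failure_counts(failure_hours, inspection_hours):
    """Count failures first observed at each inspection time.

    failure_hours: exact failure times (unknown to an ex-situ study).
    inspection_hours: sorted times at which drives are removed and tested.
    Failures occurring after the last inspection are never observed.
    """
    counts = [0] * len(inspection_hours)
    for t in failure_hours:
        # Index of the first inspection at or after the failure time.
        i = bisect_left(inspection_hours, t)
        if i < len(inspection_hours):
            counts[i] += 1
    return dict(zip(inspection_hours, counts))

# Hypothetical exact failure times, in hours.
failure_hours = [42.0, 97.5, 130.0, 480.0, 650.0]
# Inspection times matching the example intervals in the text above.
inspection_hours = [100, 500]

counts = interval_failure_counts(failure_hours, inspection_hours)
print(counts)
```

In this sketch, a drive failing at 42 hours and one failing at 97.5 hours are indistinguishable at the 100-hour inspection, which is the information an ex-situ study cannot recover.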
Hence, what is needed is a method and an apparatus for performing in-situ temperature cycling for accelerated-life studies of disk drives without the above-described problems.