As the value and use of information continue to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes, thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
Information handling systems (IHSs) typically include various types of computer readable memory, such as read only memory (ROM), random access memory (RAM), Flash memory, etc. In some cases, for example, an IHS may include system memory for storing program instructions and/or data, which are accessible and executable by a host processor of the IHS. In some cases, the system memory may include a plurality of dual in-line memory modules (DIMMs), each containing one or more RAM modules mounted onto an integrated circuit board.
When an information handling system is initially powered on or rebooted, the host processor runs memory reference code (MRC), as part of the basic input/output system (BIOS) code, to initialize memory components of the IHS (including system memory and other memory components) during power-on self-test (POST). The MRC includes memory configuration setting information (such as timing, driving voltage, etc.), which is configured by the MRC during the IHS boot process and used to access the memory components during operating system (OS) runtime. When an IHS is initially powered on or rebooted, the MRC runs a margining test routine at the existing boot temperature to determine a single set of memory configuration settings for a memory component, which are stored and subsequently used to access the memory component.
During an MRC margining test routine, memory timing (frequency) is increased while writing to and reading from the memory component, until an upper margin frequency value is reached where an error occurs that causes the memory to fail. The memory timing is also decreased while writing to and reading from the memory component, until a lower margin frequency value is reached where an error occurs that causes the memory to fail. A memory timing value is selected as the median timing frequency value between the upper and lower margin frequency values and stored in the MRC configuration settings. A memory drive voltage value is likewise selected as the median memory drive voltage value between similarly determined upper and lower margin drive voltage values, and stored in the MRC configuration settings. The selected values of memory timing and memory drive voltage are typically saved in non-volatile memory (e.g., serial peripheral interface (SPI) Flash memory) for future use by the information handling system.
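The margining procedure described above can be illustrated with a minimal sketch. It is not actual MRC firmware: the function names (sweep_to_failure, select_setting), the simulated pass/fail windows, the step sizes, and the units (MHz, mV) are all hypothetical and chosen only for illustration. The sketch takes the "median" between the two margin values to be their midpoint.

```python
from typing import Callable


def sweep_to_failure(start: int, step: int, max_steps: int,
                     access_ok: Callable[[int], bool]) -> int:
    """Step a setting away from `start` until a write/read test fails.

    Returns the first value at which the access fails (the margin value).
    """
    value = start
    for _ in range(max_steps):
        if not access_ok(value):
            return value
        value += step
    raise RuntimeError("no failure observed within the sweep range")


def select_setting(start: int, step: int, max_steps: int,
                   access_ok: Callable[[int], bool]) -> int:
    """Pick the midpoint between the upper and lower margin values."""
    upper = sweep_to_failure(start, step, max_steps, access_ok)
    lower = sweep_to_failure(start, -step, max_steps, access_ok)
    return (upper + lower) // 2


# Hypothetical stand-ins for real write/read tests at boot temperature.
def timing_ok(mhz: int) -> bool:
    return 1500 <= mhz <= 1740


def voltage_ok(millivolts: int) -> bool:
    return 1150 <= millivolts <= 1290


# A single set of settings, as would be saved to non-volatile memory.
settings = {
    "timing_mhz": select_setting(1600, 10, 100, timing_ok),
    "drive_voltage_mv": select_setting(1200, 10, 100, voltage_ok),
}
```

Because both sweeps run once, at whatever temperature exists during boot, the resulting settings describe the memory component only at that temperature.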
The MRC margining test routine is typically run early in the boot process while the memory temperature, and the temperature inside the information handling system chassis, remains near room temperature (e.g., approximately 20-25° C.). During OS runtime, however, the temperature of the memory component and its surroundings may increase or decrease by a significant amount. In some cases, a thermally induced memory failure may occur on a memory component when an operating temperature of the memory component causes the memory voltage/timing requirements needed to successfully access the memory component to exceed (or fall below) the MRC memory configuration settings, which were specified for the memory component during the most recent IHS boot process. Unlike other types of memory failure, a thermally induced memory failure is a temporary (i.e., non-permanent) memory failure that occurs on an otherwise “good” memory component.
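The failure mode described above can be made concrete with a toy model. Everything in this sketch is an assumption made for illustration: the linear drift of the working voltage window with temperature, the drift rate, the window bounds, and the function names are hypothetical and do not describe any real memory component.

```python
from typing import Tuple


def shifted_window(boot_window: Tuple[int, int], boot_temp_c: int,
                   temp_c: int, drift_mv_per_deg_c: int) -> Tuple[int, int]:
    """Shift the working voltage window linearly with temperature.

    The linear model is an illustrative assumption only.
    """
    shift = (temp_c - boot_temp_c) * drift_mv_per_deg_c
    low, high = boot_window
    return (low + shift, high + shift)


def setting_still_valid(setting_mv: int, boot_window: Tuple[int, int],
                        boot_temp_c: int, temp_c: int,
                        drift_mv_per_deg_c: int) -> bool:
    """True if the boot-time setting lies inside the window at temp_c."""
    low, high = shifted_window(boot_window, boot_temp_c, temp_c,
                               drift_mv_per_deg_c)
    return low <= setting_mv <= high


# Hypothetical values: setting chosen at 25 C inside a 1150-1290 mV window.
BOOT_WINDOW = (1150, 1290)
SETTING_MV = 1220

ok_at_boot = setting_still_valid(SETTING_MV, BOOT_WINDOW, 25, 25, 2)
ok_when_hot = setting_still_valid(SETTING_MV, BOOT_WINDOW, 25, 70, 2)
ok_when_cold = setting_still_valid(SETTING_MV, BOOT_WINDOW, 25, -15, 2)
```

In this model the stored setting is valid at the boot temperature but falls outside the window at both temperature extremes. The memory component itself is unchanged, which is why such a failure disappears once the temperature returns to the boot-time range.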
Thermally induced memory failures are a common cause of information handling system failures that often cannot be duplicated when the memory component is returned to the service center for repair. This makes it difficult to identify the cause of a system failure and which parts of the system may be faulty. In some cases, a failed memory component may be returned to the end user, or sent to another end user, if the memory failure could not be duplicated by the service center. This could result in an end user experiencing repeated memory failures.
Some information handling systems include a software algorithm that detects memory failures, including thermally induced memory failures and other types of memory failures. One example is the Reliable Memory Technology (RMT) algorithm provided on many Dell information handling systems. When any type of memory failure is detected, the RMT algorithm permanently blocks out areas of the memory component that are deemed to be “bad” by writing the bad memory ranges to another memory component of the IHS. If a thermally induced memory failure occurs on an RMT-enabled system, an otherwise “good” memory component may have regions that are permanently disabled.