Not applicable.
1. Field of the Invention
The present invention generally relates to fault detection in computer systems. More particularly, the invention relates to the use of unique device identification codes stored in non-volatile memory to track failed devices in a computer system. Still more particularly, the present invention relates to a system in which failed components may be tracked physically through the use of stored or embedded identification codes.
2. Background of the Invention
A personal computer system includes a number of components with specialized functions that cooperatively interact to produce the many effects available in modern computer systems. These components typically include one or more central processing units (CPUs), an array of random access memory (RAM) modules, and certain peripheral devices such as a floppy drive, a keyboard, and a display. The components generally are interconnected by one or more “busses.” A bus is a collection of digital signal lines over which data, address, and control signals are conveyed between devices connected to the bus according to a predetermined protocol. Examples of industry standard bus protocols include the Peripheral Component Interconnect (PCI) bus, the Industry Standard Architecture (ISA) bus, and the Universal Serial Bus (USB).
For a computer system to operate successfully and efficiently, its components must function correctly. To ensure proper operation in the event of a failed component, the computer system must be capable of (1) detecting the failure, and (2) isolating the failed component so it is no longer accessed. Accordingly, many computer systems include logic for detecting when a device has failed and isolating the failed device to prevent its subsequent use by other devices (such as the CPU) in the computer system. Although the sophistication of personal computer systems continues to increase, there continues to be a concern that components may fail during operation. To protect against this eventuality, fault detection systems continue to play an important role in the operation of computer systems. The present invention relates to an improved fault detection and isolation technique.
To understand conventional fault detection and isolation schemes, it is important to understand the interaction between the computer's hardware components and the operating system (e.g., Windows 95). Application software, such as a word processor or game, uses the operating system to access the computer's hardware components to manipulate data. For example, a particular application program may require access to data on a hard disk drive. The operating system translates a data access request from the application program into one or more device level operations to obtain the requested data for the application program. The application program need not access the hard disk directly, but does so indirectly via the operating system.
Many devices, such as system memory and the CPU, are assigned a “logical” address during system initialization (“boot-up”). As such, it is common to refer to a “physical” device or a “logical” device; the physical device refers to the actual hardware device, and the logical device refers to the device as it is mapped into the logical address space. For example, system memory may comprise 4 megabyte (MB) dual in-line memory modules (DIMMs). Each physical DIMM, therefore, is a 4 MB “physical” DIMM. During boot-up, each physical DIMM is assigned a 4 MB logical address range. One physical DIMM might be assigned the 0-4 MB address range, while another DIMM might be assigned the 4-8 MB address range. The operating system accesses a particular memory location in each DIMM typically by using its starting logical address (0, 4 MB, etc.), and also an offset from the starting logical address to the targeted memory location.
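The addressing arithmetic described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the assumption that ranges are handed out consecutively in list order is made for the example rather than taken from any particular BIOS.

```python
MB = 1024 * 1024

def assign_logical_ranges(module_sizes):
    """Give each module a consecutive [start, end) logical range, in list order.

    Consecutive assignment is an assumption made for this sketch.
    """
    ranges = []
    start = 0
    for size in module_sizes:
        ranges.append((start, start + size))
        start += size
    return ranges

def logical_address(ranges, module_index, offset):
    """Return the module's starting logical address plus an offset into it."""
    start, end = ranges[module_index]
    if not 0 <= offset < end - start:
        raise ValueError("offset falls outside the module's logical range")
    return start + offset

ranges = assign_logical_ranges([4 * MB, 4 * MB])  # two 4 MB DIMMs
target = logical_address(ranges, 1, 0x100)        # second DIMM, offset 0x100
```

Here the first DIMM receives the 0-4 MB range and the second the 4-8 MB range, so `target` is the 4 MB starting address plus the offset.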
Assigning logical addresses to physical devices permits efficient use of the computer's physical resources by the operating system and applications software. Software can then be developed to run on computers with different hardware configurations; the software need only be aware of the logical addresses of the various devices. Further, if a user moves a physical device from one location in the computer to a new location, the logical address assignment may change (during boot-up) and the computer's software and operating system will be unaffected other than being made aware of the device's new logical address.
Most computer systems run various tests during boot-up in a process generally referred to as “power on self test” (POST). The POST routines are part of the Basic Input Output System (BIOS) code that is stored in read-only memory (ROM) and executed by the CPU. During execution of the POST routines, the various devices in the computer system, such as the CPU and memory, are tested to ascertain whether each device is working properly. Different types of devices are tested in different ways. Memory, for example, is tested by writing (i.e., storing) a known test data value to the memory device to be tested, and then reading (i.e., retrieving) the data value from the memory device to ensure the value read matches the value written. If a match does not exist, the memory is deemed defective; otherwise, the device is assumed to be functional. A CPU typically includes logic to test itself. The operational state of a CPU can be ascertained by the BIOS code reading the contents of various status registers internal to the CPU that indicate the CPU's functional states. Device testing also occurs to a certain extent after POST while the computer system is undergoing normal operation.
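The write-then-read memory test described above might look like the following minimal sketch. The test pattern `0xA5`, the function name, and the `StuckCellMemory` class (a simulated faulty module) are all hypothetical; real POST code runs against hardware and typically uses several patterns.

```python
def memory_passes(mem, pattern=0xA5):
    """Write a known value to every cell and read it back.

    Returns False as soon as a value read differs from the value written.
    """
    for addr in range(len(mem)):
        mem[addr] = pattern
        if mem[addr] != pattern:
            return False
    return True

class StuckCellMemory:
    """Hypothetical faulty memory: one cell always reads back as 0."""

    def __init__(self, size, bad_addr):
        self._cells = bytearray(size)
        self._bad_addr = bad_addr

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, addr, value):
        self._cells[addr] = value

    def __getitem__(self, addr):
        return 0 if addr == self._bad_addr else self._cells[addr]

good_result = memory_passes(bytearray(64))
bad_result = memory_passes(StuckCellMemory(64, bad_addr=10))
```

A plain `bytearray` stands in for a working module and passes; the simulated module with a stuck cell is deemed defective.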
After the computer's hardware devices are tested, the BIOS code provides the operating system with a Logical Resource Map (LRM) which includes the logical addresses of only those devices that are fully functional. The operating system will not permit access to those logical devices not listed in the LRM, thereby isolating those devices from use in the computer. Further, if a device fails during operation of the computer and the failure is detected, the logical resource map is changed to indicate to the operating system that the failed device is no longer available.
The CPU also uses the BIOS code to maintain a list of failed logical devices in a “failed device log” (FDL) stored in non-volatile memory (i.e., memory whose contents are not erased when power is removed from the device). During boot-up, the BIOS code reads the failed device log to determine which logical devices were previously reported as failed. As the BIOS code creates the logical resource map to be provided to the operating system, the BIOS code will not include those logical devices that have been reported previously as failed. Accordingly, fault detection and isolation involves determining that one or more of the computer devices is defective, and prohibiting further access to that device by the operating system even after the computer has been turned off and then re-started.
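As a rough sketch of this conventional scheme, the failed device log can be modeled as a set of logical addresses that persists across boots, and the logical resource map as the list of addresses that pass testing and are not in the log. The data shapes and function names here are assumptions made for illustration.

```python
def record_failures(post_results, failed_device_log):
    """Add the logical address of every device that failed testing to the log."""
    for logical_address, passed in post_results.items():
        if not passed:
            failed_device_log.add(logical_address)

def build_logical_resource_map(post_results, failed_device_log):
    """List logical addresses that passed testing and were never logged as failed."""
    return sorted(
        addr
        for addr, passed in post_results.items()
        if passed and addr not in failed_device_log
    )

fdl = set()  # stands in for the log kept in non-volatile memory

first_boot = {0: True, 4: False, 8: True}  # device at logical address 4 fails
record_failures(first_boot, fdl)
lrm_first = build_logical_resource_map(first_boot, fdl)

second_boot = {0: True, 4: True, 8: True}  # same address happens to pass later
lrm_second = build_logical_resource_map(second_boot, fdl)
```

Because the log is keyed by logical address, address 4 stays excluded on the second boot even though the device there now passes its test.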
The user, however, may wish to take remedial actions when the computer reports the presence of a failed device. For example, if the BIOS code determines that a CPU is defective, the user may replace the defective CPU with a new CPU. If a memory device has failed, the user may wish to replace the defective memory device or simply add additional memory modules without removing the defective device. In some situations, only a portion of the memory device has failed and most of the memory locations in the memory device may still be fully functional. As such, the user may not wish to replace the memory device. Instead, the user may leave the partially defective memory device in the computer and add an additional memory device to make up for the loss of memory capacity resulting from the defective memory locations.
When repairs or alterations to the computer configuration are made, the possibility exists that the FDL will no longer match the physical configuration of the computer. The following examples illustrate this problem. If a user removes a defective device, such as a CPU, and replaces it with a new device, the new CPU likely will be assigned the same logical address as the defective CPU. The FDL identifies which logical devices had previously been reported as failed. Upon subsequent system initialization, the BIOS will read the FDL and erroneously determine that the device at that logical address is still defective. Further, the new logical CPU will not be included in the LRM. Unless the user tells the computer system that the defective CPU has been replaced, the operating system will not permit access to the new CPU simply because the logical address associated with that new device is still tagged as failed in the FDL, and accordingly is not included in the LRM. This erroneous result can be remedied by the user running a known utility program that resets the failed device log so that upon subsequent boot-up, the logical address associated with the replaced device is no longer associated with a previously identified failed component. This solution, however, places a burden on the user to know that it is necessary to run such a utility program, to know which utility program to run, and how to run the program.
Another example of the mismatch that can occur between the physical configuration of a computer and the failed device log relates to the system memory. Most computer systems available today include system memory comprising one or more memory modules. The memory modules may be implemented as single in-line memory modules (SIMMs), dual in-line memory modules (DIMMs), or any other type of available memory technology. These memory modules typically include connectors that mate with corresponding connectors (commonly referred to as “slots”) on the computer system “mother” board. Many computers include connectors for eight or more memory modules, although not all of the available slots need be populated with memory modules.
The BIOS code assigns a logical memory address range to each memory module on the mother board. If one of the memory modules is found to be defective and, instead of replacing the defective memory module, the user simply adds a new module, it is possible for the newly inserted memory module to be assigned the logical address range previously assigned (i.e., assigned the last time the computer was turned on) to the defective memory module. Further, the defective memory module may be given a logical address range different than previously assigned to the defective module and not previously tagged as failed in the failed device log. During subsequent boot-up, the BIOS will read the FDL and, unless a suitable utility program is run by the user, the BIOS will report to the operating system via the logical resource map that the newly inserted memory module is defective. This erroneous result occurred because the new memory module was assigned the logical address range previously assigned to the defective module. Further, the BIOS will also erroneously report to the operating system that the defective memory module is available for use because its logical address range had not been previously tagged as failed in the failed device log. These problems can be remedied by the user running a utility program to inform the BIOS code of the new memory configuration. As noted above, however, running the utility program places an undesirable burden on the user.
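The memory mismatch described above can be reproduced in a small simulation. It assumes, purely for illustration, that every module is 4 MB and that logical ranges are assigned in slot order to populated slots; the module names and the `assign_ranges` helper are hypothetical.

```python
def assign_ranges(slots):
    """Assign consecutive 4 MB logical ranges (in MB) to populated slots, in order.

    Slot-order assignment is an assumption made for this sketch.
    """
    ranges = {}
    start = 0
    for module in slots:
        if module is not None:
            ranges[module] = (start, start + 4)
            start += 4
    return ranges

# Before: slot 1 is empty; the defective module sits in slot 2 at 4-8 MB.
before = assign_ranges(["good", None, "defective"])
fdl = {before["defective"]}  # the log records the logical range, not the module

# After: the user adds a new module in slot 1 and leaves the defective one in place.
after = assign_ranges(["good", "new", "defective"])

wrongly_excluded = [m for m, r in after.items() if r in fdl]
reported_available = [m for m, r in after.items() if r not in fdl]
```

After the change, the new module inherits the 4-8 MB range and is excluded, while the defective module moves to 8-12 MB and is reported as available.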
Accordingly, a computer system that solves the problems noted above would be beneficial. Such a computer system preferably would be capable of accurately tracking failed devices even if those devices have been replaced or assigned to different logical addresses. Further, the new computer system preferably reduces or eliminates the necessity of the user running utility programs when a physical device is removed or replaced or a new device is added to the computer. Despite the advantages that such a system would offer, to date no such system has been introduced.
The deficiencies of the prior art described above are solved in large part by a computer system implementing a fault detection and isolation technique that tracks non-functioning physical devices by codes which are embedded, or otherwise stored, in particular computer components. Examples of devices in which such codes could be embedded include the CPU (or CPUs if multiple processors are included in the computer system), memory modules comprising the computer's main system memory, and peripheral components residing on an expansion bus. The use of embedded or stored codes enables the fault detection and isolation technique to track physical devices instead of logical devices.
In accordance with the preferred embodiment, the computer system comprises one or more CPUs, one or more memory modules, a master control device, such as an I2C master, and a North bridge logic device coupling together the CPU, memory modules, and master control device. The master control device also connects to the CPU and memory modules over a serial bus, such as an I2C bus. Each CPU and memory module, and any other device for which physical tracking is desired, includes an internal non-volatile memory storage unit. If the device fails, the computer system writes an error code or message to the internal memory storage unit. The error code identifies the device as failed and preferably indicates the cause or symptom of the failure. The error code also may contain any other desired information.
During system initialization, the CPU creates a logical resource map which includes a list of logical addresses of all available (i.e., fully functional) devices. The logical resource map is provided to the computer's operating system, which permits access only to those logical devices listed as available in the logical resource map. To determine which components are working, the master control device searches the internal memory storage unit of each component for failure codes. If the master control device finds a failure code stored in one of the devices, the CPU omits the logical address associated with that physical device from the logical resource map. The operating system will prevent access to a logical device that is not listed in the logical resource map.
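A minimal sketch of this scan, assuming each device's internal non-volatile storage can be modeled as a single optional error code, is shown below. The `Device` class, its field names, and the example error value are hypothetical; an actual system would read the storage over the serial bus.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Device:
    name: str
    logical_address: int
    # Models the device's internal non-volatile storage (assumed layout):
    # None means no failure code has been written.
    error_code: Optional[int] = None

def build_logical_resource_map(devices):
    """Omit the logical address of any device that carries a stored failure code."""
    return [d.logical_address for d in devices if d.error_code is None]

cpu0 = Device("CPU 0", logical_address=0)
dimm_failed = Device("DIMM A", logical_address=1, error_code=0x21)
lrm_before = build_logical_resource_map([cpu0, dimm_failed])

# The failed module is swapped for a new one that receives the same logical address.
dimm_new = Device("DIMM B", logical_address=1)
lrm_after = build_logical_resource_map([cpu0, dimm_new])
```

Because the failure code travels with the physical part, the replacement module at the same logical address is included without any action by the user.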
In an alternative embodiment, each physical device includes an ID code that uniquely identifies and distinguishes that device from all other devices in the computer system. The computer also includes a non-volatile memory coupled to the CPU by way of the North bridge device. After a device is determined to be non-functional, either during POST or during normal system operation, a CPU stores that device's unique ID code in a failed device log in the non-volatile memory. The CPU then creates or modifies the logical resource map according to the list of failed physical devices.
In accordance with the alternative embodiment, a CPU, during computer system initialization, reads the list of ID codes from the failed device log, and the master control device retrieves the ID code from each physical device connected to the master control device. The master control device provides the retrieved ID codes from each device to the CPU, which then compares the list of ID codes from the failed device log with the list of ID codes retrieved from the devices by the master control device. If one of the device ID codes matches an entry in the failed device log, the CPU omits the logical address associated with that physical device from the logical resource map. The operating system will prevent access to a logical device that is not listed in the logical resource map.
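The comparison step in this alternative embodiment can be sketched as a membership check of each retrieved ID code against the failed device log. The ID strings, the tuple layout, and the function name below are made up for the example.

```python
def build_logical_resource_map(installed, failed_ids):
    """installed: list of (id_code, logical_address) pairs retrieved at boot.

    Omit the logical address of any device whose ID code is in the failed device log.
    """
    return [addr for id_code, addr in installed if id_code not in failed_ids]

fdl = {"DIMM-0002"}  # ID code of a module previously found to be defective

boot1 = [("DIMM-0001", 0), ("DIMM-0002", 4)]
lrm1 = build_logical_resource_map(boot1, fdl)

# A new module is added and takes over logical address 4;
# the defective module is reassigned to logical address 8.
boot2 = [("DIMM-0001", 0), ("DIMM-0003", 4), ("DIMM-0002", 8)]
lrm2 = build_logical_resource_map(boot2, fdl)
```

The defective module stays excluded even after it moves to a different logical address, and the new module is admitted at the address the defective one used to occupy.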
The centralized technique of using unique ID codes to maintain a failed device log may be combined with the more distributed approach of storing failure information directly in the computer components. During system initialization, the CPU creates a logical resource map based on ID codes in the failed device log and on error information stored in the individual components. If the data in the failed device log disagrees with the failure information stored within a particular component, then the failed device log preferably is updated to reflect the error information stored within the component. Thus, the failed device log is updated automatically during system initialization if a failed component is replaced with a working component.
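One possible reading of this combined approach is sketched below: on a disagreement, the error information stored in the component wins and the log is corrected to match. Dropping log entries for devices that are no longer installed is an additional assumption of this sketch, as are the names and data shapes.

```python
def reconcile(installed, failed_device_log):
    """installed: list of (id_code, logical_address, error_code_or_None).

    Returns (logical_resource_map, updated_failed_device_log).
    """
    installed_ids = {id_code for id_code, _, _ in installed}
    # Assumption: entries for devices that were removed are dropped from the log.
    updated_log = set(failed_device_log) & installed_ids
    lrm = []
    for id_code, addr, error_code in installed:
        if error_code is not None:
            updated_log.add(id_code)      # component reports a failure
        else:
            updated_log.discard(id_code)  # component reports no failure
            lrm.append(addr)
    return lrm, updated_log

old_log = {"CPU-A"}  # CPU-A failed earlier and has since been replaced by CPU-B
installed = [("CPU-B", 0, None), ("DIMM-1", 1, 0x10)]
lrm, new_log = reconcile(installed, old_log)
```

In this example the stale entry for the removed CPU disappears, the replacement CPU is listed in the map, and the module carrying a stored error code is both excluded and logged.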
Tracking failed devices using a failure code and/or an ID code unique to each device can eliminate the pitfalls associated with tracking logical devices. The various characteristics described above, as well as other features, will be readily apparent to those skilled in the art upon reading the following detailed description of the preferred embodiments of the invention, and by referring to the accompanying drawings.