1. Field of the Invention
This invention relates to the field of information handling and/or computer system design, operations, and maintenance. More specifically, this invention relates to a process for performing diagnostic tests on a device operably connected to an information handling and/or computer system while the system's operating system is running.
2. Description of the Related Art
Information handling systems in general have attained widespread use in business as well as personal computing environments. An information handling system, as referred to herein, may be defined as an instrumentality or aggregate of instrumentalities primarily designed to compute, classify, process, transmit, receive, retrieve, originate, switch, store, display, manifest, detect, record, reproduce, handle or utilize any form of information, intelligence or data for business, scientific, control or other purposes. The information handling system may be configured for a specific user application or requirement such as financial transaction processing, airline reservations, enterprise data storage and/or global communications. In general, an information handling system may include a variety of hardware and/or software components that may be configured to provide information and/or consume information. An information handling system may include one or more computer systems, data storage systems, and/or networking systems.
A computer system, which is one common type of information handling system, may be designed to give independent computing power to one or a plurality of users. Computer systems may be found in many forms including, for example, mainframes, minicomputers, workstations, servers, clients, personal computers, Internet terminals, notebooks, personal digital assistants, and embedded systems.
A computer system may be available as a desktop, floor-standing unit, or as a portable unit. The computer system typically includes a microcomputer unit having a processor, volatile and/or non-volatile memory, a display monitor, a keyboard, one or more floppy diskette drives, a hard disc storage device, an optional optical drive, e.g., DVD, CD-R, CD-RW, Combination DVD/CD-RW or CD-ROM, and an optional printer. A computer system also includes a commercially available operating system (“OS”), such as Microsoft Windows XP™ or Linux.
A computer system may also include one or a plurality of peripheral devices such as input/output (“I/O”) devices coupled to the system processor to perform specialized functions. Examples of I/O devices include keyboard interfaces with keyboard controllers, floppy diskette drive controllers, modems, sound and video devices, specialized communication devices, and even other computer systems communicating with each other via a network. These I/O devices are typically plugged into connectors of computer system I/O interfaces such as serial interfaces and parallel interfaces, for example. Generally, these computer systems use a system board or motherboard to electrically interconnect these devices.
As used herein, the terms “input/output device” and “I/O device” include but are not limited to modems, sound and video devices, controllers, specialized communication devices, mass storage devices such as hard discs, compact disc (“CD”) drives of many varieties, magneto-optical drives, other data storage devices, and remote terminals and processors that exchange information and data with a computer system, including exchanges over conductive means, e.g., telephone circuits, intranets, local area networks, and the Internet.
Computer systems also typically include basic input/output system (“BIOS”) programs to ease programmer/user interaction with the computer system devices. More specifically, BIOS provides a software interface between the system hardware and the operating system/application program. The operating system (“OS”) and application program typically access BIOS rather than directly manipulating I/O ports, registers, and control words of the specific system hardware. Well known device drivers and interrupt handlers access BIOS to, for example, facilitate I/O data transfer between peripheral devices and the OS, application program, and data storage elements. BIOS is accessed through an interface of software interrupts and contains a plurality of entry points corresponding respectively to the different interrupts. In operation, BIOS is typically loaded from a BIOS ROM or BIOS EPROM, where it is stored in nonvolatile form, to main memory from which it is executed. This practice is referred to as “shadowing” or “shadow RAM” and increases the speed at which BIOS executes.
Although the processor provides the “kernel” of the computer system, I/O communication between an I/O device and the processor forms a basic feature of computer systems. Many I/O devices include specialized hardware working in conjunction with OS specific device drivers and BIOS routines to perform functions such as information transfer between the processor and external devices, such as modems and printers, coupled to I/O devices.
Computer systems typically include peripheral memory storage devices such as optical disc drives with removable storage media. The removable storage media are typically used to store and/or load software, data, and documentation. Examples of optical disc drives with removable storage media include audio CD, CD-ROM, CD-R, CD-RW, DVD and combination DVD/CD-RW. To load the selected operating system onto the hard disc for the first time, the PC manufacturer typically uses a boot device to initially boot up the PC. A boot device may typically include a floppy disc or a CD-ROM.
Full memory testing on information handling systems such as high-end servers featuring non-uniform memory access (NUMA) architecture can take a long period of time when the server is booted because of the large amount of random access memory (RAM) contained in the system and the latency associated with remote memory access. Additionally, these servers are often used for mission-critical tasks that require that they be kept operating without interruption. They are therefore generally not rebooted during time periods extending up to several years. When such a system is rebooted after a long period, many users find that various components have failed.
Once a system is taken out of operation or is booted to conduct tests, memory tests can take a long period of time for a number of reasons. First, as already mentioned, the large amount of memory (up to 64 gigabytes (GB)) in some systems can simply take a long time to test. Second, the types of testing required to test a memory thoroughly can take a long time to execute. For example, a walking one-bit, walking zero-bit test is often used, which includes changing a single bit within an address line (generally 512 bits) to determine if a single bit affects any other bits within the data path. As another example, a byte addressing test is used to determine that addressing of a memory location is unique across address lines. Finally, the additional latency of testing remote memory from a system boot strap processor (BSP) further lengthens the testing time.
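By way of illustration only, the walking one-bit, walking zero-bit test and the address uniqueness test described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the testing code of any particular system: the SimulatedMemory class, its stuck_low_bit fault model, the 8-bit word width, and all function names are hypothetical and are introduced here only so that the example can run without real hardware.

```python
def walking_patterns(width):
    """Yield the walking-one patterns, then the walking-zero patterns."""
    mask = (1 << width) - 1
    for bit in range(width):
        yield 1 << bit            # a single one among zeros
    for bit in range(width):
        yield mask ^ (1 << bit)   # a single zero among ones


class SimulatedMemory:
    """Hypothetical stand-in for a memory region (assumption, not real RAM).

    If stuck_low_bit is given, that data bit always stores zero.
    """

    def __init__(self, words, width=8, stuck_low_bit=None):
        self.width = width
        self.cells = [0] * words
        self.stuck_low_bit = stuck_low_bit

    def write(self, addr, value):
        if self.stuck_low_bit is not None:
            value &= ~(1 << self.stuck_low_bit)   # inject the simulated fault
        self.cells[addr] = value & ((1 << self.width) - 1)

    def read(self, addr):
        return self.cells[addr]


def walking_bit_test(mem, addr):
    """Write each pattern to addr and return (written, read) mismatches."""
    failures = []
    for pattern in walking_patterns(mem.width):
        mem.write(addr, pattern)
        observed = mem.read(addr)
        if observed != pattern:
            failures.append((pattern, observed))
    return failures


def address_uniqueness_test(mem):
    """Store each address in its own cell, then return addresses that read back wrong.

    Assumes the word is wide enough to hold the largest address.
    """
    count = len(mem.cells)
    for addr in range(count):
        mem.write(addr, addr)
    return [addr for addr in range(count) if mem.read(addr) != addr]
```

Each word requires two passes of as many write and read cycles as it has bits, which suggests why such tests take a long time on memories of many gigabytes.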
Generally, memory testing in these high-end servers is run when the system is booted. As noted above, these boots are generally infrequent, and the boot testing keeps the system out of service for a long period of time. One alternative is to test the memory with diagnostic software. A disadvantage of this alternative is that it is generally performed during infrequent downtime, so the tests are widely separated in time. Another disadvantage is that it is difficult to perform low-level diagnostic testing while the operating system is running.
What is needed is a system and method for testing computer system components, including memory components and including low-level testing, in high-end servers such as NUMA servers while the OS is running. Such testing would avoid both the problems associated with infrequent testing of systems that must be kept in operation and the problems associated with keeping systems out of service during long testing periods.
A system and method are presented for performing testing of computer system components including memory components, including low-level testing, in computer systems while the operating system is running, including but not limited to high-end NUMA servers.
A method is presented, comprising making a device unavailable for use by a running operating system; making a processor unavailable for use by the running operating system; and running a diagnostic test of the device, the processor executing the diagnostic test to determine whether the device is functioning acceptably, while the processor and the device are unavailable for use by the running operating system.
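By way of illustration only, the sequence of steps in the method above may be sketched as follows. This is a minimal sketch under stated assumptions: the ResourcePool class, the resource names, and the run_isolated_diagnostic function are hypothetical and model the operating system's view of its resources as a simple set. Returning the processor to the operating system after the test, and returning the device only if it passes, are likewise assumptions made for the example and are not recited in the method.

```python
class ResourcePool:
    """Hypothetical model of the resources a running OS may currently use."""

    def __init__(self, processors, devices):
        self.available = set(processors) | set(devices)

    def take_offline(self, name):
        if name not in self.available:
            raise ValueError(f"{name} is not available to the OS")
        self.available.remove(name)

    def bring_online(self, name):
        self.available.add(name)


def run_isolated_diagnostic(pool, processor, device, test):
    """Isolate a device and a processor from the OS, then run the test.

    test is a callable taking (processor, device) and returning True
    when the device is functioning acceptably.
    """
    pool.take_offline(device)         # step 1: device unavailable to the OS
    try:
        pool.take_offline(processor)  # step 2: processor unavailable to the OS
    except ValueError:
        pool.bring_online(device)     # undo step 1 if step 2 cannot proceed
        raise
    try:
        passed = test(processor, device)  # step 3: run while both are isolated
    finally:
        pool.bring_online(processor)  # assumption: processor is always returned
    if passed:
        pool.bring_online(device)     # assumption: only a passing device returns
    return passed
```

In this sketch the remaining processors stay available throughout, which is the property that allows the operating system to keep running during the test.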
A system is presented, comprising at least two processors; a memory operably coupled to the processor; an operating system operably installed in the memory; a software program operably installed in the memory for making a device unavailable to the operating system while the operating system is running; a software program operably installed in the memory for making a processor unavailable to the operating system while the operating system is running; and a software program operably installed in the memory for running a diagnostic test of the device, the processor executing the diagnostic test while the device and the processor are unavailable for use by the running operating system to determine whether the device is functioning acceptably.
An apparatus is presented, comprising means for making a device unavailable for use by a running operating system; means for making a processor unavailable for use by the running operating system; and means for running a diagnostic test of the device, the processor executing the diagnostic test to determine whether the device is functioning acceptably, while the processor and the device are unavailable for use by the running operating system.