1. Field of the Invention
The present invention generally relates to computer systems, and more specifically to a method of upgrading or servicing computer components, particularly processing units in a multiprocessor computer system, without powering down the computer system or otherwise interrupting service.
2. Description of Related Art
Modern computing systems are often constructed from a number of processing units and a main memory, connected by a generalized interconnect. The basic structure of a conventional multi-processor computer system 10 is shown in FIG. 1. Computer system 10 has several processing units (CPUs) 12a, 12b, and 12c which are connected to various peripheral, or input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (random-access memory or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent memory device) whenever the computer is first turned on.
Processing units 12a-12c communicate with the peripheral devices, memory and firmware by various means, including a bus 20. Computer system 10 may have many additional components which are not shown, such as serial and parallel ports for connection to, e.g., modems or printers. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter might be used to control a video-display monitor, a memory controller can be used to access memory 16, etc. The computer can also have more or fewer than three processing units. In a symmetric multiprocessor (SMP) computer, all of the processing units 12a-12c are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture.
Conventional computer systems often allow the user to add various components after delivery from the factory. For peripheral devices, this can be accomplished using an “expansion” bus, such as the Industry Standard Architecture (ISA) bus or the Peripheral Component Interconnect (PCI) bus. Another component that is commonly added by the user is main memory. This memory is often made up of a plurality of memory modules that can be added or removed as desired. Even processing units can be added or swapped out, in more recent computer designs.
Processing units can be added to upgrade a system, or to replace older units that have become defective. A processing unit typically has several execution units, one or more dispatch units, branch units, load/store units, and arithmetic units such as floating point and fixed point units, along with several types of registers used to hold data. In addition, a processing unit usually has on-board instruction and data caches, and various other features such as a translation lookaside buffer. Consequently, as the architectures within the processing units improve, it is desirable to upgrade to the better (faster) architectures, and higher processor clock speeds. A processing unit may be added using a connector (socket) mounted on a processor card or on the primary circuit board (“motherboard”) of the computer system. These sockets include retention tabs that latch to allow the easy removal of the processors from the computer system, while securely retaining them in the sockets against vibrations, shocks, or inadvertent removal.
When a user desires to upgrade or service the system, the computer must generally be powered down prior to addition or replacement of a processor. After the maintenance is performed, the computer is re-started, and the basic input-output system (BIOS) residing in the firmware tests the processors, and makes them available to the operating system which is thereafter loaded by the firmware.
For many computer systems (particularly large servers used in a client-server network), there may be hundreds of users connected to the system, and the down time required to perform a processor upgrade or service operation can be extremely expensive. Also, in systems which are used in mission-critical applications, it is highly desirable to be able to perform a processor upgrade or service operation without service interruption, particularly when it is necessary to replace defective processors.
One approach to completely avoiding the need for processor service is to simply provide extra, redundant processors so that any defective unit may be idled without noticeable effect on the system. This approach is, of course, extremely inefficient and very expensive, especially if the basic system operation requires a large number of processors, necessitating a large number of redundant processors. Also, this approach cannot be used to avoid service interruption for processor upgrade. It would, therefore, be desirable to provide a method of upgrading or servicing a system's processors without requiring a powering down or interruption of the system, and which further does not require wasteful redundant processors.
It is therefore one object of the present invention to provide an improved method of upgrading and servicing components of a computer system.
It is another object of the present invention to provide such a method that allows the processing units of the computer system to be individually upgraded or serviced, without interrupting system operation.
It is yet another object of the present invention to provide such a method that does not require the inefficient use of redundant processors.
The foregoing objects are achieved in a method of providing maintenance for a processor array of a multiprocessor computer system, generally comprising the steps of quiescing a processor in the processor array selected for maintenance by completing program instructions assigned to the selected processor without assigning new instructions to the selected processor, removing the selected processor from a processor pool used by an operating system of the computer system, and powering down the selected processor while maintaining power to and operation of at least one other processor in the processor array. The selected processor may be identified as being defective, or may just have been selected for upgrading. In one implementation, the processor array includes a plurality of processor clusters, each cluster having at least two processors, and the quiescing, removing and powering down steps apply to all processors in one of the processor clusters. After servicing the processor, the cluster can be powered up and added back to the processor pool.
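The quiesce, remove, and power-down sequence summarized above can be illustrated with a small software model. The following Python sketch is illustrative only and is not the disclosed implementation: the names `Processor`, `ProcessorArray`, `dispatch`, `take_cluster_offline`, and `bring_cluster_online` are hypothetical, and powering a processor down is modeled as nothing more than a flag.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical stand-in for one processing unit."""
    name: str
    powered: bool = True
    accepting: bool = True          # False once quiescing has begun
    queue: deque = field(default_factory=deque)

    def drain(self):
        """Complete instructions already assigned; return them in order."""
        done = []
        while self.queue:
            done.append(self.queue.popleft())
        return done


class ProcessorArray:
    """Hypothetical model of a processor array organized into clusters."""

    def __init__(self, clusters):
        self.clusters = clusters    # cluster id -> list of Processor
        self.pool = [p for procs in clusters.values() for p in procs]

    def dispatch(self, instruction):
        """Assign work to the least-loaded processor still accepting work."""
        candidates = [p for p in self.pool if p.powered and p.accepting]
        if not candidates:
            raise RuntimeError("no processor available")
        target = min(candidates, key=lambda p: len(p.queue))
        target.queue.append(instruction)
        return target

    def take_cluster_offline(self, cluster_id):
        """Quiesce, remove from the pool, and power down one whole cluster."""
        selected = self.clusters[cluster_id]
        remaining = [p for p in self.pool if p not in selected and p.powered]
        if not remaining:
            raise RuntimeError("at least one other processor must stay up")
        completed = []
        for p in selected:
            p.accepting = False             # stop assigning new instructions
        for p in selected:
            completed.extend(p.drain())     # finish what was already assigned
            if p in self.pool:
                self.pool.remove(p)         # remove from the OS processor pool
            p.powered = False               # power down this processor only
        return completed

    def bring_cluster_online(self, cluster_id):
        """After servicing, power the cluster up and return it to the pool."""
        for p in self.clusters[cluster_id]:
            p.powered = True
            p.accepting = True
            if p not in self.pool:
                self.pool.append(p)
```

In this model, taking cluster "A" offline leaves the processors of every other cluster powered and eligible for `dispatch`, which corresponds to maintaining power to and operation of at least one other processor in the array.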
The processor service program may be embedded in the operating system (OS). The OS assigns one of the processors in the processor array to be a service processor, and the quiescing step is carried out by the service processor modifying an instruction distribution mechanism to exclude use of the selected processor. If the service processor is the processor selected for maintenance, the OS re-assigns the service processor functions to another processor in the processor array.
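The hand-off of service processor functions described above can likewise be sketched in Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class `ServiceManager`, its `eligible` list (standing in for the instruction distribution mechanism), and the policy of choosing the first remaining processor as the new service processor are all hypothetical.

```python
class ServiceManager:
    """Hypothetical model of the OS-embedded processor service program."""

    def __init__(self, processors):
        if not processors:
            raise ValueError("processor array is empty")
        # Stand-in for the instruction distribution mechanism: only the
        # processors listed here may be assigned new instructions.
        self.eligible = list(processors)
        # Assumption: the OS initially assigns the first processor
        # to act as the service processor.
        self.service_processor = processors[0]

    def quiesce(self, selected):
        """Exclude `selected` from instruction distribution.

        If `selected` is currently the service processor, its functions
        are re-assigned to another processor before it is excluded.
        """
        if selected not in self.eligible:
            raise ValueError(f"{selected} is not in the distribution list")
        if selected == self.service_processor:
            others = [p for p in self.eligible if p != selected]
            if not others:
                raise RuntimeError("no processor left to act as service processor")
            self.service_processor = others[0]
        self.eligible.remove(selected)
        return self.service_processor
```

For example, with processors `"cpu0"`, `"cpu1"`, and `"cpu2"`, calling `quiesce("cpu0")` in this model moves the service role to `"cpu1"` before `"cpu0"` is excluded, whereas calling `quiesce("cpu2")` leaves the service role where it is.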
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.