1. Field of the Invention
The present invention relates generally to computer systems and, more particularly, to methods and apparatus for disabling a processor in a multiprocessor computer.
2. Description of the Related Art
About two decades ago, a relatively compact and basic computing device, which would come to be known as the personal computer or PC, was under development. Like all personal computers since, these early personal computers utilized microprocessors coupled to various types of memory devices. However, due to the extremely limited computing capabilities of these early microprocessors and the limited capacity and high cost of high-speed memory, these early personal computers truly were nothing but stand-alone personal computing devices.
In the intervening years, microprocessors, memory devices, software, and many other portions of a computing system have seen rapid improvements in speed, capacity, complexity, and performance. By way of example, the latest generation microprocessors from Intel Corporation include the Pentium, Pentium Pro, and Pentium II Xeon (Slot-2) microprocessors. These processors are so powerful that they not only would have been considered an unbelievable evolution over the Z80 and 8080 microprocessors of two decades ago, but they also offer significant enhancements over the prior generation 486 processors. Even in view of this rapid and incredible improvement of microprocessors, the resource requirements of software are always increasing, as is the variety of uses for "personal" computers. These needs, in turn, drive the need for the design and development of ever more powerful and efficient computer systems.
In view of these vast technological improvements, personal computers have made great strides from their humble beginnings to provide solutions for the ever expanding needs and desires of the computing public. For example, two decades ago, virtually all large or complicated computing operations, from data processing to telephone networks, were handled by large mainframe computers. However, networks of microprocessor-based personal computers have made tremendous inroads into areas that were once the exclusive domain of such large mainframe computers. Such networks of personal computers provide the computing power and centralized access to data of mainframe systems, along with the distributed computing capability of stand-alone personal computers. These networks typically include tens, hundreds, or even thousands of personal computers, including powerful personal computers that can act as servers. Indeed, as such networks have become larger and more complex, there has been a need for improving the computing performance of servers on the network. To address this need, personal computers configured to act as servers now use multiple processors to provide greater computing power.
The expansion of microprocessor-based personal computers into the mainframe domain, however, has not been problem-free. Mainframe computers have historically been designed to be reliable and extremely fault tolerant. In other words, a failure of a portion of the mainframe computer does not typically result in lost or corrupted data or extensive downtime. Because personal computer networks are increasingly being used instead of mainframe systems, users are demanding that such networks provide fault tolerance similar to that found in the mainframe systems.
In view of these user demands, manufacturers have devised various ways for providing fault tolerance in personal computer networks. Many of these developments have concentrated on the fault tolerance of the servers in a personal computer network, because servers are typically the cornerstone of most networks. In other words, because the servers typically provide applications, data, and communications among the various workstations, the failure of one server could cause the entire network to fail.
In a multiprocessor computer such as those typically used as servers, one problem that may occur involves the failure of one of the multiple processors. Because of this possibility, a fault-tolerant scheme should include the ability to detect when one of the multiple processors has failed. Current fault detection schemes of this type typically attempt to determine whether a processor has failed during the power-up sequence. For example, one method of booting a multiprocessor computer involves the assignment of a primary processor, typically called a boot processor, which is responsible for activating the remainder of the computer system. Once the boot processor has been successfully started, the boot processor then tests the remaining processors and various other components in the computer system. While this scheme facilitates the detection of a failed secondary processor, it does not address a situation where the boot processor fails. In such a situation, the boot processor would be unable to activate the secondary processors, leaving the entire server incapacitated even though one or more secondary processors may remain fully operational.
In an effort to address this problem, one technique utilizes a timer associated with the processors, along with specialized hardware to determine the hierarchy of the multiple processors. When the system is reset, the boot processor is initialized by the hardware and activated to boot the remainder of the computer system including the secondary processors. However, if the boot processor does not take certain actions within the period set by the timer, the timer expires and sends a signal to the hardware to cause the hierarchy of the multiple processors to be changed. Thus, one of the secondary processors becomes the boot processor, and it attempts to activate the computer system. This process, which is typically referred to as a hot spare boot, continues until a successful boot operation occurs.
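The hot spare boot sequence described above can be illustrated with a minimal Python sketch. This is only a model under stated assumptions: the function name `hot_spare_boot`, the `boots_in_time` callback (standing in for "the boot processor takes certain actions within the period set by the timer"), and the choice to give up after every processor has had one turn are all hypothetical and are not taken from any particular hardware.

```python
from typing import Callable, List, Optional


def hot_spare_boot(processors: List[str],
                   boots_in_time: Callable[[str], bool]) -> Optional[str]:
    """Return the processor that booted the system, or None if all failed.

    Assumption: each processor gets exactly one turn as boot processor.
    """
    hierarchy = list(processors)  # hierarchy[0] is the current boot processor
    for _ in range(len(hierarchy)):
        candidate = hierarchy[0]
        if boots_in_time(candidate):
            # The boot processor acted before the timer expired.
            return candidate
        # Timer expired: the hierarchy is changed so that a secondary
        # processor becomes the boot processor on the next attempt.
        hierarchy.append(hierarchy.pop(0))
    return None


# Example: CPU0 is faulty, so CPU1 ends up booting the system.
health = {"CPU0": False, "CPU1": True, "CPU2": True}
winner = hot_spare_boot(["CPU0", "CPU1", "CPU2"], lambda cpu: health[cpu])
print(winner)
```

In this model, rotating the list plays the role of the specialized hardware that changes the processor hierarchy after a timeout.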
Although this type of technique may be quite satisfactory in many circumstances, shortcomings do exist. Of primary concern is the method conventionally used to exclude a processor from the boot process. Currently, the "FLUSH#" pin on the processor is asserted during a reset to cause a failed processor to shut itself off. After a reset, when the processor samples this pin and determines that the FLUSH# signal has been asserted, the processor's pins are placed in a high impedance state or tristate mode so that the processor "plays dead" during the rest of the system's normal operations. While this method appears to be quite satisfactory, as mentioned above, the inventors have questioned the reliability of this method if the processor has internal failures. For instance, a failing processor may not be able to sample the FLUSH# pin, and, if it can, it may not be able to operate properly to remove itself from operation. Thus, a problem with this conventional method is that it relies on a failing processor (a) to interpret an incoming signal and (b) to perform the appropriate action to remove itself from operation.
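The dependence on the failing processor's own cooperation can be made concrete with a small Python model. The `Processor` class and its fields `can_sample_flush` and `can_tristate` are hypothetical names introduced only for illustration; they stand for steps (a) and (b) above and do not describe any real processor's internals.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    can_sample_flush: bool = True  # hypothetical: step (a), reading the pin
    can_tristate: bool = True      # hypothetical: step (b), floating its pins
    driving_bus: bool = True

    def reset(self, flush_asserted: bool) -> None:
        """Model the reset-time sampling of the FLUSH# pin."""
        self.driving_bus = True
        if flush_asserted and self.can_sample_flush and self.can_tristate:
            self.driving_bus = False  # the processor "plays dead"


# A processor whose internal logic still works removes itself as intended.
good = Processor("CPU0")
good.reset(flush_asserted=True)

# A processor that cannot sample the pin keeps driving the bus.
faulty = Processor("CPU1", can_sample_flush=False)
faulty.reset(flush_asserted=True)

print(good.driving_bus, faulty.driving_bus)
```

In the model, the request to shut down succeeds only when both internal capabilities are intact, which is exactly the condition that cannot be guaranteed for a processor already known to be failing.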
The present invention may address one or more of the problems set forth above.
Certain aspects commensurate in scope with the disclosed embodiments are set forth below. It should be understood that these aspects are presented merely to provide the reader with a brief summary of certain forms the invention might take and that these aspects are not intended to limit the scope of the invention. Indeed, the invention may encompass a variety of aspects that may not be set forth below.
In one embodiment, a computer has a plurality of processors, each of which is powered by a respective voltage regulator module (VRM). During a power-on sequence, one of the processors is designated as a boot processor, which is responsible for booting the remaining processors. If the boot processor is operating correctly, it delivers a signal to stop an associated timer, and it boots the computer. However, if the boot processor is not able to boot the computer, the computer resets itself. Specifically, in this embodiment, the timer associated with the boot processor times out and delivers a signal to control logic if the boot processor does not boot the computer within a given time period. In response to this signal, the control logic delivers a signal to the VRM associated with the boot processor. The signal causes the VRM to discontinue supplying power to the boot processor, thus disabling the boot processor. This process may continue until one of the processors successfully boots the computer.
In another embodiment, a computer has a plurality of processors, each of which is powered by the computer's power supply via a respective transistor. During a power-on sequence, one of the processors is designated as a boot processor, which is responsible for booting the remaining processors. If the boot processor is operating correctly, it delivers a signal to stop an associated timer, and it boots the computer. However, if the boot processor is not able to boot the computer, the computer resets itself. Specifically, in this embodiment, the timer associated with the boot processor times out and delivers a signal to control logic if the boot processor does not boot the computer within a given time period. In response to this signal, the control logic delivers a signal to the transistor associated with the boot processor. The signal turns off the transistor to discontinue the supply of power to the boot processor, thus disabling the boot processor. This process may continue until one of the processors successfully boots the computer.
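The control flow shared by the two embodiments above may be sketched in Python as follows. This is an illustrative model only: `PowerSwitch` is a hypothetical stand-in for either a VRM or a transistor, the `healthy` mapping stands in for whether a processor stops its timer in time, and each loop iteration represents one reset and boot attempt. None of these names come from the embodiments themselves.

```python
from typing import Dict, List, Optional


class PowerSwitch:
    """Hypothetical stand-in for a VRM or a transistor feeding one processor."""

    def __init__(self) -> None:
        self.on = True

    def cut_power(self) -> None:
        self.on = False


def boot_with_power_cut(order: List[str],
                        healthy: Dict[str, bool],
                        switches: Dict[str, PowerSwitch]) -> Optional[str]:
    """Try each powered processor as boot processor; cut power on timeout."""
    for cpu in order:
        if not switches[cpu].on:
            continue  # already disabled by an earlier attempt
        if healthy[cpu]:
            return cpu  # processor stopped its timer and booted the computer
        # Timer expired: the control logic signals the switch, which
        # discontinues the supply of power to this processor.
        switches[cpu].cut_power()
    return None


cpus = ["CPU0", "CPU1"]
switches = {cpu: PowerSwitch() for cpu in cpus}
booted = boot_with_power_cut(cpus, {"CPU0": False, "CPU1": True}, switches)
print(booted, switches["CPU0"].on, switches["CPU1"].on)
```

Note that `cut_power` acts on the switch rather than on the processor, so in this model the failed processor is disabled without being asked to do anything itself.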