1. Field of the Invention
The present invention relates generally to computer systems and, more particularly, to methods and apparatus for facilitating the removal or replacement of a bad processor.
2. Description of the Related Art
About two decades ago, a relatively compact and basic computing device, which would come to be known as the personal computer or PC, was being developed. Like all personal computers since, these early personal computers utilized microprocessors coupled to various types of memory devices. However, due to the extremely limited computing capabilities of these early microprocessors and the limited size and costliness of high speed memory, these early personal computers truly were nothing but stand alone personal computing devices.
In the intervening years, microprocessors, memory devices, software, and many other portions of a computing system have seen rapid improvements in speed, capacity, complexity, and performance. By way of example, the latest generation microprocessors from Intel Corporation include the Pentium, Pentium Pro, and Pentium II Xeon microprocessors. These processors are so powerful that they not only would have been considered an unbelievable evolution over the Z80 and 8080 microprocessors of two decades ago, but they also offer significant enhancements over the prior generation 486 processors. Even in view of this rapid and incredible improvement of microprocessors, the resource requirements of software are always increasing, as is the variety of uses for "personal" computers. These needs, in turn, drive the need for the design and development of ever more powerful and efficient computer systems.
In view of these vast technological improvements, personal computers have made great strides from their humble beginnings to provide solutions for the ever expanding needs and desires of the computing public. Over the course of the past twenty years, personal computers have become an indispensable part of everyday life. Virtually every business relies to some degree upon personal computer systems, and personal computers are now found in many homes. Indeed, personal computers control everything from stock market trading to telephone networks.
For example, two decades ago, virtually all large or complicated computing operations, from data processing to telephone networks, were handled by large mainframe computers. However, networks of microprocessor-based personal computers have made tremendous inroads into areas that were once the exclusive domain of such large mainframe computers. Such networks of personal computers provide the computing power and centralized access to data of mainframe systems, along with the distributed computing capability of stand alone personal computers. These networks typically include tens, hundreds, or even thousands of personal computers, including powerful personal computers that can act as servers. Indeed, as such networks have become larger and more complex, there has been a need for improving the computing performance of servers on the network. To address this need for more powerful servers, multiple processors are now being used in personal computers which are configured to act as servers.
The expansion of microprocessor-based personal computers into the mainframe domain, however, has not been problem free. Mainframe computers have historically been designed to be reliable and extremely fault tolerant. In other words, a failure of a portion of the mainframe computer does not typically result in lost or corrupted data or extensive down time. Moreover, mainframe computers have historically been very service friendly. In other words, mainframe computers may be upgraded or repaired, in many circumstances, without shutting down the computer. Because personal computer networks are increasingly being used instead of mainframe systems, users are demanding that such networks provide fault tolerance and serviceability similar to that found in the mainframe systems.
In view of these user demands, manufacturers have devised various ways for providing fault tolerance in personal computer networks. Many of these developments have concentrated on the fault tolerance of the servers in a personal computer network, because servers are typically the cornerstone of most networks. In other words, because the servers typically provide applications, data, and communications among the various work stations, the failure of one server could cause the entire network to fail.
In one network fault tolerance scheme, two servers operate independently of each other but are capable of handling an increased workload if one of the servers fails. In such a scheme, each server periodically transmits a "heartbeat" message over the network to the other server to indicate that the transmitting server is functioning properly. If the receiving server does not receive the heartbeat message within a predetermined time interval, then the receiving server concludes that the transmitting server has failed and seizes the workload of the transmitting server.
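By way of illustration only, the heartbeat scheme described above may be sketched as follows. The sketch is not taken from any particular prior system; the class name `PeerMonitor`, its fields, and the representation of a workload as a list are assumptions made solely for the example. Times are passed in explicitly so that the timeout logic can be followed without a real network or clock.

```python
from dataclasses import dataclass, field


@dataclass
class PeerMonitor:
    """Hypothetical helper: one server's view of its peer's heartbeats."""
    timeout_s: float                 # the "predetermined time interval"
    last_heartbeat: float = 0.0      # time at which the last heartbeat arrived
    peer_failed: bool = False        # set once the peer is presumed failed
    workload: list = field(default_factory=list)  # this server's own work

    def on_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat message arrives from the peer.
        self.last_heartbeat = now

    def poll(self, now: float, peer_workload: list) -> bool:
        # If no heartbeat arrived within the interval, conclude that the
        # peer has failed and seize its workload (done only once).
        if not self.peer_failed and now - self.last_heartbeat > self.timeout_s:
            self.peer_failed = True
            self.workload.extend(peer_workload)
        return self.peer_failed
```

In this sketch, a monitor created with `timeout_s=5.0` that last received a heartbeat at time 10.0 would still consider its peer healthy when polled at time 14.0, and would take over the peer's workload when polled at time 16.0.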
In regard to the individual multiprocessor computers that are typically used as servers, one problem that may occur involves the failure of one of the multiple processors. Because of this possibility, a fault-tolerance scheme should include the ability to detect when one of the multiple processors has failed. When a processor failure has been detected, it would also be desirable to detect which processor has failed so that the computer may discontinue use of the processor and rely on the remaining processors.
When a faulty processor or server has been detected and removed from operation, it would be desirable to repair or replace the faulty component with minimal disruption to the network. However, when a processor fails, the computer of which it is a part typically crashes. Thus, it must be taken out of service temporarily so that the failed processor may be replaced. Depending upon the redundancy and complexity of the computer system, such a temporary removal may have wide ranging effects, from slightly degrading the overall performance of the computer system to temporarily removing the computer system from service.
In addition to the unscheduled downtime caused by processor failures, it is typically desirable to upgrade a computer's processors from time to time. Such upgrades must typically be scheduled during non-peak times in order to minimize the downtime or performance degradation of the networked computer system.
The present invention may address one or more of the problems discussed above.
Certain aspects commensurate in scope with the disclosed embodiments are set forth below. It should be understood that these aspects are presented merely to provide the reader with a brief summary of certain forms the invention might take and that these aspects are not intended to limit the scope of the invention. Indeed, the invention may encompass a variety of aspects that may not be set forth below.
In a multiprocessor computer, it may be desirable to remove or replace one or more of the processors for various reasons. As described herein, the computer may remain operative during processor removal or replacement. In a computer having a split bus design, for example, the bus to which the processor to be removed or replaced is coupled is identified. The processes on the identified bus are interrupted and rescheduled, and all processors on the identified bus are placed into a sleep mode. The power to the processor to be removed or replaced is disconnected, and the user is informed that the processor may be removed or replaced. Once the processor has been removed or replaced, all processors on the identified bus are returned to normal operation.
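The sequence summarized above may be sketched, under stated assumptions, as follows. The sketch models processors as simple records rather than controlling real hardware. The names `Processor`, `prepare_for_swap`, and `resume_after_swap` are hypothetical, as are the state labels. The summary does not say where the interrupted processes are rescheduled; the sketch assumes they are moved to processors on another bus.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical record for one processor in a split bus computer."""
    cpu_id: int
    bus_id: int
    state: str = "running"   # "running", "sleeping", or "unpowered"
    processes: list = field(default_factory=list)


def prepare_for_swap(cpus, target_id):
    """Ready the target processor for removal; return the identified bus."""
    # Identify the bus to which the target processor is coupled.
    target = next(c for c in cpus if c.cpu_id == target_id)
    bus = target.bus_id
    on_bus = [c for c in cpus if c.bus_id == bus]
    off_bus = [c for c in cpus if c.bus_id != bus]
    if not off_bus:
        raise RuntimeError("no processor on another bus to take over the work")

    # Interrupt and reschedule the processes on the identified bus
    # (assumption: round-robin onto processors of the other bus), then
    # place every processor on the identified bus into a sleep mode.
    moved = 0
    for cpu in on_bus:
        for proc in cpu.processes:
            off_bus[moved % len(off_bus)].processes.append(proc)
            moved += 1
        cpu.processes = []
        cpu.state = "sleeping"

    # Disconnect power to the target and inform the user.
    target.state = "unpowered"
    print(f"Processor {target_id} on bus {bus} may now be removed or replaced.")
    return bus


def resume_after_swap(cpus, bus):
    """Return all processors on the identified bus to normal operation."""
    for cpu in cpus:
        if cpu.bus_id == bus:
            cpu.state = "running"
```

For example, with two processors on bus 0 and two on bus 1, calling `prepare_for_swap(cpus, 1)` would move the work of both bus 0 processors onto bus 1, leave processor 0 sleeping and processor 1 unpowered, and `resume_after_swap(cpus, 0)` would later return the bus 0 processors to the running state.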