1. Technical Field
The present invention relates generally to data processing systems and in particular to hot-pluggable components of data processing systems. Still more particularly, the present invention relates to a method, system and data processing system configuration that enable non-disruptive, automatic detection and hot-removal of hot-pluggable problem components of a data processing system.
2. Description of the Related Art
The need for better and more resourceful data processing systems in both the personal and commercial context has led the industry to continually improve the systems being designed for customer utilization. Generally, for both commercial and personal systems, improvements have focused on providing faster processors, larger upper level caches, greater amounts of read only memory (ROM), larger random access memory (RAM) space, etc.
Meeting customer needs has also required enabling the customer to enhance and/or expand an already existing system with additional resources, including hardware resources. For example, a customer with a computer equipped with a CD-ROM may later decide to “upgrade” to or add a DVD drive. Alternatively, the customer may purchase a system with a Pentium 1 processor chip with 64K-byte memory and later decide to upgrade/change the chip to a Pentium 3 chip and increase memory capabilities to 256K bytes.
Current data processing systems are designed to allow these basic changes to the system's hardware configuration with little effort. As is known by those skilled in the art, upgrading the processor and/or memory involves removing the computer casing and “clipping” in the new chip or memory stick in a respective one of the processor decks and memory slots available on the motherboard. Likewise, the DVD drive may be connected to one of the receiving internal input/output (I/O) ports on the motherboard. With some systems, an external DVD drive may also be connected to one of the external serial or USB ports.
Additionally, with commercial systems in particular, improvements have also included providing larger amounts of processing resources, i.e., rather than replacing the current processor with one that is faster, purchasing several more of the same processing systems and linking them together to provide greater overall processing ability. Most current commercial systems are designed with multiple processors in a single system, and many commercial systems are distributed and/or networked systems with multiple individual systems interconnected to each other and sharing processing tasks/workload. Even these “large-scale” commercial systems, however, are frequently upgraded or expanded as customer needs change.
Notably, when the system is being upgraded or changed, particularly for internally added components, it is often necessary to power the system down before completing the installation. With externally connected I/O components, however, it may be possible to merely plug the component in while the system is powered-up and running. Irrespective of the method utilized to add the component (internal add or external add), the system includes logic associated with the fabric for recognizing that additional hardware has been added or simply that a change in the system configuration has occurred. The logic may then prompt the user to initiate (or may automatically initiate) a system configuration upgrade and, if necessary, load the required drivers to complete the installation of the new hardware. Notably, a system configuration upgrade is also required when a component is removed from the system.
The process of making new I/O hardware almost immediately available for utilization by a data processing system is commonly referred to in the art as “plug and play.” This capability allows current systems to automatically make the component available for use once the component is recognized and the necessary drivers, etc. for proper operation are installed.
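The configuration-change recognition described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the logic can obtain a list of identifiers for the components currently attached; the function name `detect_config_change` and the component identifiers are hypothetical and are not taken from any particular system.

```python
def detect_config_change(previous, current):
    """Compare two hardware inventories (collections of component IDs).

    Returns (added, removed) as sorted lists. If either list is
    non-empty, a system configuration update is required.
    """
    previous, current = set(previous), set(current)
    added = sorted(current - previous)
    removed = sorted(previous - current)
    return added, removed


# Hypothetical inventories taken before and after a hardware change.
before = {"processor1", "processor2", "memory", "cdrom"}
after = {"processor1", "processor2", "memory", "dvd"}

added, removed = detect_config_change(before, after)
for component in added:
    print(f"load driver and configure: {component}")
for component in removed:
    print(f"remove from configuration: {component}")
```

Note that the same comparison reports both additions and removals, which reflects the point that a configuration update is needed in either case.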
FIG. 1A illustrates a commercial SMP comprising processor1 101 and processor2 102, memory 104, and input/output (I/O) devices 106, all connected to each other via interconnect fabric 108. Interconnect fabric 108 includes wires and control logic for routing communication between the components as well as controlling the response of MP 100 to changes in the hardware configuration. Thus, new hardware components would also be connected (directly or indirectly) to existing components via interconnect fabric 108.
As illustrated within FIG. 1A, MP 100 comprises logical partition 110 (i.e., software implemented partition), indicated by dotted lines, that logically separates processor1 101 from processor2 102. Utilization of logical partition 110 within MP 100 allows processor1 101 and processor2 102 to operate independently of each other. Also, logical partition 110 substantially shields each processor from operating problems and downtime of the other processor.
Commercial systems, such as MP 100, may be expanded to meet customer needs as described above. Additionally, the changes to the commercial system may be as a result of a faulty component that causes the system to not operate at full capacity or, in the worst case, to be inoperable. When this occurs, the faulty component has to be replaced. Some commercial customers rely on the manufacturer/supplier of the system to manage the repair or upgrade required. Others employ service technicians (or technical support personnel), whose main job is to ensure that the system remains functional and that required upgrades and/or repairs to the system are completed without severely disrupting the ability of the customer's employees to access the system or the ability of the system to continue processing time-sensitive work.
In current systems, if a customer (i.e., the technical support personnel) desires to remove one processor (e.g., processor1 101) from the system of FIG. 1A, the customer has to complete the following sequence of steps:
(1) The instructions are stopped from executing on processor1 101, and all the I/O is suppressed;
(2) A partition is imposed between the processors;
(3) Then, the system is shut down (powered off). From the customer's perspective, an outage is seen since the system is not available for any processing (i.e., even operations on processor2 102 are halted);
(4) Processor1 101 is removed, and the system is powered back on; and
(5) The system (processor2 102) is then un-quiesced. The un-quiesce process involves restarting the system, rebooting the OS, and resuming the I/O operations and the processing of instructions.
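The five steps above can be sketched as a toy model that records, after each step, whether the system is available to the customer. This is an illustrative simulation only: the `System` class, its fields, and the step labels are hypothetical, and the model assumes the system counts as available whenever it is powered on and at least one processor is executing instructions.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Toy model of a two-processor system (all names are illustrative)."""
    processors: set = field(default_factory=lambda: {"processor1", "processor2"})
    running: set = field(default_factory=lambda: {"processor1", "processor2"})
    powered_on: bool = True
    log: list = field(default_factory=list)

    def available(self):
        # Assumption: usable only if powered on and some processor is running.
        return self.powered_on and bool(self.running)


def remove_processor_prior_art(system, target):
    """Apply the five conventional steps, logging availability after each."""
    # (1) Stop instructions on the target processor and suppress I/O.
    system.running.discard(target)
    system.log.append(("quiesce", system.available()))
    # (2) Impose a partition between the processors (no state change here).
    system.log.append(("partition", system.available()))
    # (3) Power off the whole system: every processor stops.
    system.powered_on = False
    system.running.clear()
    system.log.append(("power_off", system.available()))
    # (4) Physically remove the target, then power the system back on.
    system.processors.discard(target)
    system.powered_on = True
    system.log.append(("remove_and_power_on", system.available()))
    # (5) Un-quiesce: the remaining processors resume work.
    system.running = set(system.processors)
    system.log.append(("unquiesce", system.available()))
    return system.log


for step, up in remove_processor_prior_art(System(), "processor1"):
    print(f"{step:20s} available={up}")
```

In this model the system is unavailable from the power-off step until the un-quiesce step completes, even though processor2 is not the component being removed.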
Likewise, if the customer desires to add a processor (e.g., processor1 101) to a system having only processor2 102, a somewhat reversed sequence of steps must be followed:
(1) The instructions are stopped from executing on processor2 102, and all the I/O is suppressed. From the customer's perspective, an outage is seen since the system is not available for any processing (i.e., operations on processor2 102 are halted);
(2) Then, the system is shut down (powered off);
(3) Processor1 101 is added and the system is powered back on. Processor1 101 is initialized at this point. Initialization typically involves conducting a series of tests including built-in self test (BIST), etc.; and
(4) The system is then un-quiesced. The un-quiesce process involves restarting the system and resuming the I/O operations and the processing of instructions on both processors.
With large-scale commercial systems, the above 5-step and 4-step processes can be extremely time intensive, requiring up to several hours to complete in some situations. During that down-time, the customer cannot utilize/access the system. The outage is therefore very visible to the customer and may result in substantial financial loss, depending on the industry or specific use of the system. Also, as indicated above, a mini-reboot or full reboot of the system is required to complete either the add or remove process. Notably, the above outage is experienced with systems having actual physical partitions as well, which are described below.
FIG. 1B illustrates a sample MP server cluster with physical partitions. MP server cluster 120 comprises three servers, server1 121, server2 122, and server3 123, interconnected via backplane connector 128. Each server is a complete processing system with processor 131, memory 136, and I/O 138, similar to MP 100 of FIG. 1A. A physical partition 126, illustrated as a dotted line, separates server3 123 from server1 121 and server2 122. Server1 121 and server2 122 may be initially coupled to each other and then server3 123 is later added. Alternatively, all servers may be initially coupled to each other and then server3 123 is later removed. Irrespective of whether server3 123 is being added or removed, the above multi-step process, which involves taking down the entire system and results in the customer experiencing an outage, is the only known way to add server3 123 to, or remove it from, MP server cluster 120.
Removal of a server or processor from a larger system is often triggered by that component exhibiting problems while operating. These problems may be caused by a variety of reasons, such as bad transistors, faulty logic or wiring, etc. Typically, when a system/resource is manufactured, the system is taken through a series of tests to determine if the system is operating correctly. This is particularly true for server systems, such as those described above in FIG. 1B. Even with near 100 percent accuracy in the testing, some problems may not be detected during fabrication. Further, internal components (transistors, etc.) often go bad some time after fabrication, and the system may be shipped to the customer and added to the customer's existing system. A second series of tests is usually carried out on the system when it is connected to the customer's existing system to ensure that the system being added is operating within the established parameters of the existing system. The latter sequence of tests (customer-level) is initiated by a technician (or design engineer), whose job is to ensure the existing system remains operational with as little down time as possible.
In very large/complex systems, the task of running tests on the existing and newly added systems often takes up a large portion of the technician's time. Moreover, when a problem occurs, it is usually not noticed until some time afterward (perhaps several days). When a problem is found with a particular resource, that resource often has to be replaced. As described above, replacing the resource requires the technician to take down the entire system, even when the resource being replaced/removed is logically or physically partitioned off from the remaining system.
A problem component that is sharing the workload of the system may result in less efficient work production than the system would achieve without that component. Alternatively, the problem component may introduce errors into the overall processing that render the entire system ineffective. Currently, removal of such components requires a technician to first conduct a test of the entire system, isolate which component is causing the problem, and then initiate the removal sequence of steps described above. Thus, a large part of system maintenance requires the technician to continually run diagnostic tests on the systems, and system monitoring consumes a large number of man-hours and may be very costly to the customer. Also, problem components are not identified until the technician runs the diagnostic, and the problem component may not be identified until it has corrupted the operation being processed by the system. Some processing results may have to be discarded, and the system may have to be backed up to the last correct state.
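The manual isolation step described above amounts to testing every component in turn and collecting those that fail. A minimal sketch follows, assuming a caller-supplied diagnostic function; the function name `manual_diagnostic_sweep`, the `passes_test` parameter, and the health values are hypothetical stand-ins for whatever diagnostics a technician would actually run.

```python
def manual_diagnostic_sweep(components, passes_test):
    """Test every component in turn and return the names of those that fail.

    `passes_test` stands in for the technician's diagnostic; it returns
    True for a healthy component and False for a problem component.
    """
    return [name for name in components if not passes_test(name)]


# Hypothetical health status for the three servers of a cluster.
status = {"server1": True, "server2": True, "server3": False}

faulty = manual_diagnostic_sweep(status, lambda name: status[name])
print("components to remove:", faulty)
```

Because the sweep only reports a fault when it is run, a component that fails between sweeps continues to share the workload until the next sweep, which is the delay noted above.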
The present invention recognizes that it would be desirable to enable a system to automatically identify major hot-pluggable hardware components that exhibit operating problems and dynamically respond to a problem component by directing operations from the problem component to other functional components. A system and method that enable the automatic removal of the problem component in a manner that is invisible to the customer, but that automatically alert the customer to the existence and removal of the problem component, would be a welcome improvement. These and other benefits are provided by the invention described herein.