The invention is generally related to computers and computer software. More specifically, the invention is related to concurrent maintenance of computers and the like.
Computer downtime, or the period of time in which a particular computer is unavailable for use, often raises significant concerns in a number of computer applications. For single-user computers, computer downtime may only inconvenience the primary users of the computers. However, for multi-user computers such as mainframe computers, midrange computers, supercomputers, network servers, and the like, the inability to use a particular computer may have a significant impact on the productivity of a relatively large number of users, particularly in mission-critical applications. A large proportion of multi-user computers are used around the clock, and as a result, it is often critically important that these computers be available as much as possible.
However, multi-user computers, like anything else, need to be maintained from time to time. Components may fail and need replacement. Also, as the workload of a computer increases, additional components may need to be added. Furthermore, as technology advances, new and improved components may become available. With many conventional computers, however, many of these operations require that the computers be shut down and made unavailable while maintenance is being performed.
To address the problems associated with computer downtime, significant development efforts have been made in the area of concurrent maintenance. Concurrent maintenance is a process by which maintenance of a computer occurs while the computer is running, and with minimal impact on user accessibility.
For example, a number of computer interfaces have been proposed and/or implemented in the area of "hot swappability", whereby components may be installed and/or removed from a computer without having to shut down and/or restart the computer. For example, a Peripheral Component Interconnect (PCI) hot plug specification has been defined to permit electronic components to be installed and/or removed from a PCI bus implemented in a computer.
A PCI bus is typically a high speed interface between the processing complex of a computer and one or more "slots" that receive printed circuit boards known as interface or adapter cards. The cards typically control hardware devices that are either disposed on the cards or are coupled thereto through dedicated cabling. Any number of hardware devices may be coupled to a computer in this manner, including computer displays, storage devices (e.g., disk drives, optical drives, floppy drives, and/or tape drives), workstation controllers, network interfaces, modems, and sound cards, among others. The PCI hot plug specification permits individual slots on a PCI bus to be selectively powered off to permit cards to be removed from and/or installed into the slots.
One problem, however, with the PCI hot plug specification, as well as other concurrent maintenance implementations, is that often additional steps such as manual reconfiguration and/or partial or total system restart are required. Specifically, updates are often required to the computer programs that function as the interfaces between the computer and various hardware devices.
Using such interface computer programs, for example, enables the complexity and specifics of a particular hardware device to be effectively hidden from another computer program wishing to use the device. In many environments, the computer programs that interface hardware devices with computers are referred to as "resources" (which are also referred to in some environments simply as hardware drivers, device drivers, or input/output (I/O) drivers, among others). Often a resource is implemented within the operating system of the computer, and thus resides between the hardware devices and the computer applications that use such hardware devices.
By using a resource to interface a hardware device with a computer, a computer application that wishes to access the hardware device can do so through a common set of commands that are independent of the underlying specifics of the hardware device. For example, a resource associated with a disk drive controller may provide a set of commands such as "open file", "read data", "write data" or "close file" that can be called by any computer application that wishes to perform an operation on a disk drive coupled to the controller. It does not matter to the computer application that the disk drive controller is installed in slot 3 or slot 4, or that the controller adheres to the Small Computer Systems Interface (SCSI) or Integrated Drive Electronics (IDE) standard to transmit information between the disk drive and the controller. Moreover, if the computer application wishes to access another disk drive, the same set of generic commands may often be used even if the other disk drive is significantly different from the first.
However, different hardware devices typically do require specific operations to be performed in response to the generic commands issued by a computer application. Thus, a resource is often required to perform device-specific operations for a particular device in order to handle a generic command requested by a computer application. In conjunction with these tasks, the resource typically maintains device-specific information such as the location of the hardware device, the type of device, and other device characteristics.
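By way of illustration only, the relationship between generic commands and device-specific operations described above may be sketched as follows. The class names, method names, and logged operation strings are hypothetical and are not drawn from any particular operating system; an in-memory dictionary stands in for the physical medium.

```python
from abc import ABC, abstractmethod


class DiskResource(ABC):
    """Generic command set seen by applications (hypothetical names)."""

    def __init__(self):
        self.blocks = {}       # stand-in for the physical medium
        self.device_log = []   # device-specific operations that were issued

    @abstractmethod
    def _select(self, block):
        """Device-specific step performed for every generic command."""

    def write_data(self, block, data):
        self._select(block)
        self.blocks[block] = data

    def read_data(self, block):
        self._select(block)
        return self.blocks[block]


class ScsiDiskResource(DiskResource):
    def _select(self, block):
        self.device_log.append(f"SCSI command for logical block {block}")


class IdeDiskResource(DiskResource):
    def _select(self, block):
        self.device_log.append(f"IDE register write for sector {block}")


def copy_block(src, dst, block):
    """Application code: relies only on the generic commands."""
    dst.write_data(block, src.read_data(block))


scsi, ide = ScsiDiskResource(), IdeDiskResource()
scsi.write_data(7, b"payload")
copy_block(scsi, ide, 7)
```

In this sketch, `copy_block` is unaware of which standard either controller follows; only the resource subclasses differ.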
Typically, a resource has, among other information, some form of indication that identifies the resource to the computer applications, generally referred to herein as a resource identifier. A resource may also have some form of indication as to where in the computer the hardware device associated with the resource is located (e.g., at a particular bus location, in a particular slot, etc.), also referred to herein as a location identifier. Furthermore, a resource may have some form of indication that uniquely identifies the hardware device associated with the resource to distinguish that device from other devices that may or may not be installed in the computer, also referred to herein as a device identifier.
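As a minimal sketch, and assuming a simple record layout that is not specified herein, the three identifiers may be pictured as fields of a single resource record. The field names and the example values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    resource_id: str   # how applications refer to the resource
    location_id: str   # where the device sits, e.g. a bus and slot
    device_id: str     # identity of the specific physical device


disk = Resource(resource_id="DISK01", location_id="bus0/slot3", device_id="SN-0001")

# Moving the hardware changes the location identifier, while the
# resource identifier seen by applications stays the same.
disk.location_id = "bus0/slot4"
```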
Conventional concurrent maintenance implementations typically have no manner of automatically reconfiguring a resource in response to a change in the status of the hardware device associated with the resource. Therefore, when a hardware device is installed, removed or replaced, any resource associated with the hardware device often must be manually reconfigured by a system operator (e.g., by manually updating one or more system configuration files associated with the resource). Often, this also requires individual computer applications that rely on a resource to also be manually reconfigured. Such reconfigurations often require the resource and/or computer applications relying on the resource to be temporarily inaccessible to users, thereby extending the downtime associated with conventional concurrent maintenance implementations. Otherwise, automatic reconfiguration may be supported, but only after the computer, or at least the operating system of the computer, is restarted, a process that can often be slow and time consuming.
Another problem associated with conventional concurrent maintenance implementations is that often failure of a particular hardware device can prevent initiation of and/or performance of concurrent maintenance operations. For example, in some conventional implementations, some user interaction through a display or terminal user interface is required to perform operations such as powering down or powering up a bus or slot therein to which a particular hardware device is attached. For single-user computers, for example, the display user interface may be a computer monitor that displays information to a user. For a multi-user computer, the display user interface may be a separate workstation or terminal that is interfaced with the computer.
In many computers, however, failure of some hardware devices may cause some functions in the computers to "lock-up", or halt operation, as a result of uncompleted accesses to failed hardware devices. For example, some computers may not permanently maintain in main storage the program code necessary to operate the display user interface. Instead, such program code may be permanently maintained in an external storage device and swapped into and out of main storage from time to time as needed by the computer, a process generally known as "paging." Whenever program code is stored in the main storage, such program code is also referred to as being "resident" in the computer.
Whenever a hardware device associated with such an external storage device (e.g., a controller) fails, it may not be possible to "page in" the program code for operating the display user interface. As a result, it may not be possible to interface with the computer through the display user interface. Any concurrent maintenance operation that is accessed through the display user interface of the computer, therefore, could not be initiated, and the computer would be irretrievably locked-up, requiring a time consuming full restart of the computer. In addition, with some computers, restarting the computer after a lock-up condition (often referred to as an "abnormal shutdown") may even take longer than after a normal shutdown, as processing must often be performed to restore the computer to a coherent state (if possible), including storage management directory recovery, mirrored DASD synchronization, etc.
Therefore, a significant need exists for a manner of supporting concurrent maintenance in a computer without requiring manual reconfiguration and/or a time consuming system restart to update the resources utilized by computer applications executing in the computer, and/or the applications themselves. Moreover, a significant need exists for a manner of supporting such concurrent maintenance operations that is not reliant on non-resident program code, so that the availability of such operations is not compromised.
The invention addresses these and other problems associated with the prior art by providing an apparatus, program product and method of replacing a failed hardware device in a computer that rely solely on program code and/or other computer facilities that are ensured of being available in the computer during a concurrent maintenance operation, so that, even in the event that a failure occurs in such a hardware device, successful performance of the concurrent maintenance operation is ensured. For example, the initiation of power up and power down functions necessary to permit replacement of a failed device may be performed through a control panel or other similar facility in a computer that is continuously available when a computer is in a fully or partially active and powered-on state.
Furthermore, in some embodiments, the detection of and recovery from a failure in a hardware device may be implemented in a highly automated fashion. Specifically, a concurrent replacement operation may be supported that automatically detects a lock-up condition resulting from a failed attempt to access data using a failed hardware device. Then, upon replacement of the device with a suitable replacement device, a resource that was previously associated with the failed device may be automatically associated with the replacement device such that the failed attempt to access data may be automatically resumed, thereby automating the recovery from the lock-up condition.
In either instance, the amount of computer downtime required to perform a concurrent maintenance operation is minimized, thereby ensuring less interruption of service for users. Moreover, much of the manual configuration that would otherwise be required may be reduced or eliminated, thereby facilitating system maintenance.
Therefore, consistent with one aspect of the invention, a failed hardware device is replaced in a computer, with the failed hardware device having associated therewith a resource that interfaces the failed hardware device with at least one application executing in the computer. Power is removed from the failed hardware device in response to user input received through a control panel on the computer. After user replacement of the failed hardware device with a replacement hardware device, power is supplied to the replacement hardware device in response to user input received through the control panel. Moreover, the resource is automatically associated with the replacement hardware device after power is supplied to the replacement hardware device.
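The sequence described in this aspect may be sketched, purely for illustration, as follows. The `ControlPanel` class, its methods, and the identifier values are hypothetical stand-ins; an actual control panel is a hardware facility rather than a software object, and the sketch assumes that re-association simply copies the identifier of the device found in the slot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    location_id: str
    device_id: Optional[str] = None
    powered: bool = False


@dataclass
class Resource:
    resource_id: str
    location_id: str
    device_id: str


class ControlPanel:
    """Stand-in for an always-available control panel (hypothetical API)."""

    def __init__(self, slot, resource):
        self.slot = slot
        self.resource = resource

    def power_off(self):
        self.slot.powered = False

    def power_on(self):
        if self.slot.device_id is None:
            raise RuntimeError("no device installed in slot")
        self.slot.powered = True
        # Automatic re-association: the resource keeps its identifier but
        # now refers to whatever device is found in the slot.
        self.resource.device_id = self.slot.device_id


slot = Slot("bus0/slot3", device_id="SN-FAILED", powered=True)
resource = Resource("DISK01", "bus0/slot3", "SN-FAILED")
panel = ControlPanel(slot, resource)

panel.power_off()            # user input received through the control panel
slot.device_id = "SN-NEW"    # user physically replaces the device
panel.power_on()             # user input received through the control panel
```

After the final step, the resource identifier seen by applications is unchanged, while the device identifier refers to the replacement device.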
Consistent with another aspect of the invention, a failed controller for an external storage device coupled to a computer over a bus is replaced, with the failed controller having associated therewith a resource that presents a uniform interface to at least one application on the computer. A lock-up condition is detected in the computer resulting from a failed attempt to access data with the external storage device. In response to detection of the lock-up condition, a user is enabled to replace the failed controller with a replacement controller. After replacement of the failed controller with a replacement controller, the resource is automatically updated to associate the replacement controller with the resource, and after the resource is updated, the lock-up condition is recovered from by automatically resuming the failed attempt to access data with the external storage device.
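One way to picture the detection and recovery described in this aspect is the simplified sketch below. It assumes that a lock-up condition can be modeled as a data access that fails and is held pending, which is a simplification; all class and method names are hypothetical.

```python
class Controller:
    def __init__(self, device_id, failed=False):
        self.device_id = device_id
        self.failed = failed

    def read(self, block):
        if self.failed:
            raise IOError(f"controller {self.device_id} not responding")
        return f"data@{block}"


class StorageResource:
    def __init__(self, resource_id, controller):
        self.resource_id = resource_id
        self.controller = controller
        self.pending = []    # accesses that could not complete
        self.results = {}

    def read(self, block):
        try:
            self.results[block] = self.controller.read(block)
        except IOError:
            # Modeled here as the lock-up condition: hold the access.
            self.pending.append(block)

    @property
    def locked_up(self):
        return bool(self.pending)

    def attach_replacement(self, controller):
        # Update the resource, then resume every held access.
        self.controller = controller
        retry, self.pending = self.pending, []
        for block in retry:
            self.read(block)


res = StorageResource("DISK01", Controller("SN-FAILED", failed=True))
res.read(42)                                   # failed attempt to access data
was_locked = res.locked_up                     # lock-up condition detected
res.attach_replacement(Controller("SN-NEW"))   # after user replacement
```

Holding the failed access rather than discarding it is what allows the access to be resumed without involvement of the requesting application.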
These and other advantages and features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawings, and to the accompanying descriptive matter, in which there are described exemplary embodiments of the invention.