1. Field of the Invention
This invention relates to a method of monitoring errors occurring on field replaceable units. More specifically, the invention relates to a method of monitoring errors occurring on field replaceable units, typically those inside storage system cabinets. The invention also relates to a system for conducting such monitoring of errors which occur on field replaceable units housed inside of cabinets, and controlled by storage processors.
2. Description of Related Art
Host systems attached to field replaceable units, such as storage systems, have in the past been required to detect errors that occur on the storage system to allow servicing of the storage system, and to allow a network controlled by the host system to continue to operate in the manner desired. The term "host system" typically means a server which may be connected to multiple storage systems such as those available from EMC Corporation under the trademarks Symmetrix™, Clariion™, etc. In such arrangements, it is important that, in the event of an error occurring at a device such as a storage system, the host system be made aware of the error so that an alert can be dispatched to a service center, which can then service the storage system in which the error occurred to ensure that the network continues to operate smoothly.
More specifically, such storage systems have typically included two storage processors dedicated to controlling the operation of various components of the storage system. Each storage processor also independently keeps an error log, so that the error log can be periodically checked by the host system and, if a serious error is detected, an alert can be issued to a service center.
Current designs for monitoring such errors involve, for example, the host or server using the "dev.a" (block I/O) device driver to continuously poll all the storage processors and deliver all error messages to the host management software. Unfortunately, this technique results in many duplicate messages being reported and also results in degraded system performance because unnecessary input and output, i.e., I/O, is done with the storage processors on the storage system.
In an alternative system, the storage processors at the storage system directly perform "call outs" on errors to a customer service center without reporting the call out to the host system. A problem with this approach is that there is no way that the host system can track the errors such that the operator of the host system is made aware of recurring errors which may require unique and unusual intervention. In addition, the host's view of available paths to a storage processor is obtained.
Still another approach is to provide an auxiliary service processor used to monitor errors in external storage devices using an I2C bus. The errors are stored in non-volatile random access memory, i.e., NVRAM, on the server's processor's error log, and are accessible to the host system through the use of an adapter by calling into the service processor using a modem or network adapter.
Accordingly, in accordance with the method and system described herein, the disadvantages of prior art systems are avoided, and an efficient method and system of monitoring errors is provided, without duplicate reporting and the resulting degradation of throughput, while still allowing the host system to maintain an accurate record of errors occurring at remote storage systems such that the host system can control which errors are reported to a service center.
In one aspect there is provided a method of monitoring errors occurring on field replaceable units in an external cabinet having at least one storage processor. The method includes a first step of reading any error occurring on a field replaceable unit in each cabinet which has a storage processor, and in which the error has been entered into at least one error log by the storage processor. For purposes of this disclosure, examples of field replaceable units include the power supplies, disks, fans, controller boards, memory, or other components which can incur errors, and which are replaceable on service calls by field service technicians. At predetermined intervals, all storage processors connected to the host system are identified by the host system. For each connected storage processor, previous information about the storage processor and its corresponding error log is loaded. A path is selected from the host system to each storage processor, and all field replaceable units are inventoried through each connected storage processor. Each storage processor's pointer is then updated by retrieving all entries in each storage processor's corresponding error log, and new error log entries are detected in each error log and matched with its corresponding field replaceable unit, whereby an alert can be transmitted to a customer service center in the event an error is recorded on the respective error logs.
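By way of illustration only, one monitoring cycle of the method described above might be sketched as follows. The class names, the field names, and the use of a list index as each storage processor's pointer are assumptions made for this sketch and do not appear in the disclosure; path selection is omitted, and the error log is modeled as an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    fru_id: str    # hypothetical identifier of a field replaceable unit
    message: str


@dataclass
class StorageProcessor:
    name: str
    error_log: list = field(default_factory=list)  # entries written by the storage processor
    frus: set = field(default_factory=set)         # units visible through this storage processor


class MonitorAgent:
    """Hypothetical host-side agent; keeps one log pointer per storage processor."""

    def __init__(self):
        # Previous information: index of the first unread entry in each error log.
        self.pointers = {}

    def poll(self, processors):
        """Run one monitoring cycle and return (fru_id, message) pairs to alert on."""
        alerts = []
        for sp in processors:                        # identify connected storage processors
            pointer = self.pointers.get(sp.name, 0)  # load previous information
            inventory = set(sp.frus)                 # inventory the field replaceable units
            entries = list(sp.error_log)             # retrieve all entries in the error log
            for entry in entries[pointer:]:          # entries past the pointer are new
                # Match the new entry with its unit; label it "unknown" if not inventoried.
                fru = entry.fru_id if entry.fru_id in inventory else "unknown"
                alerts.append((fru, entry.message))
            self.pointers[sp.name] = len(entries)    # update the pointer
        return alerts
```

Because the pointer is advanced after each cycle, an entry that has already been read is not reported again on the next cycle, which is how this sketch avoids the duplicate messages noted for continuous polling.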
In a further aspect, if a path cannot be established for any storage processor, there occurs an attempt to re-establish the path after a predetermined amount of time has elapsed. If the path cannot be re-established after the predetermined period of time, it is then determined if there is another path available. If there is no other path available, an alert is transmitted to a customer service center. Alternatively, if there is another path available, another path is selected from the host system to the field replaceable unit.
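The path handling of this further aspect might be sketched as shown below, again purely as an illustration. The function name `select_path`, the `try_connect` callback, and the alert text are hypothetical; the sketch assumes a single re-establishment attempt per path before the next available path is tried.

```python
import time


def select_path(paths, try_connect, retry_delay=0.0, sleep=time.sleep):
    """Return (path, alert): a working path and None, or None and an alert message.

    paths       -- candidate paths from the host system, in order of preference
    try_connect -- hypothetical callback returning True if the path can be established
    retry_delay -- the predetermined wait, in seconds, before the retry
    """
    for path in paths:
        if try_connect(path):
            return path, None
        sleep(retry_delay)          # wait the predetermined amount of time
        if try_connect(path):       # attempt to re-establish the same path
            return path, None
        # Path could not be re-established; fall through to another path, if any.
    # No other path is available: the caller transmits this alert to the service center.
    return None, "no path available to storage processor"
```

Passing `sleep` as a parameter is a design choice of the sketch that lets the waiting period be replaced by a no-op when the logic is exercised in a test.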
In an alternative aspect, there is described a host system for monitoring errors occurring on field replaceable units. Each field replaceable unit is of the type controlled by or interacting with at least one storage processor capable of recording errors in an error log associated with the storage processor. The storage processors are connectable to the host system. The host system includes a monitor agent programmed for identifying at predetermined intervals all storage processors connected to the host system, and for each connected storage processor, loading previous information about the storage processor and its corresponding error log. The monitor agent in the host system is further programmed for selecting a path to each storage processor to which it is connected for conducting an inventory of all its field replaceable units. The monitor agent further serves to update each storage processor's pointer in the host system by retrieving old entries in each storage processor's corresponding error log, to detect new error log entries in each error log, and to match each new error log entry with its corresponding field replaceable unit, such that the host system can be instructed to transmit an alert to a customer service center in the event a new error is recorded on an error log.
Yet still further, the monitor agent can be programmed such that, for any connected storage processor, if the path cannot be established, the host system attempts to re-establish the path after a predetermined period of time, and if the path cannot be re-established after that period of time, the host system determines whether there is another path available. If there is another path available, the host system selects another path to the field replaceable unit. If there is no other path available, the host system is programmed to transmit an alert to a customer service center.