The present invention relates to fault location and, in particular, pertains to fault location in passive transmission lines that transmit data from one component to another.
In today's fault-tolerant systems, a continuous passive transmission line can be used to transmit information or data from one component to another. When a fault occurs, e.g. when a component fails or when the transmission line fails in a particular location, the faulty component or transmission line must be replaced. Before the replacement occurs, however, the subsystem typically must respond to the fault condition, e.g. by implementing a remedial recovery process. In order to respond to the fault condition, the subsystem will likely be required to make assumptions about the nature and location of the fault. Many times, though, the nature and location of the fault cannot easily be ascertained, thereby complicating the assumption-formulating process undertaken by the subsystem. Assumptions that are incorrect can have adverse consequences. For example, some systems might experience a loss of data availability. Additionally, incorrect assumptions can cause the wrong components to be replaced, thereby requiring extra time, effort and resources of those who are responsible for servicing such systems.
As an example, consider so-called high reliability computer disk storage systems. In high reliability computer disk storage systems, there is a desire to have redundancy in all of the physical parts that make up a subsystem to reduce the potential for loss of data and down time upon failure of a part or component. The use of dual disk storage controllers, each having its own memory, provides several major benefits to a disk storage system. For example, (1) a redundancy of storage information is retained to allow for recovery in the case of failure or loss of one controller or its memory; (2) recovery from a disabled controller is feasible due to the failover capabilities of the secondary controller; and (3) greater system up time is achieved through the secondary controller being available.
With the desire for more performance out of these redundant subsystems, caching and the use of memory as temporary storage have become commonplace. Keeping these duplicate physical memories in synchronization can be difficult. Some disk systems use a latent (delayed or massive update) process to create this duplication, but that approach tends to degrade performance and is very complex to manage. Another approach is to form a real time mirrored memory process to create and accurately retain this duplication of data. The use of real time, synchronized, redundant memory (mirrored memory) in dual controllers can improve speed and accuracy in the case of a failover from one controller to the other.
However, this use of redundant memory makes the problem of providing multiple disk storage controller solutions substantially more difficult. Some of the problems that arise in these types of systems include how to effectively and reliably (1) detect controller failure early on in the context of mirrored memory processing so as to reduce potential problems that may occur from a later discovery of the failure; (2) detect controller failure without significant hardware and/or software overhead requirements; and (3) detect controller failure to separate the controllers and discontinue mirroring of their memories without loss of processing operations and capabilities.
In current AUTORAID subsystem implementations, a significant number of the signals that get passed between controllers are used for mirror traffic. If the interface between the controllers is faulty, the controller electronics can no longer support the mirroring function between the two boards that support the controllers. When the mirror function is not operating, then only one controller board can run in the system. The other controller board will not have access to the proper memory image, so it must discontinue operation.
FIG. 1 shows an exemplary dual controller system 10 that includes a controller A and a controller B. Each controller includes memory of some type (here illustrated as a DRAM) and a memory controller that is configured to manage and control the memory for each controller. As the "X" in the figure indicates, faults can occur in any one of three locations, i.e. on or in each of the controllers, or somewhere in the connecting mid-plane. The operation of exemplary dual controller disk storage systems is described in the following U.S. Patents, assigned to the assignee of this document and incorporated by reference herein: U.S. Pat. Nos. 5,928,367, 5,699,510, 5,553,230, and 5,856,989.
Given today's available technology, the hardware subsystem is not able to determine the exact location of a component that has failed. If the controllers decide to continue operation with controller A, and controller A is the one that has failed, then the state of the machine will advance on a component that must ultimately be replaced. The incorrect assumption or decision on the part of the subsystem will become apparent when service personnel first attempt a repair by replacing controller B. When this controller is replaced, the mirror interface will continue to be faulted since controller A has yet to be repaired. At this point, the service procedure will require that the subsystem be shut down and that all important information be stored on the disk drives. This shut down procedure will cause a loss of data availability. Controller A will then be replaced and the required information will be restored from the disk drives.
In a worst-case situation, even more data availability is lost when the mid-plane (or medium that connects the two controllers) is faulty. In this case, the fault is still present in the system when the second controller is replaced. If both controllers are replaced and the problem persists, then the mid-plane can be assumed to be the faulty component. Again, the system must be shut down and the necessary data stored to disk drives. It would be far more desirable to know where the fault resides so that the proper action can be taken by the repair personnel at the start of the repair procedure.
Accordingly, this invention arose out of concerns associated with improving fault location in various systems, and particularly those systems that comprise mirrored memory dual controller disk storage systems.
Methods and systems for fault location are described. In one described embodiment, an "in circuit" solution is provided for locating faults along a passive transmission line. Once a fault occurs, various hardware gathers information that is necessary to determine which of a number of different replaceable components has failed. This enables the subsystem to properly respond to the fault condition and thereby eliminate any guessing that could potentially lead to loss of data availability.
In the particular described embodiment, signals are driven and received through a selected input/output (I/O) pad. Logic circuitry is provided and launches a wave onto the passive transmission line. Immediately following the launching of the wave, the I/O pad is monitored and can sense the reflections from the wave that has just been launched. By analyzing the reflections, and more specifically the time that it takes for the reflection to be sensed, a determination is made as to the fault location. Once the fault location (or distance thereto) is ascertained, a determination can be made as to which component has failed. At this point, an intelligent decision can be made as to which component should continue operation.
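As a rough illustration of the timing analysis described above, the following Python sketch converts a round-trip reflection time into a distance and then maps that distance onto one of the three replaceable components (controller A, the mid-plane, or controller B). The propagation velocity, the segment lengths, and all function and constant names are assumptions chosen purely for illustration; none of them is taken from the described embodiment.

```python
# Illustrative sketch only. The velocity and segment lengths below are
# assumed values, not figures from the described embodiment.
PROPAGATION_VELOCITY_M_PER_S = 1.5e8  # assumed: roughly half the speed of light

# Assumed trace lengths in metres, ordered outward from the launching
# I/O pad, which is taken here to sit on controller A.
SEGMENTS = [
    ("controller A", 0.10),
    ("mid-plane", 0.20),
    ("controller B", 0.10),
]


def distance_to_fault(round_trip_s, velocity=PROPAGATION_VELOCITY_M_PER_S):
    """Return the one-way distance to the point that reflected the wave.

    The wave travels to the fault and back, so the one-way distance is
    half of velocity multiplied by the round-trip time.
    """
    return velocity * round_trip_s / 2.0


def locate_component(distance_m, segments=SEGMENTS):
    """Return the name of the segment that contains distance_m.

    Returns None when the distance lies beyond the end of the line.
    """
    end_of_segment = 0.0
    for name, length in segments:
        end_of_segment += length
        if distance_m <= end_of_segment:
            return name
    return None
```

With these assumed values, a reflection sensed 2 ns after launch corresponds to a distance of about 0.15 m, which falls within the assumed mid-plane segment, so the mid-plane would be identified as the component to replace.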
In one embodiment, a fault-detection application specific integrated circuit (ASIC) is provided and comprises wave-generating circuit means configured to generate a wave that can be propagated along a transmission line along which information or data is transmitted. Reflection-sensing circuit means are configured to receive a wave that has been reflected because of a fault encountered by the generated wave and determine a propagation time associated with the reflected wave.
In another embodiment, a fault location system comprises multiple controllers that are configured to produce and transmit data. Connection media is provided and communicatively links the multiple controllers with one another. At least one wave propagator is configured to generate a wave that can be propagated along the connection media. At least one wave reflection sensor is configured to sense a propagated wave that has been reflected because of encountering a fault in its propagation path. At least one fault locator is configured to ascertain the location of a fault based upon the reflected wave that is sensed by the one wave reflection sensor.
In yet another embodiment, a fault location method comprises launching a wave along a transmission line that is configured to carry data. A reflected wave is received that is reflected by a fault that is encountered by the launched wave in its propagation path. An amount of time is determined that is associated with the time between launching the wave and receiving the reflected wave. From the amount of time, a location of the fault that was encountered by the launched wave is ascertained.
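The steps of this method can be sketched in software, under stated assumptions. The Python example below treats the monitored I/O pad as a list of voltage samples taken at a fixed period starting at the moment of launch, and treats the first sample that departs from the launch level by more than a threshold as the arrival of the reflected wave. The sampling model, the threshold test, and every name used here are hypothetical; the source describes hardware circuitry and does not specify how the reflection is detected.

```python
# Illustrative sketch only. The sampled-voltage model and the threshold
# test are assumptions, not the detection scheme of the described method.

def measure_round_trip(samples, sample_period_s, threshold):
    """Return the time from launch to the first sensed reflection.

    samples[0] is assumed to be the pad level immediately after the wave
    is launched. A reflection is assumed to show up as a sample that
    differs from that level by more than threshold. Returns None when
    no such sample exists.
    """
    if not samples:
        return None
    launch_level = samples[0]
    for index, level in enumerate(samples):
        if abs(level - launch_level) > threshold:
            return index * sample_period_s
    return None


def fault_distance(samples, sample_period_s, threshold, velocity_m_per_s):
    """Return the one-way distance to the fault, or None if none is sensed."""
    round_trip_s = measure_round_trip(samples, sample_period_s, threshold)
    if round_trip_s is None:
        return None
    # The wave covers the distance twice: out to the fault and back.
    return velocity_m_per_s * round_trip_s / 2.0


# Example with made-up samples: the pad level changes at the fifth sample.
example_samples = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
example_distance = fault_distance(example_samples, 0.5e-9, 0.2, 1.5e8)
```

In this made-up example the level changes at sample index 4, giving a round-trip time of about 2 ns and, at the assumed velocity, a distance of about 0.15 m. In a real design the time resolution, and therefore the distance resolution, would be limited by the sampling clock.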