The invention relates generally to mass storage systems, and in particular to a synchronized clock system for use in a disk drive controller.
In a typical, modern, high speed, high capacity mass storage disk drive controller system, such as the EMC Symmetrix disk drive controller, it is common to employ a plurality of processors, each running its own code and each having its own operation within the collective whole of the controller. The controller can have, for example, sixteen or more director boards, each board having two CPUs, and passing data and other information between controller memory and either a series of disk drives or connected host computers.
Each CPU and director board could have its own time stamp, for example, for tracing the activity of the system during any board or system failure, and until recently the director boards typically kept their own time since they were all initialized at the same time. However, the commonly initiated clocks would soon drift apart and fail to remain synchronous. In the SYMM 4 version of the EMC Symmetrix system, there was implemented a hardware/software solution to provide a common time stamp for all of the microprocessors of the system, within, for example, one microsecond. One of the director boards became a master board which sent a clock signal out on a “clock” line or bus available to all other boards. The clock signal was used to increment, after a common initialization process, all of the CPU clock counters. If, for any reason, the clock signal was not available on the clock line, the processors/boards would switch, internally, to a local clock and would thus fall out of synchronization with each other.
When operating without the common clock, the processors, shortly after they were initially started, even if they were started at the same instant, would not remain synchronous. As a result, execution times of the same process would vary even if they were intended to start at the same time and the internal counters identifying clock time would drift so that, for example, in the event of system failure, it would generally not be possible to precisely determine the correct order of events.
In the Symmetrix SYMM 4 system, the solution of providing timing circuitry across the entire system, and the initialization thereof, was performed solely in hardware, and essentially provided a zero difference in time among the various processors. In this system, because a common clock pulse was provided to all units, the units would always clock together, even if the common clock pulses were not precisely periodic or precisely at the frequency called for. Thus, a single clock (and resynchronization) line was provided on the backplane, with clock counters at each director board incrementing on, for example, the rising edge of a common clock pulse. Further, the director boards were periodically resynchronized (using the clock line), and checks were performed to ensure that the boards remained synchronized to within about five microseconds.
In this manner, the synchronization of this system, to provide an effective trace routine among the multiprocessors of the system, was effectively improved. However, should a fault occur so that no clock was provided on the clock line, the processors switched to their internal clocks, and the system continued to operate though without the advantage of a precise trace routine. This was not, however, detrimental to overall operation of the system but merely made troubleshooting somewhat more tedious and difficult.
Thus, if the external common clock was unavailable for more than approximately five microseconds, a synchronization event was declared, the local counter was reset, and, if the problem persisted, the processors switched to their own internal clocks. Thus, shorts, opens, a “dead master”, etc., which could have resulted in a lack of a clock signal, were not a significant failure, would not take the system down, and only affected the trace program.
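The fallback behavior just described can be sketched as a small simulation. This is only an illustrative model, not the hardware described above: the class name `LocalClockMonitor`, the one-microsecond `tick` granularity, and the `persist_limit` value (how many consecutive synchronization events count as a persisting problem) are all assumptions introduced for the example; only the roughly five microsecond threshold comes from the text.

```python
from dataclasses import dataclass

# Approximate figure taken from the text above.
MISSING_CLOCK_THRESHOLD_US = 5


@dataclass
class LocalClockMonitor:
    """Hypothetical model of one board watching the common clock line."""
    counter: int = 0                 # local clock counter
    silent_us: int = 0               # microseconds since the last pulse
    sync_events: int = 0             # total synchronization events declared
    consecutive_events: int = 0      # events with no pulse seen in between
    using_internal_clock: bool = False
    persist_limit: int = 3           # assumption: "problem persisted" cutoff

    def tick(self, pulse_seen: bool) -> None:
        """Advance the model by one microsecond."""
        if self.using_internal_clock:
            # The board now free-runs on its own clock.
            self.counter += 1
            return
        if pulse_seen:
            self.counter += 1
            self.silent_us = 0
            self.consecutive_events = 0
            return
        self.silent_us += 1
        if self.silent_us > MISSING_CLOCK_THRESHOLD_US:
            # Declare a synchronization event and reset the local counter.
            self.sync_events += 1
            self.counter = 0
            self.silent_us = 0
            self.consecutive_events += 1
            if self.consecutive_events >= self.persist_limit:
                self.using_internal_clock = True
```

In this sketch a board that loses the common clock keeps running, but its counter no longer tracks the counters of the other boards, which is why only the trace program was affected.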
Nevertheless, in the process of improving the disk controller, it became clear that the microcode of each director board began to use and rely upon the clock signal and the resulting counter clock time for scheduling. As a result, a failure to provide the external, common clock signal, and the resulting loss of synchronization, now could have a substantial deleterious effect on operation of the system. Accordingly, in the SYMM 4 version of the EMC system, when the hardware detected a missing or “dead” clock on the common clock line, it would generate a high level interrupt to the processor. If the microprocessor based code confirmed that the clock was missing or “dead”, it then declared a synchronization event, switched to its internal clock, and modified the scheduling of the scheduled events as appropriate.
The invention advantageously provides a method and apparatus for improving the use of clock synchronization in a multiprocessing disk controller system in which clock time across a plurality of units becomes important. Other advantages of the system are a more reliable operating system and platform, more reliable trace scheduling and hence better tracing during a failure mode, and the ability of the microcode to rely upon the clock for scheduling and other activities.
The invention relates to a disk drive controller having a plurality of director elements. Each director element is able to control the flow of data therethrough and is responsive to external clock signals to synchronize its internal clock timing. The disk drive controller features a first master bus and a secondary master bus, each bus being connected to each director element, and each director element having circuitry for monitoring the occurrence of clock pulses over the buses and circuitry for switching from the master bus to the secondary bus for the receipt of clock pulses upon the occurrence of a failure of clock pulses over the master bus.
In particular embodiments of the invention, each director has a counter responsive to each received clock pulse for incrementing its count, a switch for selecting from which bus to receive the clock pulses, hardware circuitry for identifying a first low threshold failure of clock pulses on the first master bus and for effecting a synchronization event in response thereto wherein the counter is reset, and a microcoded processor for deciding whether to cause the switch to the secondary bus for receiving clock pulses.
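The division of labor in such a director can be illustrated with a short sketch: a counter, a bus selector, a hardware-style threshold check that resets the counter, and a separate microcode-style decision on whether to switch buses. All names here (`Bus`, `Director`, `LOW_THRESHOLD`, `step`) are hypothetical, the threshold is expressed in missed pulse intervals purely for the example, and the switching policy shown (switch only when the secondary bus is seen to be pulsing) is an assumption, not the policy of the invention.

```python
from enum import Enum
from typing import Dict


class Bus(Enum):
    PRIMARY = "primary"      # first master bus
    SECONDARY = "secondary"  # secondary master bus


class Director:
    """Hypothetical model of one director element."""

    LOW_THRESHOLD = 5  # assumption: missed pulse intervals before an event

    def __init__(self) -> None:
        self.counter = 0             # increments on each received pulse
        self.selected = Bus.PRIMARY  # position of the bus-selection switch
        self.missed = 0              # consecutive intervals with no pulse
        self.sync_events = 0

    def step(self, pulses: Dict[Bus, bool]) -> None:
        """Process one clock interval; `pulses` says which buses pulsed."""
        if pulses[self.selected]:
            self.counter += 1
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.LOW_THRESHOLD:
            self._hardware_sync_event()
            if self._microcode_should_switch(pulses):
                self.selected = Bus.SECONDARY

    def _hardware_sync_event(self) -> None:
        # Hardware role: declare a synchronization event, reset the counter.
        self.sync_events += 1
        self.counter = 0
        self.missed = 0

    def _microcode_should_switch(self, pulses: Dict[Bus, bool]) -> bool:
        # Microcode role (assumed policy): leave the primary bus only if
        # the secondary bus is currently delivering pulses.
        return self.selected is Bus.PRIMARY and pulses[Bus.SECONDARY]
```

Keeping the threshold detection separate from the switching decision mirrors the text: the hardware reacts quickly and uniformly, while the processor retains the final say over which bus supplies the clock.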
The method of the invention relates to controlling the flow of data through director elements of a disk drive controller, and being responsive to external clock signals to synchronize the internal clock timing of the director elements. The method features providing a first and a second master bus, connecting each bus to each director element, monitoring at each director element the occurrence of clock pulses over the buses, and switching from a first master bus to a second master bus for receipt of clock pulses upon the occurrence of a failure of clock pulses over the master bus.
The method further features determining, by consensus of the directors, whether a clock failure has occurred on a particular bus. The method also features employing the clock synchronizing signals over the master bus for internal operations.
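One simple way to picture a consensus determination is a vote among the directors. The sketch below is an assumption made for illustration only: the text does not say how consensus is reached, and the function name `bus_failed_by_consensus` and the strict-majority rule are hypothetical.

```python
from typing import Iterable


def bus_failed_by_consensus(reports: Iterable[bool]) -> bool:
    """Return True if a strict majority of directors report a failure.

    Each element of `reports` is one director's local observation of a
    given bus: True means that director saw the clock pulses fail.
    The strict-majority rule is an assumption, not taken from the text.
    """
    votes = list(reports)
    if not votes:
        return False  # no directors reporting, so no failure is declared
    return sum(votes) * 2 > len(votes)
```

Under this rule a single director with a faulty receiver cannot by itself cause the whole controller to abandon a healthy bus.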
Accordingly, the invention advantageously provides for failure of a first external clock generating mechanism so that a plurality of directors can remain in synchronism even if there is a failure of clock pulses over a first bus. The invention also advantageously enables the director elements to schedule operations in accordance with a master clock time related to the clock times of all other directors in the system.