The present invention relates generally to the field of computer systems, and more particularly to error source identification on a time-of-day network.
Accurate timing is important to operating systems and hypervisors for workload management and, more generally, for maintaining the order of events throughout a system. All processors in a symmetric multiprocessor (SMP) system must appear to have the same time. The processors, coupled by fabric buses, cooperate to process transactions for a shared resource. This requires that the time-of-day (TOD) clocks on the processors be consistent to ensure the integrity of transaction data (i.e., that time stamps accurately reflect the sequence of events). The TOD facility provides this capability by substituting a single “step” signal (from a designated “master” chip) for the individual TOD-clock-stepping signal oscillators in each chip. This eliminates variations caused by differences in TOD-clock-stepping rates. A “sync” signal from the “master” chip enables the TOD clock in each “slave” chip to start in synchronization with the master chip, and allows the slave chips to continuously check that this synchronization is maintained. The step signal is generally generated from an external oscillator source, and the sync signal is generated by counting a predetermined number of steps.
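The step/sync scheme described above can be illustrated with a minimal, single-threaded Python simulation. This is a sketch under stated assumptions, not the hardware design: the names `MasterChip`, `SlaveChip` and `SYNC_INTERVAL` are hypothetical, the interval of 8 steps is arbitrary, a slave is assumed to load the master's TOD value on the first sync it sees, and the continuous check is modeled as verifying that every subsequent sync lands on a `SYNC_INTERVAL` boundary of the slave's own TOD value.

```python
from dataclasses import dataclass, field

# Hypothetical: number of step pulses counted by the master between sync pulses.
SYNC_INTERVAL = 8


@dataclass
class SlaveChip:
    name: str
    tod: int = 0
    started: bool = False
    sync_errors: list[int] = field(default_factory=list)

    def on_step(self) -> None:
        # A slave has no stepping oscillator of its own; it advances only on the master's step.
        if self.started:
            self.tod += 1

    def on_sync(self, master_tod: int) -> None:
        if not self.started:
            # Assumption: the first sync starts this TOD clock in lockstep with the master.
            self.tod = master_tod
            self.started = True
        elif self.tod % SYNC_INTERVAL != 0:
            # Continuous check: a sync must always arrive on an interval boundary.
            self.sync_errors.append(self.tod)


@dataclass
class MasterChip:
    slaves: list[SlaveChip]
    tod: int = 0
    steps_since_sync: int = 0

    def step(self) -> None:
        # One pulse from the external oscillator, fanned out as the single step signal.
        self.tod += 1
        for slave in self.slaves:
            slave.on_step()
        self.steps_since_sync += 1
        if self.steps_since_sync == SYNC_INTERVAL:
            # The sync signal is produced by counting SYNC_INTERVAL steps.
            self.steps_since_sync = 0
            for slave in self.slaves:
                slave.on_sync(self.tod)

    def run(self, steps: int) -> None:
        for _ in range(steps):
            self.step()


# Usage: two slaves start on the first sync; chip "b" then loses one step pulse.
a, b = SlaveChip("a"), SlaveChip("b")
master = MasterChip([a, b])
master.run(16)
b.tod -= 1  # simulate chip "b" dropping a single step pulse
master.run(8)
print(a.sync_errors, b.sync_errors)  # [] [23]
```

Because every slave advances only on the shared step signal, a chip that misses or gains a pulse is out of phase at the next sync and is flagged there, while correctly stepping chips report no error.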
When timing errors occur, it is important for diagnostic firmware to be able to analyze the system and determine with certainty the primary source of the error so that appropriate action can be taken. Corrective actions may include repair of a component, de-configuration of selected resources to prevent their use, and/or a service call to replace a defective component with a fully operational unit if the component is a field replaceable unit (FRU).