In computer systems which include a plurality of processors and associated peripheral equipment, a mechanism is generally implemented to track the status of requests and responses transmitted throughout a network or complex of such processors and other equipment. This mechanism is generally implemented so that when a request for data is sent, a CPU (Central Processing Unit) based counter imposes a time limit on satisfaction of the request. If the request is not satisfied within the time limit, the counter generally times out, thereby triggering a high priority machine check which causes the computer network or complex to shut down and initiates execution of recovery code by the CPU. Generally, the counter starts timing when a request for a transaction wins arbitration and is placed on a bus toward a designated destination within the network or complex. The counter will generally stop either upon successful completion of the task being timed by the counter or upon expiration of the designated time period.
One problem with the above time-out mechanism is that a time-out condition generally forces an entire network or complex of connected CPUs to crash and lose all data associated with the system state in existence prior to the crash. In this situation, system recovery may be accomplished only at a very basic stage, with much valuable data having been irretrievably lost due to the time-out condition. Furthermore, the centralization of timer operation in the CPU leaves little data with which to identify a source or cause of the error which caused the time-out condition. Accordingly, re-occurrence of the event causing the time-out may be difficult to prevent.
In another prior art system, timers are added to various system chips in communication with CPUs within a complex of CPUs. Generally, when a time-out condition occurs in such a system chip, the state of that chip at the point in time when the time-out occurred may be obtained, thereby providing information which may help to identify the cause of the failed transaction leading to the time-out condition. This approach generally provides more guidance in debugging a failure leading to a time-out condition than systems employing only CPU-based timers. However, even with implementation of system chip based timers, a time-out condition will generally cause the entire network of computers to crash and lose all information associated with the machine state in existence just prior to the time-out condition. Accordingly, only a very basic recovery operation is available. And, as was the case with the previously discussed CPU-based timer approach, much data is irretrievably lost upon occurrence of a time-out condition.
Another prior art approach involves implementation of a scalable coherent interface (SCI). SCI includes a networking protocol for retrying certain transactions upon expiration of timers associated with timed transactions. Therefore, instead of crashing the system upon timing out a first time, the SCI protocol may be employed to retry transmission of a request for which a response was not received in a timely manner. Thus, when a counter times out, the counter may be initialized to zero, the associated request re-transmitted, and the timer enabled to time the retried transaction. This approach may enable certain time-out conditions to be avoided where a failure was caused by transient effects within the overall network or complex which do not reoccur during a retried transaction. However, certain problems associated with earlier mentioned approaches remain. Specifically, upon occurrence of a final time-out (for a transaction which will not be retried), the system will generally crash, and data associated with the machine state prior to the crash will generally be irretrievably lost. Accordingly, only a very basic recovery operation will be available.
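The retry behavior described above may be illustrated by the following minimal sketch. It is not taken from the SCI specification; the class name TimedRequest, the tick-based counter, and the max_retries parameter are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimedRequest:
    """A request awaited under a tick-based counter (hypothetical model)."""
    limit_ticks: int   # time-out threshold, in arbitrary ticks (assumed unit)
    max_retries: int   # retries allowed before the final time-out (assumed)
    counter: int = 0
    retries: int = 0

    def tick(self, response_arrived: bool) -> str:
        """Advance one tick and report the resulting state."""
        if response_arrived:
            return "done"
        self.counter += 1
        if self.counter < self.limit_ticks:
            return "waiting"
        if self.retries < self.max_retries:
            # Time-out with retries left: reset the counter to zero,
            # re-transmit the request, and keep timing the retried attempt.
            self.retries += 1
            self.counter = 0
            return "retry"
        # No retries left: this is the final time-out.
        return "final_timeout"


def run(request: TimedRequest, response_at_tick: Optional[int]) -> List[str]:
    """Drive the request tick by tick until it completes or finally times out."""
    events: List[str] = []
    tick = 0
    while True:
        tick += 1
        arrived = response_at_tick is not None and tick >= response_at_tick
        state = request.tick(arrived)
        if state != "waiting":
            events.append(state)
        if state in ("done", "final_timeout"):
            return events
```

In this sketch a transient fault that clears before the retried attempt expires yields a "retry" followed by "done", whereas a persistent fault ends in "final_timeout", which corresponds to the case in which the prior art system would generally crash.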
The timing mechanisms employed in the prior art are generally neither synchronized nor coordinated with each other. Furthermore, the timing mechanisms are generally thinly scattered over a large number of devices, whether the timers are located exclusively in CPUs or are located in a combination of CPUs and system chips, such as memory and input/output (I/O) controllers. Accordingly, a fault in an area of a computer network or complex may go undetected until the problem is substantial enough to cause a widespread system shutdown.
In prior art systems employing timers distributed among various system chips, the timers generally have closely spaced time-out values. Accordingly, when a fault is encountered, a plurality of different timers may time out asynchronously in close temporal proximity to each other, thereby causing the overall system to crash and making subsequent identification of the problem leading to the system crash very difficult. It is further noted that in the prior art systems described above, a time-out condition in one CPU or in one system chip may cause an entire complex of CPUs and associated system chips to fail or crash, thereby enabling a failure in 1% of a complex to disrupt operation of 100% of the complex.
Therefore, it is a problem in the art that the machine state of a computer system is lost upon occurrence of a time out condition.
It is a further problem in the art that only a very limited recovery operation is possible after occurrence of a time-out condition.
It is a still further problem in the art that identifying the timer whose expiration caused a system crash may be very difficult in the systems of the prior art.
It is a still further problem in the art that a transaction failure and associated time out condition in one chip of a complex may cause the entire complex to crash or fail.
These and other objects, features and technical advantages are achieved by a system and method which deploy timers within devices in a distributed manner throughout a system or complex which includes CPUs and associated system chips, where the timers have a hierarchy of time-out values, and where the timers are able to independently experience time-out conditions, each generating a localized failure condition while enabling a remainder of the complex to continue operating. Preferably, a chip, device, or sub-system affected by the time-out or other error condition continues operating in a degraded or safety mode and communicates its condition to other chips and sub-systems so that the rest of the complex may continue operating while preferably bypassing the chip, device, or sub-system affected by the time-out condition.
The various timing operations preferably operate within a coordinated hierarchical structure wherein each timer monitors an operation occurring below its own level in a hierarchy while also being monitored by a device (which may be a timer) at a higher level in the hierarchy, where the higher level device (whether timer, CPU or other device) is generally able to monitor the timers below its level for a time period exceeding the time-out value of the timer being so monitored. In this manner, a time-out condition of a timer at one level in the hierarchy may be detected at the next higher level in the hierarchy thereby enabling the higher level device to respond to a time-out condition in a pre-determined and controlled manner, thereby enabling the higher level device to preserve its own data, preserve control over its own operation and beneficially isolate the error condition to the lower level device or system, thereby avoiding a shutdown of an entire complex or system.
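The hierarchical monitoring relationship described above may be sketched as follows. This is an illustrative model only, not the disclosed implementation; the names HierTimer, attach and tick_all, the tick unit, and the choice to halt every notified ancestor are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HierTimer:
    """One timer in the hierarchy (hypothetical model)."""
    name: str
    timeout: int                          # ticks, assumed unit
    parent: Optional["HierTimer"] = None  # monitoring device one level up
    elapsed: int = 0
    state: str = "running"                # "running", "timed_out" or "halted"
    notices: List[str] = field(default_factory=list)


def attach(child: HierTimer, parent: HierTimer) -> None:
    """Place child under parent; the monitor must outlast the monitored timer."""
    if parent.timeout <= child.timeout:
        raise ValueError("monitoring timer must exceed the monitored time-out")
    child.parent = parent


def tick_all(timers: List[HierTimer]) -> None:
    """Advance every running timer by one tick, lowest level first."""
    for timer in sorted(timers, key=lambda t: t.timeout):
        if timer.state != "running":
            continue
        timer.elapsed += 1
        if timer.elapsed >= timer.timeout:
            timer.state = "timed_out"
            # Notify each higher level, which halts in a controlled manner
            # instead of timing out itself, so the fault stays localized.
            monitor = timer.parent
            while monitor is not None:
                if monitor.state == "running":
                    monitor.notices.append(timer.name)
                    monitor.state = "halted"
                monitor = monitor.parent
```

With three levels having time-out values of 3, 6 and 12 ticks, the lowest timer expires on the third tick while the two higher levels record the notice and stop timing, so only one time-out condition is recorded for the fault.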
Since the equipment affected by a time-out condition preferably continues operating, albeit in a degraded mode, during the time-out, and the rest of the complex may continue operating substantially normally, the complex is preferably able to preserve system state information which existed prior to the time-out condition and to continue processing information associated with the system state. Moreover, since the chip or device affected by the time-out continues operating after the time-out, and is able to communicate its condition to other chips and/or devices in communication with the chip or device which has timed out, information pertaining to the cause of the time-out condition may be effectively gathered by the complex, thereby preferably aiding a subsequent debugging process.
In a preferred embodiment, a plurality of timers associated with various chips, devices, and sub-systems throughout a complex of connected CPUs operate independently of one another, enabling most timers within the complex to continue operating unhindered even while one timer within the complex experiences a time-out condition. Depending upon the severity of the condition of the malfunctioning (timed out) device or sub-system, the malfunctioning device may operate to isolate itself from communication with other systems within the complex by responding to communication messages from other devices with a message indicating that a fault condition exists. In this manner, the inventive mechanism preferably operates to limit propagation of any error or fault condition to systems or devices which did not experience a time-out condition.
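The self-isolation behavior described above, in which a timed-out device answers incoming messages with a fault indication, may be sketched as follows. The Device class, the message format, and the route helper are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


@dataclass
class Device:
    """A chip or sub-system that keeps answering after a time-out (assumed model)."""
    name: str
    mode: Mode = Mode.NORMAL

    def on_timeout(self) -> None:
        # The device keeps running, but in a degraded mode.
        self.mode = Mode.DEGRADED

    def handle(self, message: str) -> Dict[str, Optional[str]]:
        if self.mode is Mode.DEGRADED:
            # Reply with a fault indication instead of servicing the request.
            return {"from": self.name, "status": "fault", "payload": None}
        return {"from": self.name, "status": "ok", "payload": f"ack:{message}"}


def route(devices: List[Device], message: str) -> Tuple[List[dict], List[str]]:
    """Send message to every device; collect healthy replies and bypassed names."""
    replies: List[dict] = []
    bypassed: List[str] = []
    for device in devices:
        reply = device.handle(message)
        if reply["status"] == "fault":
            bypassed.append(device.name)
        else:
            replies.append(reply)
    return replies, bypassed
```

Because the faulted device still replies, the requester learns of the condition immediately and may bypass that device rather than waiting for its own timer to expire.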
In a preferred embodiment, a hierarchy of time-out periods may be established for a range of possible transactions within the complex of connected CPUs. The implementation of a hierarchy of time-out periods preferably avoids having a plurality of different timers time out simultaneously. Generally, whenever a timer at one level of the hierarchy experiences a time-out condition, other timers within the complex will preferably continue operating and simultaneous time-outs may thereby be avoided. Preferably, the devices associated with counters at various levels of the time-out value hierarchy are notified of a time-out condition occurring in connection with a device or transaction at a lower ranking level of the hierarchy. Preferably, after a timer times out, the timer or other device monitoring the timed-out timer halts its timing operation in a controlled manner without timing out. A device (whether timer or other device) monitoring a timed-out timer may alternatively continue its timing operation.
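As a simple arithmetic illustration of a hierarchy of time-out values, the following sketch assigns each level a value that is a fixed multiple of the level below it and reports the smallest gap between neighboring values. The geometric spacing rule and the names build_hierarchy and min_gap are assumptions; the disclosure does not prescribe any particular spacing.

```python
from typing import Dict, List


def build_hierarchy(levels: List[str], base: int, factor: int) -> Dict[str, int]:
    """Assign time-out values from lowest to highest level (assumed rule).

    Each level receives base * factor**i ticks, so every higher level waits
    longer than the level it monitors when factor is greater than 1.
    """
    if base <= 0 or factor <= 1:
        raise ValueError("base must be positive and factor greater than 1")
    return {name: base * factor ** i for i, name in enumerate(levels)}


def min_gap(timeouts: Dict[str, int]) -> int:
    """Smallest difference between neighboring time-out values."""
    ordered = sorted(timeouts.values())
    return min(later - earlier for earlier, later in zip(ordered, ordered[1:]))
```

For example, build_hierarchy(["io_ctrl", "mem_ctrl", "cpu"], 10, 4) yields values of 10, 40 and 160 ticks with a minimum gap of 30 ticks, whereas closely spaced values such as 100, 101 and 103 ticks leave a gap of only 1 tick, so several timers may expire at nearly the same time.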
In a preferred embodiment, recovery code is generally executed in response to a time-out condition. Time-out thresholds may be selected based on factors including the nature of a device initiating a transaction within the system or complex, and the nature of the transaction being conducted. Accordingly, recovery code to be executed in response to a time-out condition may be customized to suit the particular device and the particular transaction associated with the applicable time-out condition.
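The selection of time-out thresholds and recovery code by device and transaction type may be sketched as a pair of lookup tables. All device kinds, transaction names, threshold values, and recovery routines below are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (device kind, transaction type)


def reissue_read(device: str) -> str:
    return f"{device}: reissue read"


def flush_and_isolate(device: str) -> str:
    return f"{device}: flush queues and isolate"


def default_recovery(device: str) -> str:
    return f"{device}: log state and notify higher level"


# Hypothetical thresholds in ticks, chosen per device and transaction.
THRESHOLDS: Dict[Key, int] = {
    ("io_ctrl", "read"): 50,
    ("mem_ctrl", "write"): 20,
}

# Hypothetical recovery routines tailored to the same keys.
RECOVERY: Dict[Key, Callable[[str], str]] = {
    ("io_ctrl", "read"): reissue_read,
    ("mem_ctrl", "write"): flush_and_isolate,
}


def timeout_for(device_kind: str, transaction: str, default: int = 100) -> int:
    """Look up the threshold for this device and transaction."""
    return THRESHOLDS.get((device_kind, transaction), default)


def recover(device_kind: str, transaction: str, device_name: str) -> str:
    """Run the recovery routine matching the timed-out device and transaction."""
    routine = RECOVERY.get((device_kind, transaction), default_recovery)
    return routine(device_name)
```

Keying both tables on the same pair keeps the threshold and the recovery routine for a given device and transaction together, with a default applying to any combination not listed.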
Therefore, it is an advantage of a preferred embodiment of the present invention that the overall system or complex may continue operating after occurrence of a time-out condition within the complex.
It is a further advantage of a preferred embodiment of the present invention that system data is generally preserved during a time-out condition, and recovery from the condition may be accomplished through execution of appropriate recovery code.
It is a still further advantage of a preferred embodiment of the present invention that a location of a failure associated with a time-out condition is readily identifiable.
It is a still further advantage of a preferred embodiment of the present invention that generally only one time-out condition will arise in connection with one operating error, thereby avoiding the confusion arising from having a number of simultaneous time-out conditions.
It is a still further advantage of a preferred embodiment of the present invention that recovery code to be executed in response to a particular time-out condition may be tailored to suit a device and/or a transaction associated with the time-out condition.
It is a still further advantage of a preferred embodiment of the present invention that the complex or system may continue operating after occurrence of a time-out condition.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.