1. Field of the Invention
This invention relates to the field of integrated circuits. More particularly, this invention relates to the detection of errors including both random errors and systematic errors and the recovery from such errors within processing stages of an integrated circuit.
2. Description of the Prior Art
It is known to provide integrated circuits that can be considered to be formed of a series of serially connected processing stages (e.g. a pipelined circuit). Between each of the stages is a signal-capture element such as a latch or a sense amp into which one or more signal values are stored. The processing logic of each processing stage is responsive to input values received from preceding processing stages or elsewhere to generate output signal values to be stored in an associated output latch. The time taken for the processing logic to complete its processing operations determines the speed at which the integrated circuit may operate. If the processing logic of all stages is able to complete its processing operation in a short period of time, then the signal values may be rapidly advanced through the output latches resulting in high speed processing. The system cannot advance signals between stages more rapidly than the slowest processing logic is able to perform its processing operation of receiving input signals and generating appropriate output signals. This limits the maximum performance of the system.
In some situations it is desired to process data as rapidly as possible and accordingly the processing stages will be driven so as to advance their processing operations at as rapid a rate as possible until the slowest of the processing stages is unable to keep pace. In other situations, the power consumption of the integrated circuit is more important than the processing rate and the operating voltage of the integrated circuit will be reduced so as to reduce power consumption up to the point at which the slowest of the processing stages is again no longer able to keep pace. Both of these situations in which the slowest of the processing stages is unable to keep pace will give rise to the occurrence of processing errors (i.e. systematic errors).
One way of avoiding the occurrence of processing errors is to drive the integrated circuit with processing clocks having a frequency known to be less than the maximum permissible by a tolerance range that takes account of worst case manufacturing variation between different integrated circuits, operating environment conditions, data dependencies of the signals being processed and the like. In the context of voltage level, it is normal to operate an integrated circuit at a voltage level which is sufficiently above a minimum voltage level to ensure that all processing stages will be able to keep pace taking account of worst case manufacturing variation, environmental conditions, data dependencies and the like. It will be appreciated that the conventional approach is cautious in restricting the maximum operating frequency and the minimum operating voltage to take account of the worst case situations.
Besides systematic processing errors that result from the slowest of the processing stages being unable to keep pace when a processor is run at too high a frequency or too low an operating voltage, integrated circuits are also subject to random errors known as single event upsets (SEUs). An SEU is a random error (bit-flip) induced in a device by an ionising particle such as a cosmic ray or a proton. The change of state is transient, i.e. pulse-like, so a reset or rewriting of the device causes normal behaviour thereafter. It is known to use error correction codes to detect and correct random errors. However, such error correction techniques necessarily introduce delay as a result of the processing time required for error detection and correction. This processing delay is justifiable in environments such as noisy communication channels where error rates are high yet it is important to suppress errors in the processed received data to within a predetermined error rate. By way of contrast, in the case of integrated circuits where it is generally desired to process data as rapidly as possible, it is undesirable to introduce error correction to critical paths of the data processing operations due to the delay and associated negative performance impact that error correction circuitry incurs.
Viewed from one aspect the present invention provides an integrated circuit for performing data processing, said integrated circuit comprising:
a plurality of processing stages, a processing stage output signal from at least one processing stage being supplied as a processing stage input signal to a subsequent processing stage, wherein said at least one processing stage comprises:
processing logic operable to perform a processing operation upon at least one coded input value to generate a processing logic output signal, said coded input value being an input value to which an error correction code has been applied;
a non-delayed signal-capture element operable to capture a non-delayed value of said processing logic output signal at a non-delayed capture time, said non-delayed value being supplied to said subsequent processing stage as said processing stage output signal following said non-delayed capture time;
a delayed signal-capture element operable to capture a delayed value of said processing logic output signal at a delayed capture time later than said non-delayed capture time;
error correction logic operable to detect an occurrence of a random error in said delayed value of said processing logic output signal, to determine if said detected random error is correctable using said error correction code and to either generate an error-checked delayed value or to indicate that said detected random error is not correctable;
a comparator operable to compare said non-delayed value with said error-checked delayed value to detect a change in said processing logic output signal at a time following said non-delayed capture time, said change being indicative of a systematic error whereby said processing logic has not finished said processing operation at said non-delayed capture time or of a random error in said non-delayed value; and
error-repair logic operable when said comparator detects said change in said processing logic output signal to perform an error-repair operation suppressing use of said non-delayed value either by replacing said non-delayed value by said error-checked delayed value in subsequent processing stages or by initiating repetition of said processing operation and processing operations of subsequent processing stages if said error correction logic indicates that said detected random error is not correctable.
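By way of illustration only, the decision flow of a single processing stage as set out above may be sketched in software. The sketch below is a behavioural model in Python, not a circuit description. The names `StageResult`, `run_stage` and `ecc_check` are hypothetical and do not appear in this specification, and the error correction logic is assumed to be available as a function returning an error-checked value together with a flag indicating whether the value is correctable.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class StageResult:
    output: Optional[int]  # value used by subsequent stages, None if a replay is needed
    repair_needed: bool    # an error-repair operation was triggered
    replay: bool           # the processing operation must be repeated


def run_stage(
    non_delayed: int,
    delayed: int,
    ecc_check: Callable[[int], Tuple[Optional[int], bool]],
) -> StageResult:
    """Model one stage: check the delayed value, compare, then repair if required.

    ecc_check is an assumed stand-in for the error correction logic. It returns
    (error_checked_delayed_value, correctable).
    """
    checked, correctable = ecc_check(delayed)
    if not correctable:
        # Uncorrectable random error in the delayed value: repeat the operation.
        return StageResult(output=None, repair_needed=True, replay=True)
    if checked != non_delayed:
        # Comparator detects a change: suppress the non-delayed value and
        # substitute the error-checked delayed value.
        return StageResult(output=checked, repair_needed=True, replay=False)
    # No change detected: the non-delayed value was correct and is used as is.
    return StageResult(output=non_delayed, repair_needed=False, replay=False)
```

In this model the error check sits only on the delayed path, so the non-delayed value is forwarded without waiting for it, which mirrors the off-critical-path placement described in the text.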
The present technique recognises that the operation of the processing stages themselves can be directly monitored to find the limiting conditions in which they fail. When actual failures occur, then these failures can be corrected such that incorrect operation overall is not produced. The advantages achieved by avoiding the excessively cautious performance margins of the previous approaches, and instead directly observing the failure point as in the present approach, more than compensate for the additional time and power consumed in recovering the system when a failure does occur. Deliberately allowing such processing errors to occur such that critical paths fail to meet their timing requirements is highly counter-intuitive in this technical field where it is normal to take considerable efforts to ensure that all critical paths always do meet their timing requirements.
Furthermore, the invention recognises that random errors in the delayed value may be detected and corrected by error correction logic deployed off the critical path of the data processing operations. Thus, when no systematic processing errors are detected by the comparator, the error correction logic has no adverse impact on the rapid progress of the computation. However, in the event that a processing error is in fact detected by the comparator, the delayed value available for use by the error repair logic to ensure forward progress of the computation is a reliable value on which a random error check and, where appropriate, random-error correction has been performed. Regardless of the presence of the error correction logic in the path of the delayed signal value, when processing errors are detected by the comparator, there will be a delay in the processing due to the need to perform the error-repair operation. Thus there is a surprising synergy between the provision of delayed signal-capture elements that enable repair of deliberately induced systematic processing errors and the application of error correction coding to correct random errors in the delayed signal values. The error correction logic provides the advantage of improving the reliability of the delayed value by detecting and correcting random errors without significantly delaying progress of the computation.
It will be appreciated that the processing operation performed by the processing logic could be a non-trivial processing operation that results in the value of the output signal differing from the value of the input signal, for example where the processing operation is a multiplication operation or a division operation with non-trivial operands. However, according to one preferred arrangement the processing operation performed by the processing logic is an operation for which the processing logic output signal is substantially equal to the processing stage input value when no errors occur in said processing operation.
For example, according to a first preferred arrangement the data processing operation that does not ordinarily change the input value could be a read or write operation performed by a memory circuit. According to an alternative preferred arrangement the at least one processing stage is performed by a register and said processing operation is a read, write or move operation. According to a further alternative preferred arrangement in which the output signal value should be equal to the input signal value, the at least one processing stage is performed by a multiplexer and the processing operation is a multiplexing operation.
Whilst the present technique is applicable to both synchronous and asynchronous data processing circuits, the invention is well-suited to synchronous data processing circuits in which the plurality of processing stages are respective pipeline stages within a synchronous pipeline.
It will be appreciated that a variety of different error correction codes could be used to error correction encode the input value, for example linear block codes, convolutional codes or turbo codes. However, for arrangements in which the output value is substantially equal to the input value of the processing logic, it is preferred that the input value is error correction encoded using a Hamming code and the error correction logic performs said correction and said detection using said Hamming code. Hamming codes are simple to implement and suitable for detecting and correcting single bit errors such as those typically resulting from SEUs.
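As a minimal sketch of how a Hamming code detects and corrects a single bit error of the kind produced by an SEU, the following Python example uses the standard Hamming(7,4) code. The choice of a 4-bit data word and the function names `hamming74_encode` and `hamming74_correct` are assumptions made for illustration; the specification does not fix a particular Hamming code or word width.

```python
from typing import List, Tuple


def hamming74_encode(nibble: int) -> List[int]:
    """Encode a 4-bit value as 7 bits; parity bits sit at positions 1, 2 and 4."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_correct(word: List[int]) -> Tuple[int, int]:
    """Return (corrected 4-bit value, syndrome).

    A syndrome of 0 means no error was seen; otherwise it is the 1-based
    position of the single flipped bit, which is corrected before decoding.
    """
    code = [0] + list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1  # flip the erroneous bit back
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome
```

For example, encoding a value, flipping any one of the seven bits and passing the result to `hamming74_correct` recovers the original value. Two or more flipped bits are beyond what this particular code can correct.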
Although some preferred arrangements involve value-preserving processing operations such as read/write operations and data moving operations, in alternative preferred arrangements the processing operation performed by the processing logic is a value-altering operation for which the processing logic output signal can be different from said processing stage input value even when no errors occur in said processing operation. Thus the present technique is suitable for application to processing logic elements such as adders, multipliers and shifters.
In arrangements where the processing operation is a value-altering processing operation, it is preferred that the input value is error correction encoded using an arithmetic code comprising one of: an AN code, a residue code, an inverse residue code or a residue number code. Such arithmetic codes facilitate detection and correction of random errors in processing operations involving arithmetic operators.
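To illustrate why arithmetic codes suit value-altering operations, the following Python sketch shows an AN code and a residue code surviving an addition. It demonstrates error detection only; correction requires a suitably larger check constant and is not shown. The constant `A = 3`, the modulus `m = 3` and all function names are assumptions chosen for the example.

```python
A = 3  # assumed AN-code constant; any odd A > 1 detects every single bit flip


def an_encode(x: int) -> int:
    return A * x


def an_check(coded: int) -> bool:
    """A valid AN codeword is always a multiple of A."""
    return coded % A == 0


def an_decode(coded: int) -> int:
    return coded // A


def residue_encode(x: int, m: int = 3) -> tuple:
    """Residue code: carry the value together with its residue modulo m."""
    return (x, x % m)


def residue_add(a: tuple, b: tuple, m: int = 3) -> tuple:
    """Add the values and, independently, add the residues modulo m."""
    return (a[0] + b[0], (a[1] + b[1]) % m)


def residue_ok(pair: tuple, m: int = 3) -> bool:
    return pair[0] % m == pair[1]
```

Because A*x + A*y = A*(x + y), the sum of two AN codewords is itself a codeword, so the check can be applied to the adder output directly. A single flipped bit changes the sum by a power of two, which is never a multiple of 3, and so fails `an_check`.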
It will be appreciated that the comparator alone could be relied upon to detect the presence of systematic errors. However, in preferred arrangements the integrated circuit comprises a meta-stability detector operable to detect meta-stability in the non-delayed value and trigger the error-repair logic to suppress the use of the non-delayed value if found to be meta-stable.
Having detected the occurrence of a systematic error, via the comparator, there are a variety of different ways in which this may be corrected or compensated. In one preferred type of embodiment the error-repair logic is operable to replace the non-delayed value with the error-checked delayed value as the processing stage output signal. The replacement of the known defective processing stage output signal with the correct value taken from the error-checked delayed value sample is strongly preferred as it serves to ensure forward progress through the data processing operations even though errors are occurring and require compensation.
A preferred arrangement is one in which the error-repair logic operates to force the delayed value to be stored in the non-delayed latch in place of the non-delayed value.
Whilst the present technique is applicable to both synchronous and asynchronous data processing circuits, the invention is well suited to synchronous data processing circuits in which the processing operations within the processing stages are driven by a non-delayed clock signal.
In the context of systems in which the processing stages are driven by the non-delayed clock signal, the error-repair logic can utilise this to facilitate recovery from an error by gating the non-delayed clock signal to provide sufficient time for the following processing stage to recover from input of the incorrect non-delayed value and instead use the correct error-checked delayed value.
In the context of embodiments using a non-delayed clock signal, the capture times can be derived from predetermined phase points in the non-delayed clock signal and a delayed clock signal derived from the non-delayed clock signal. The delay between the non-delayed capture and the delayed capture can be defined by the phase shift between these two clock signals.
The detection of and recovery from systematic errors can be used in a variety of different situations, but is particularly well suited to situations in which it is wished to dynamically control operating parameters of an integrated circuit in dependence upon the detection of such errors. Counter-intuitively, the present technique can be used to control operating parameters such that the system operates with a non-zero systematic error rate being maintained as the target rate since this may correspond to an improved overall performance, either in terms of speed or power consumption, even taking into account the measures necessary to recover from occurrence of both systematic and random errors.
The operating parameters which may be varied include an operating voltage, an operating frequency, an integrated circuit body bias voltage (which controls threshold levels) and temperature, amongst others.
In order to ensure that the data captured in the delayed latch is always correct, an upper limit on the maximum delay in the processing logic of any stage is such that at no operating point can the delay of the processing logic of any stage exceed the sum of the clock period plus the amount by which the delayed capture is delayed. As a lower limit on any processing delay there is a requirement that the processing logic of any stage should have a processing time exceeding the time by which the delayed capture follows the non-delayed capture so as to ensure that following data propagated along short paths does not inappropriately corrupt the delayed capture value. This can be ensured by padding short paths with one or more delay elements as required.
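The two limits just described can be expressed as a simple check. The following Python sketch assumes that all times are given in the same unit (for example picoseconds); the function and parameter names are hypothetical.

```python
def timing_ok(
    min_logic_delay: int,
    max_logic_delay: int,
    clock_period: int,
    capture_delay: int,
) -> tuple:
    """Return (upper_ok, lower_ok) for one processing stage.

    upper_ok: the slowest path settles no later than the delayed capture,
              i.e. max_logic_delay <= clock_period + capture_delay.
    lower_ok: the fastest path takes longer than the capture delay, so data
              from the following cycle cannot corrupt the delayed value,
              i.e. min_logic_delay > capture_delay.
    """
    upper_ok = max_logic_delay <= clock_period + capture_delay
    lower_ok = min_logic_delay > capture_delay
    return upper_ok, lower_ok
```

With an assumed clock period of 1000 and a capture delay of 300, a stage whose logic delay ranges from 350 to 1250 satisfies both limits. A stage with a shortest path of 200 fails the lower limit and would need its short paths padded with delay elements.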
The present technique is applicable to a wide variety of different types of integrated circuit, such as general digital processing circuits, but is particularly well suited to systems in which the processing stages are part of a data processor or microprocessor.
In order to facilitate the use of control algorithms for controlling the operational parameters, preferred embodiments include an error counter circuit operable to store a count of the detection of errors corresponding to a change in the delayed value compared with the non-delayed value. This error counter may be read by software to carry out control of the operational parameters.
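One possible way for software to use such an error counter is sketched below in Python. This is an assumed, simplified control policy and not an algorithm taken from this specification: the target error rate, the voltage step, the voltage limits and the names `ErrorCounter` and `adjust_voltage` are all hypothetical.

```python
class ErrorCounter:
    """Software model of an error counter circuit (hypothetical interface)."""

    def __init__(self) -> None:
        self.count = 0

    def record(self, error_detected: bool) -> None:
        if error_detected:
            self.count += 1

    def read_and_clear(self) -> int:
        value, self.count = self.count, 0
        return value


def adjust_voltage(
    voltage_mv: int,
    errors: int,
    cycles: int,
    target_rate: float = 0.001,  # assumed non-zero target error rate
    step_mv: int = 10,           # assumed adjustment step
    v_min_mv: int = 800,         # assumed lower voltage limit
    v_max_mv: int = 1200,        # assumed upper voltage limit
) -> int:
    """Raise the voltage when errors exceed the target rate, lower it when below."""
    rate = errors / cycles
    if rate > target_rate:
        voltage_mv += step_mv
    elif rate < target_rate:
        voltage_mv -= step_mv
    return min(max(voltage_mv, v_min_mv), v_max_mv)
```

In use, software would periodically call `read_and_clear`, pass the count and the number of elapsed cycles to `adjust_voltage`, and apply the returned setting, so that the operating voltage settles close to the point at which the target error rate is observed.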
It will be appreciated that the delayed signal-capture element and non-delayed signal-capture element discussed above could have a wide variety of different forms. In particular, these may be considered to include embodiments in the form of flip-flops, D-type latches, sequential elements, memory cells, register elements, sense amps, combinations thereof and a wide variety of other storage devices which are able to store a signal value.
Viewed from another aspect, the present invention provides a method of controlling an integrated circuit for performing data processing, said method comprising the steps of:
supplying a processing stage output signal from at least one processing stage of a plurality of processing stages as a processing stage input signal to a subsequent processing stage, said at least one processing stage operating to:
perform a processing operation with processing logic upon at least one coded input value to generate a processing logic output signal, said coded input value being an input value to which an error correction code has been applied;
capturing a non-delayed value of said processing logic output signal at a non-delayed capture time, said non-delayed value being supplied to said subsequent processing stage as said processing stage output signal following said non-delayed capture time;
capturing a delayed value of said processing logic output signal at a delayed capture time later than said non-delayed capture time;
detecting an occurrence of a random error in said delayed value of said processing logic output signal using error correction logic, determining if said detected random error is correctable using said error correction code and either generating an error-checked delayed value or indicating that said detected random error is not correctable;
comparing said non-delayed value with said error-checked delayed value to detect a change in said processing logic output signal at a time following said non-delayed capture time, said change being indicative of a systematic error whereby said processing logic has not finished said processing operation at said non-delayed capture time or of a random error in said non-delayed value; and
when said change is detected, performing an error-repair operation using error-repair logic suppressing use of said non-delayed value either by replacing said non-delayed value by said error-checked delayed value in subsequent processing stages or by initiating repetition of said processing operation and processing operations of subsequent processing stages if said error correction logic indicates that said detected random error is not correctable.
The above, and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments which is to be read in connection with the accompanying drawings.