1. Field of the Invention
This invention relates to a data processing apparatus which includes a plurality of processing units (CPUs) which perform the same operation and execute processing while continuously comparing the outputs of the processing units with each other to confirm that the processing units are performing the same operation.
2. Description of the Related Art
In recent years, it has become common practice to provide a CPU (processing section) with built-in RAMs, used for example as a cache memory and a TLB (Translation Look-aside Buffer), in order to achieve high-speed processing in a data processing apparatus. However, a RAM fails more frequently than other circuit units constituted from gates. Further, a RAM sometimes suffers a temporary failure (bit inversion error) caused by alpha rays or noise.
Meanwhile, in order to assure a high degree of reliability in a data processing apparatus, it is common practice to provide dual CPUs such that the CPUs perform the same operation and execute processing while continuously comparing the outputs of the CPUs with each other to confirm that they are performing the same operation.
In a data processing apparatus which includes dual CPUs in this manner, if a failure of such a built-in RAM as described above (a soft error; hereinafter sometimes referred to as a built-in RAM error) occurs in only one of the CPUs, the two CPUs naturally operate differently. Consequently, they output different values from their respective output pins, resulting in a synchronism error.
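The comparison described above can be illustrated with a minimal sketch. The names `ToyCpu`, `flip_bit`, and `run_lockstep` are hypothetical and introduced only for illustration; the built-in RAM is modelled as a private dictionary, and a read result stands in for the values driven on the output pins.

```python
class ToyCpu:
    """Hypothetical CPU model whose built-in RAM is a private copy of memory."""

    def __init__(self, memory):
        self.cache = dict(memory)

    def read(self, addr):
        # The returned value stands in for what appears on the output pins.
        return self.cache[addr]

    def flip_bit(self, addr, bit):
        # Simulate a temporary failure (bit inversion) in the built-in RAM.
        self.cache[addr] ^= 1 << bit


def run_lockstep(cpu_a, cpu_b, addresses):
    """Drive both CPUs through the same reads and compare outputs at each step."""
    for step, addr in enumerate(addresses):
        if cpu_a.read(addr) != cpu_b.read(addr):
            return ("synchronism_error", step)
    return ("ok", len(addresses))
```

In this sketch, a single inverted bit in one CPU's copy is enough for `run_lockstep` to report a synchronism error at the first step that reads the affected address.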
Conventionally, several countermeasures against such a synchronism error are available, including one in which the entire system is stopped in order to repair the failed portion by replacing the hardware, and another in which the CPU whose built-in RAM has failed is disconnected and processing is thereafter performed only with the other CPU.
With the former conventional countermeasure, however, the worst case, namely stopping of the system, is invited each time a built-in RAM error occurs, and such errors occur comparatively frequently. Consequently, this countermeasure has a problem in that it is inferior in reliability and availability. With the latter countermeasure, the system is not stopped. However, since the data processing apparatus thereafter operates with only one of the two CPUs, the reliability is degraded accordingly.
While temporary failures of a built-in RAM of a CPU generally occur frequently, the contents of a cache memory can be recovered by reading the correct contents from a main storage unit (MSU) again, and the contents of a TLB can likewise be recovered by performing address conversion again. However, where the cache memory is controlled in accordance with a write-back (store-in) control method, the latest data is sometimes not held in the main storage unit, so the contents of the cache memory sometimes cannot be recovered by the technique described above. Even in that case, the contents of the cache memory can still be recovered by restoring the data using a technique such as ECC (Error Checking and Correction).
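The distinction between the recoverable and non-recoverable cases can be sketched as follows. This is a simplified illustration under stated assumptions: `CacheLine`, `parity`, and `recover_line` are hypothetical names, a single even-parity bit stands in for whatever error detection the built-in RAM actually uses, and the ECC path is not modelled.

```python
def parity(value):
    """Return the even-parity bit of an integer (assumed detection scheme)."""
    return bin(value).count("1") & 1


class CacheLine:
    """Hypothetical cache line holding a value, its parity bit, and a dirty flag."""

    def __init__(self, value, dirty=False):
        self.value = value
        self.parity = parity(value)
        self.dirty = dirty  # True when a write-back cache holds newer data than the MSU


def recover_line(line, addr, main_storage):
    """Try to recover a line by reading it from main storage again."""
    if parity(line.value) == line.parity:
        return "no_error"
    if line.dirty:
        # The MSU copy is stale, so reading it again would lose the latest data.
        return "unrecoverable_by_refetch"
    line.value = main_storage[addr]
    line.parity = parity(line.value)
    return "refetched"
```

A clean line is restored from `main_storage`, whereas a dirty line in a write-back cache is reported as not recoverable by this route, which is the case in which a correcting code such as ECC would be needed.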
Where dual CPUs are provided, however, if one of the CPUs detects a failure of its built-in RAM and starts recovery processing while no built-in RAM error has occurred in the other CPU, the output values from the output pins of the two CPUs become different from each other, and consequently, the system is stopped.
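The divergence caused by recovery processing itself can be illustrated with a toy pin trace. The names `pin_trace` and `compare_traces`, and the `("MSU_READ", addr)` and `("DATA", value)` tuples, are assumptions made for this sketch and do not represent an actual bus protocol.

```python
def pin_trace(addr, main_storage, ram_error):
    """Return the values a toy CPU drives on its output pins for one load."""
    trace = []
    if ram_error:
        # Recovery processing: read the line from main storage again first.
        trace.append(("MSU_READ", addr))
    trace.append(("DATA", main_storage[addr]))
    return trace


def compare_traces(trace_a, trace_b):
    """Return the first cycle at which the pins differ, or None if they never do."""
    for cycle in range(max(len(trace_a), len(trace_b))):
        a = trace_a[cycle] if cycle < len(trace_a) else None
        b = trace_b[cycle] if cycle < len(trace_b) else None
        if a != b:
            return cycle
    return None
```

In this sketch both CPUs end with the same correct data, yet the comparison fails at the first cycle because only the recovering CPU issues the extra read.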
In this manner, a failure of a built-in RAM does not stop the system where a single CPU is provided, since the data can be recovered, but it does stop the system where dual CPUs are provided. In other words, although dual CPUs are provided in order to assure a higher degree of reliability of a data processing apparatus, a built-in RAM error, which occurs comparatively frequently, conversely causes the system to be stopped frequently as a result of the comparison (synchronism checking) of the outputs of the dual CPUs. As a result, the reliability and the availability of the data processing apparatus deteriorate.