For a long time, engineers have sought to improve system reliability. Major progress has been made, particularly through more-reliable components. Suitable logical and technological organization of the system provides another way of avoiding errors even if these components become defective.
However effective these means may be, the capacity to mask defects by error correction is limited. This limitation becomes particularly critical as the complexity of the system increases.
To overcome this problem, the idea has arisen of making it possible to replace defective elements of the system.
It will be appreciated that the disturbance caused by such repairs must be kept to a minimum, because it affects the availability of the system.
In addressing the problem of availability, the various elements that make up a data processing system must be considered. A system essentially includes a certain number of units of three types: processors, memory modules, and input-output controllers. Generally, a plurality of processors are provided that communicate via a bus with a plurality of memory modules. The processors can be connected to the bus directly, or via a controller serving as an interface. To enable communication with the outside, the processors are also connected to one or more input-output units. These essential elements are generally accompanied by a maintenance device, typically known as a "service processor", which is used for initializing the system and for maintenance, for example for handling errors detected in the various units.
In a system with multiple elementary processors (that is, a multiprocessor system), the failure of one of the processors does not necessarily cause the immediate interruption of the system. In fact, if the service processor detects persistent errors in a processor, it can logically disconnect the defective processor. As a result, the system can continue to function using the remaining processors, although with some degradation in performance. The maintenance service must later replace the defective processor and effect the logical reconnection of the replacement. These operations are feasible, because current systems are typically designed to be capable of reconfiguration.
In a well-designed system, the failure of a processor and its replacement do not cause any major disturbance visible to the user. In fact, because of redundancy, functioning is not interrupted, and the process being executed in the defective processor at the moment of the failure can be re-executed. The failure of a memory module, by contrast, presents an entirely different problem, because the defective module may contain data that are impossible to reconstruct. The problem may be even more serious if the data relate to the system itself. Even if the operating system is designed so that the contents of the memory are periodically saved in external memories, a module containing the most recently updated data may fail before a save operation of this kind has been executed.
To reduce this risk, current memory modules include a plurality of components, such that each of the bits comprising one technological word is stored in a different component. As a result, failures of the individual bits of a word are equally probable and independent of one another, which enables the use of a self-correcting code of the Hamming type, stored in supplementary components. Thus the failure of one or more components can be detected and corrected.
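By way of illustration only, the principle of a self-correcting code of the Hamming type can be sketched in Python as follows. The sketch assumes a 4-bit data word protected by 3 check bits (a Hamming (7,4) code), with each of the 7 bits imagined as residing in a different component. The word width, the bit layout, and the function names hamming74_encode and hamming74_decode are assumptions chosen for this example and are not taken from the modules described above, which would use wider words and more check bits.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7).

    Positions 1, 2 and 4 hold check bits; positions 3, 5, 6 and 7 hold data.
    In the memory-module picture, each position is a separate component.
    """
    if len(data_bits) != 4:
        raise ValueError("expected exactly 4 data bits")
    code = [0] * 8  # index 0 is unused so that indices match positions
    code[3], code[5], code[6], code[7] = data_bits
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome) after correcting at most one bad bit.

    The syndrome is 0 when no error is seen; otherwise it is the position
    (1..7) of the single bit, and hence the component, assumed faulty.
    """
    if len(codeword) != 7:
        raise ValueError("expected exactly 7 code bits")
    code = [0] + list(codeword)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # flip the bit designated by the syndrome
    return [code[3], code[5], code[6], code[7]], syndrome
```

In this sketch, a word read back with a single erroneous bit is restored and the syndrome identifies the faulty position. Two erroneous bits in the same word exceed what this particular code can correct, which is the situation that arises when component failures accumulate.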
However, over the life of the module, failures may accumulate until it is no longer possible to correct them. The module should therefore be replaced before this limit is reached. Nevertheless, it must be noted that useful data may be stored in that module. To the extent that the operating system permits, one solution may comprise copying the contents of the defective module into an external memory. After replacement of the module, the saved data are reloaded. However, this solution is difficult to implement, particularly in the case where the defective module contains elements of the operating system.
Another solution may comprise transferring the data from the defective module to one or more other modules of the system. However, that method necessitates a reallocation of memory space, which complicates the software that has to manage the address correspondence tables for the memory space involved.
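The software burden mentioned above can be illustrated by the following minimal sketch in Python. It assumes a hypothetical correspondence table that maps logical page numbers to a (module, frame) pair. The class AddressTable, its methods, and the page and frame granularity are all invented for this illustration and do not describe any particular operating system.

```python
class AddressTable:
    """Hypothetical table mapping logical pages to (module, frame) pairs."""

    def __init__(self, module_count, frames_per_module):
        # Free frames still available in each memory module.
        self.free = {m: list(range(frames_per_module))
                     for m in range(module_count)}
        self.table = {}    # logical page -> (module, frame)
        self.storage = {}  # (module, frame) -> stored contents

    def allocate(self, page, module, value):
        """Place a logical page in the first free frame of a module."""
        frame = self.free[module].pop(0)
        self.table[page] = (module, frame)
        self.storage[(module, frame)] = value

    def read(self, page):
        """Read a page through the correspondence table."""
        return self.storage[self.table[page]]

    def evacuate(self, bad_module):
        """Move every page out of bad_module and rewrite its table entry."""
        moved = 0
        for page, (module, frame) in list(self.table.items()):
            if module != bad_module:
                continue
            # Pick the first other module that still has a free frame.
            target = next((m for m, frames in self.free.items()
                           if m != bad_module and frames), None)
            if target is None:
                raise MemoryError("no free frame outside the defective module")
            new_frame = self.free[target].pop(0)
            self.storage[(target, new_frame)] = self.storage.pop((module, frame))
            self.table[page] = (target, new_frame)
            moved += 1
        self.free[bad_module] = []  # take the module out of service
        return moved
```

Even in this simplified form, every table entry that points into the defective module has to be found and rewritten, and free space has to exist elsewhere, which gives an idea of the complication that the reallocation imposes on the software.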