This invention relates to an information processing system capable of processing an error that might occur in a main memory. Although the description is mainly directed to an information processing system comprising a plurality of processors, this invention is also applicable to a system comprising a single processor alone.
A conventional information processing system of the type described is disclosed by Wollum et al. in U.S. Pat. No. 3,812,468 and is known as a multiprocessing system comprising a plurality of processing groups. Each of the processing groups comprises a memory module and various kinds of processors, such as communication processors, I/O control units, and a diagnostic logic processor, all of which have access to the memory module. When a malfunction occurs in one of the units of the processing groups, for example in a memory module, the faulty unit is isolated or disconnected from the multiprocessing system while the normal units are left in the system. The system is then reconstructed or reconfigured by the use of the normal units. However, no consideration is given to partial disconnection of the memory module.
Recently, a very high speed computer system (a so-called super computer system) has been developed which can process a great number of data signals, such as vectors, at a high speed. In general, the super computer system is similar in structure to the multiprocessing system mentioned above and comprises a main memory, a plurality of processors, and an access control device between the main memory and the processors. The main memory comprises a plurality of memory units and a common control section operable to control the memory units in common.
With this structure, each of the processors is operable as a request source and can access each of the memory units under control of the access control device through the common control section of the main memory. Such an access operation is carried out by specifying one of the addresses consecutively assigned to the memory units of the main memory. In response, a reply is returned to the request source from the main memory.
When an error occurs in the reply obtained by accessing one of the addresses of the main memory, the request source processes the reply as an erroneous reply and keeps the address in question as a faulty address. The faulty address is included in a faulty one of the memory units. Upon detection of the erroneous reply, the request source retries the instruction under consideration, if it is retriable, so as to access the faulty address again and to recover from the error. When the error is detected again as a result of the retry, the faulty memory unit is disconnected or isolated from the super computer system with reference to the faulty address kept in the request source, and the main memory is reconstructed by the use of the remaining memory units.
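The conventional procedure described above can be sketched as follows. This is only an illustrative model and not an implementation taken from the cited system: the names `MainMemory`, `conventional_access`, and `UNIT_SIZE`, the mapping of a contiguous block of addresses to each memory unit, and the fault model (a transient fault fails once, a permanent fault fails always) are all assumptions made for the example.

```python
from dataclasses import dataclass, field

UNIT_SIZE = 4  # hypothetical number of addresses per memory unit


@dataclass
class MainMemory:
    """Toy main memory whose addresses are assigned consecutively to units."""
    num_units: int
    faulty_addresses: set = field(default_factory=set)     # permanent faults
    transient_addresses: set = field(default_factory=set)  # fail once only
    disconnected: set = field(default_factory=set)         # isolated units

    def unit_of(self, address):
        # Assumed mapping: one contiguous block of UNIT_SIZE addresses per unit.
        return address // UNIT_SIZE

    def access(self, address):
        """Return True for a normal reply and False for an erroneous reply."""
        if not 0 <= address < self.num_units * UNIT_SIZE:
            raise IndexError("address outside the main memory")
        if self.unit_of(address) in self.disconnected:
            raise ValueError("address belongs to a disconnected memory unit")
        if address in self.transient_addresses:
            self.transient_addresses.discard(address)  # recovers on next access
            return False
        return address not in self.faulty_addresses


def conventional_access(memory, address):
    """Request source: access, retry once on error, isolate on a second error."""
    if memory.access(address):
        return "ok"
    faulty_address = address            # kept in the request source
    if memory.access(faulty_address):   # retry of the same access
        return "recovered by retry"
    # The error was detected again: disconnect the unit holding the address.
    memory.disconnected.add(memory.unit_of(faulty_address))
    return "unit disconnected"
```

In this sketch only the single memory unit that contains the faulty address is ever isolated, which reflects the limitation of the conventional procedure.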
It is mentioned here that a fault or malfunction is not always restricted to a single faulty memory unit but often spreads to one or more other memory units. In other words, such a malfunction might extend over a plurality of the memory units. However, a malfunction of a plurality of the memory units has not been detected in the super computer system.
In addition, an error detected by the request source may result from a malfunction of the common control section of the main memory. Under the circumstances, it is preferable to distinguish between a fault or malfunction of the memory unit or units and a fault or malfunction of the common control section. However, no such distinction has been made in the above-mentioned super computer system.
Therefore, such a malfunction of the plurality of the memory units and/or the common control section must be detected at every access operation of each processor even after the reconstruction of the system. This means that an error might result from the malfunction of the plurality of the memory units or of the common control section even after the system is restructured upon detection of an error in a certain memory unit. As a result, invalid or useless access operations have frequently been carried out in each of the processors, which brings about useless reconstruction of the main memory. This is also true of the multiprocessing system mentioned before.
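The problem can be illustrated by a small simulation. This is a hedged sketch under assumed conditions, not a model of any actual machine: the names `access`, `run_conventional`, and `UNIT_SIZE` are hypothetical, each memory unit is assumed to hold a contiguous block of addresses, and a faulty common control section is assumed to corrupt the replies of every memory unit.

```python
UNIT_SIZE = 4  # hypothetical number of addresses per memory unit


def access(address, faulty_units, control_faulty):
    """Return True for a normal reply.

    Assumed fault model: a faulty common control section corrupts the
    reply of every memory unit; a faulty unit corrupts only its own replies.
    """
    return not control_faulty and (address // UNIT_SIZE) not in faulty_units


def run_conventional(addresses, faulty_units, control_faulty):
    """Apply retry-then-disconnect to each access; return disconnected units."""
    disconnected = []
    for address in addresses:
        unit = address // UNIT_SIZE
        if unit in disconnected:
            continue  # unit already removed by an earlier reconstruction
        first = access(address, faulty_units, control_faulty)
        retry = first or access(address, faulty_units, control_faulty)
        if not retry:
            disconnected.append(unit)  # one reconstruction per detected unit
    return disconnected
```

With a faulty common control section and no faulty memory unit, `run_conventional([0, 4, 8, 12], set(), True)` disconnects all four units one after another, each by a separate reconstruction, although none of the units is itself at fault. With two faulty units, the second unit is found only by a further erroneous access after the first reconstruction.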