This invention relates to a multiprocessor computer system in which interconnected processor modules provide multiprocessing (parallel processing in separate processor modules) and multiprogramming (interleaved processing in one processor module).
This invention relates particularly to a system which can support high transaction rates to large on-line data bases and in which no single component failure can stop or contaminate the operation of the system.
There are many applications which require on-line processing of large volumes of data at high transaction rates. For example, such processing is required in retail applications for automated point of sale, inventory and credit transactions and in financial institutions for automated funds transfer and credit transactions.
In computing applications of this kind it is important, and often critical, that the data processing not be interrupted. A failure of an on-line computer system can shut down a portion of the related business and can cause considerable loss of data and money.
Thus, an on-line system of this kind must provide not only sufficient computing power to permit multiple computations to be done simultaneously, but it must also provide a mode of operation which permits data processing to be continued without interruption in the event some component of the system fails.
The system should operate either in a fail-safe mode (in which no loss of throughput occurs as a result of failure) or in a fail-soft mode (in which some slowdown occurs but full processing capabilities are maintained) in the event of a failure.
Furthermore, the system should also operate in a way such that a failure of a single component cannot contaminate the operation of the system. The system should provide fault-tolerant computing. For fault-tolerant computing all errors and failures in the system should either be corrected automatically, or if the failure or error cannot be corrected automatically, it should be detected, or if it cannot be detected, it should be contained and should not be permitted to contaminate the rest of the system.
Since a single processor module can fail, it is obvious that a system which will operate without interruption in an on-line application must have more than one processor module.
Systems which have more than one processor module can therefore meet one of the necessary conditions for non-interruptible operation. However, the use of more than one processor module in a system does not by itself provide all the sufficient conditions for maintaining the required processing capabilities in the event of component failure, as will become more apparent from the description to follow.
Computing systems for on-line, high volume, transaction oriented, computing applications which must operate without interruption therefore require multiprocessors as a starting point. But the use of multiprocessors does not guarantee that all of the sufficient conditions will be met, and fulfilling the additional sufficient conditions for on-line systems of this kind has presented a number of problems in the prior art.
The prior art approach to uninterrupted data processing has proceeded generally along two lines--either adapting two or more large, monolithic, general purpose computers for joint operation or interconnecting a plurality of minicomputers to provide multiprocessing capabilities.
In the first case, adapting two large monolithic general purpose computers for joint operation, one conventional prior art approach has been to have the two computers share a common memory. Now in this type of multiprocessing system a failure in the shared memory can stop the entire system. Shared memory also presents a number of other problems including sequencing accesses to the common memory. This system, while meeting some of the necessary conditions for uninterruptible processing, does not meet all of the sufficient conditions.
Furthermore, multiprocessing systems using large general purpose computers are quite expensive because each computer is constructed as a monolithic unit in which all components (including the packaging, the cooling system, etc.) must be duplicated each time another processor is added to the system even though many of the duplicated components are not required.
The other prior art approach of using a plurality of minicomputers has (in common with the approach of using large general purpose computers) suffered from the drawback of having to adapt a communications link between computers that were never originally constructed to provide such a link. The required links were, as a result, usually made through the input/output channel. Connections through the input/output channel are necessarily slower than internal transfers within the processor itself, and such interprocessor links have therefore provided relatively slow interprocessor communication.
Furthermore, the interprocessor connections required special adapter cards that added substantially to the cost of the overall system and that introduced the possibility of single component failures which could stop the system. Adding dual interprocessor links and adapter cards to avoid problems of critical single components failures increased the overall system cost even more substantially.
Providing dual links and adapter cards between all processors generally became very cumbersome and quite complex from the standpoint of operation.
Another problem of the prior art arose out of the way in which connections were made to peripheral devices.
If a number of peripheral devices are connected to a single input/output bus of one processor in a multiprocessor system and that processor fails, then the peripheral devices will be unavailable to the system even though the failed processor is linked through an interprocessor connection to another processor or processors in the system.
To avoid this problem, the prior art has provided an input/output bus switch for interconnecting input/output busses for continued access to peripheral devices when a processor associated with the peripheral devices on a particular input/output bus fails. The bus switches have been expensive and also have presented the possibility of single component failure which could down a substantial part of the overall system.
Providing software for the prior art multiprocessor systems has also been a major problem.
Operating systems software for such multiprocessing systems has tended to be nonexistent. Where software had been developed for such multiprocessor systems, it quite often was restricted to a small number of processors and was not adapted for the inclusion of additional processors. In many cases it was necessary either to modify the operating system or to put some of the operating system functions into the user's own program -- an expensive, time-consuming operation.
The prior art lacked a satisfactory standard operating system for linking processors. It also did not provide an operating system for automatically accommodating additional processors in a multiprocessing system constructed to accommodate the modular addition of processors as increased computering power was required.
A primary object of the present invention is to construct a multiprocessor system for on-line, transaction-oriented applications which overcomes the problems of the prior art.
A basic objective of the present invention is to insure that no single failure can stop the system or significantly affect system operation. In this regard, the system of the present invention is constructed so that there is no single component that attaches to everything in the system, either mechanically or electrically.
It is a closely related objective of the present invention to guaranee that every error that happens can be either corrected, detected or prevented from contaminating the system
It is another important objective of the present invention to provide a system architecture and basic mode of operation which free the user from the need to get involved with the system hardware and the protocol of interprocessor communication. In the present invention every major component is modularized so that any major component can be removed or replaced without stopping the system. In addition, the system can be expanded in place (either horizontally by the addition of standard processor modules or in most cases vertically by the addition of peripheral devices) without system interruption or modification to hardware or software.