This invention relates to a computer system architecture to improve processor die yields and system reliability. In particular, it relates to a system architecture providing redundant processors on a single chip, coupled via I/O (input/output) controllers to I/O pin interfaces, with redundant I/O controllers also being provided relative to the number of I/O interfaces.
In current chip designs, multiple processors may be formed on a single die, along with other circuitry, such as on-board cache, I/O logic and connectors at the edge of the chip, and I/O controllers controlling the flow of data between the I/O interfaces and the processors. For instance, in a typical system as shown in FIG. 1, a number of processors (such as the four processors 10–40 shown) are formed on a die, along with crossbar circuitry 60 coupling the processors to input-output controllers (IOCs) 70–90. The IOCs communicate with I/O interfaces 100–150, with one IOC coupled to each pair of I/O interfaces, as shown.
The I/O interfaces 100–150 are connected to pins 160 along an edge 170 of the chip, for coupling to a circuit board, such as the motherboard of a workstation or server.
In such a system, failure of any one of the processors will make the chip useless, since in general the system in which the chip is used will depend upon having all four (in this example) processors available.
Likewise, if a given I/O controller on the chip fails, then a path from a processor to an I/O interface is made unavailable, and the chip becomes worthless. Thus, each processor and each IOC forms part of a critical functional unit, which requires replacement of the chip upon failure.
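By way of illustration only, the effect of every processor and every IOC being critical can be sketched numerically. The sketch below assumes that units fail independently and that each unit has a fixed probability of being functional; neither assumption comes from the description above. The function name `chip_yield` and the sample probabilities are hypothetical.

```python
def chip_yield(p_processor: float, p_ioc: float,
               n_processors: int = 4, n_iocs: int = 3) -> float:
    """Probability that a chip with no redundancy is usable.

    Assumes (for illustration only) that unit failures are independent.
    The chip is usable only if all processors and all IOCs are functional,
    so the individual probabilities multiply.
    """
    return (p_processor ** n_processors) * (p_ioc ** n_iocs)


if __name__ == "__main__":
    # Hypothetical per-unit probabilities of being functional.
    usable = chip_yield(p_processor=0.95, p_ioc=0.98)
    print(f"Fraction of usable chips: {usable:.3f}")
```

With the defaults matching the four processors and three IOCs of FIG. 1, the fraction of usable chips is always lower than the probability that any single unit is functional, since one bad unit out of seven is enough to lose the chip.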
In semiconductor chip manufacture, a certain nonzero failure rate is inevitable, although regular memory arrays are already routinely repaired. As circuit designs become more complex and feature sizes shrink further, the probability of faults in the processed wafers increases, and at the same time the cost of failure for a given wafer or die also increases.
It is therefore becoming more important to develop architectural features that address these factors, and in particular that minimize the high penalties associated with the failure of circuit components on a die, namely the discard of the die and the associated computational and economic loss. For example, in current processors a defect anywhere in the 70–80% of the die area not occupied by reparable memory arrays can result in loss of the die.