1. Field
The present disclosure relates to a multi-chip module (MCM) that accommodates semiconductor chips. More specifically, the present disclosure relates to an MCM that provides fault tolerance by using redundant components and semiconductor chips.
2. Related Art
In the next few years, high-performance computing (HPC) systems with petaflops of computing power and petabytes of storage will be replaced with ‘exascale’ systems. With the deployment of exascale systems comprising hundreds of thousands of interconnected processors, orders of magnitude of additional performance are expected. The computational and storage power of a single such system will be equivalent to the collective computing power of the Top-500 supercomputers that currently exist.
One goal for HPC systems is to provide an extremely high level of reliability, availability and serviceability (RAS). In order to achieve high RAS, yearly downtime will need to be minimized. Consequently, fault tolerance and fault management are important considerations in the design of HPC systems. In particular, as the complexity of computer systems grows, achieving high RAS in HPC systems will involve one or more of the following: a scalable architecture for maximum performance and throughput; component, package, and integration-level reliability; interconnect technology with link-level reliability; elimination of single points of failure; fault tolerance; thermal management; and scalable software.
Recently, engineers have proposed using a multi-chip module (MCM) (which is sometimes referred to as a ‘macrochip’) to integrate a collection of semiconductor chips together in an HPC system. An MCM can offer unprecedented computational density, energy efficiency and bisection bandwidth, as well as reduced message latencies. These characteristics can be obtained by photonically interconnecting multiple silicon chips into a logically contiguous piece of silicon. This interconnection technique can be used to integrate various computer-system components, such as multi-core, multi-threaded processors, system-wide interconnects and dense memories. However, the complexity of the MCM and the associated large number of integrated components can give rise to additional failure modes, which can increase failure-in-time (FIT) rates, and can thereby degrade RAS.
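By way of illustration only, the effect of component count on the FIT rate can be sketched with simple arithmetic. One FIT is conventionally one failure per 10^9 device-hours. The sketch below assumes independent components with constant failure rates in a series configuration (any single failure brings down the module), so that the FIT rates add. The function names, component counts and per-component FIT values are hypothetical and are not taken from the present disclosure.

```python
# Illustrative sketch only. All counts and FIT values are hypothetical
# assumptions, not figures from the disclosure.

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def system_fit(components):
    """Return the aggregate FIT rate of a series system.

    `components` is a list of (count, fit_per_component) pairs.
    Assumes independent components with constant failure rates and
    no redundancy, so the individual FIT rates simply add.
    """
    return sum(count * fit for count, fit in components)


def mtbf_hours(fit):
    """Convert a FIT rate into mean time between failures, in hours."""
    return HOURS_PER_FIT_UNIT / fit


# Hypothetical single chip at 50 FIT.
single_chip = [(1, 50.0)]

# Hypothetical MCM: 64 chips at 50 FIT each plus 256 optical links at 5 FIT each.
mcm = [(64, 50.0), (256, 5.0)]

if __name__ == "__main__":
    for name, parts in (("single chip", single_chip), ("MCM", mcm)):
        fit = system_fit(parts)
        print(f"{name}: {fit:.0f} FIT, MTBF = {mtbf_hours(fit):.0f} hours")
```

Under these assumed values, the aggregate FIT rate grows in proportion to the number of integrated components, and the mean time between failures shrinks accordingly, absent redundancy.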
Hence, what is needed is an MCM without the above-described problems.